By: Ricardo B (ricardo.b.delete@this.xxxxx.xx), August 11, 2014 4:04 am
Room: Moderated Discussions
Aaron Spink (aaronspink.delete@this.notearthlink.net) on August 11, 2014 12:33 am wrote:
> Post decode caches/loop caches have been proposed and evaluated for a variety
> of architectures and shown performance improvement in all of them, IIRC. Its
> a general mechanism to decouple fetch/decode from dispatch/execute.
I think those are different things.
A small µOP buffer/loop cache, with a few tens of entries, is one thing.
It decouples F/D from D/E and/or saves L1I$ accesses on tight loops, saving some power and improving performance a bit.
We find practical examples of this in Intel's Nehalem and Sandy Bridge onward, ARM's Cortex-A15, and probably others.
But the large µOP cache found in Sandy Bridge onward doesn't serve this purpose.
It's clearly there to bypass the power cost and restrictions of x86 decoding.
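To illustrate the kind of code a small loop buffer targets, here's a minimal sketch (the function is purely illustrative, not taken from any real workload): a tight loop whose body compiles to only a handful of µOPs, few enough that a buffer of a few tens of entries can replay them without re-fetching and re-decoding from the L1I$ on every iteration.

/* Illustrative only: the body compiles to roughly load + add +
   increment + compare + branch, i.e. a handful of µOPs, small
   enough to be served from a loop buffer rather than the L1I$
   and decoders on each iteration. */
long sum_array(const long *a, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

A large µOP cache, by contrast, holds far more than one such loop; its job is to skip x86 decode for much bigger instruction footprints.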