By: Exophase (exophase.delete@this.gmail.com), August 10, 2014 9:35 pm
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on August 10, 2014 9:50 pm wrote:
> Yonah - decoded loop buffer far earlier than others. Most x86 designs now include something like this,
> including ones from AMD. Relatively few other ISA implementations seem to. ARM A15 was the first ARM
> to have similar feature, but that's not even a highly regarded CPU in terms of efficiency, as far as
> ARM's track record goes. A12 and seemingly A17, which are the power efficient range, do not.
>
Actually, Cortex-A9 had a loop buffer. It's called "fast loop mode" and it can buffer loops up to 64 bytes in size (so 16 ARM instructions or 16-32 Thumb-2 instructions):
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/Chddjech.html
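To spell out the arithmetic behind those instruction counts, here is a quick sketch. The only inputs are the 64-byte figure above and the standard encoding sizes (4 bytes for ARM, 2 or 4 bytes for Thumb-2); the function name is just something I made up for illustration.

```python
LOOP_BUFFER_BYTES = 64  # fast loop mode capacity quoted above


def instruction_capacity(buffer_bytes, min_insn_bytes, max_insn_bytes):
    """Return (fewest, most) instructions that fit in the buffer.

    The fewest fit when every instruction uses the largest encoding,
    the most when every instruction uses the smallest encoding.
    """
    return buffer_bytes // max_insn_bytes, buffer_bytes // min_insn_bytes


arm = instruction_capacity(LOOP_BUFFER_BYTES, 4, 4)     # fixed 4-byte encoding
thumb2 = instruction_capacity(LOOP_BUFFER_BYTES, 2, 4)  # mixed 2/4-byte encoding
print("ARM:", arm, "Thumb-2:", thumb2)
```

So how many Thumb-2 instructions fit depends entirely on the mix of 16-bit and 32-bit encodings in the loop body.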
A12 and A17 are basically iterations of A9, so as you would expect they too have a loop buffer:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0535b/BABDAHCE.html
But I can't find an exact size in there (and given that they went from A9 to A17, it looks like ARM is basically scrubbing A12 altogether).
There's some more mention of other implementations of loop buffers here:
http://pharm.ece.wisc.edu/papers/mitchell_hayenga_thesis.pdf
The loop buffers in Pentium-M and Core 2 were pre-decode (they stored x86 instructions); the change to post-decode only came with Nehalem. I think you'll find that loop buffers are actually pretty common in anything power sensitive these days.
The reason we started seeing them was an increased effort to save power, not really performance optimization. It's more efficient to look up a linear contiguous buffer than a cache, and since you want a prefetch buffer anyway you can repurpose it as a pre-decode loop buffer. Or you can sort of repurpose a ROB or similar structure into a post-decode loop buffer (AFAIK Silvermont does this). But CPUs will spend a ton of time running code that falls outside even a large loop buffer, so you really can't rely on it delivering higher performance than the traditional fetch/decode path.
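To make the power argument a bit more concrete, here's a toy fetch model. It is not how any real core works: the fixed 4-byte instruction size, the capture rule (a backward branch spanning no more than the buffer size), and all the names are my own assumptions. It only counts how many fetches would hit the instruction cache versus a small loop buffer.

```python
def simulate_fetch(pcs, buffer_bytes=64, insn_bytes=4):
    """Count instruction fetches served by the cache vs. a loop buffer.

    pcs is a list of instruction addresses in execution order.
    Returns (cache_fetches, buffer_hits).
    """
    cache_fetches = 0
    buffer_hits = 0
    loop_lo = loop_hi = None  # address range currently held in the buffer
    prev = None
    for pc in pcs:
        if loop_lo is not None and loop_lo <= pc <= loop_hi:
            buffer_hits += 1  # served from the buffer, cache stays idle
        else:
            loop_lo = loop_hi = None  # left the loop, back to the cache
            cache_fetches += 1
            # Backward branch whose body fits in the buffer: capture it.
            if (prev is not None and pc <= prev
                    and prev + insn_bytes - pc <= buffer_bytes):
                loop_lo, loop_hi = pc, prev
        prev = pc
    return cache_fetches, buffer_hits


# 4-instruction loop (16 bytes), 10 iterations, then one instruction after it.
small_loop = [0x100, 0x104, 0x108, 0x10C] * 10 + [0x110]
# 32-instruction loop (128 bytes), 10 iterations: too big for a 64-byte buffer.
big_loop = [0x100 + 4 * i for i in range(32)] * 10

print(simulate_fetch(small_loop))
print(simulate_fetch(big_loop))
```

In this model the small loop takes almost all of its fetches from the buffer while the big loop never touches it, and the number of instructions fetched is the same either way. That's the point: the saving is in cache accesses, not in instructions delivered.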