By: anon (anon.delete@this.anon.com), August 11, 2014 4:20 am
Room: Moderated Discussions
Exophase (exophase.delete@this.gmail.com) on August 10, 2014 10:35 pm wrote:
> anon (anon.delete@this.anon.com) on August 10, 2014 9:50 pm wrote:
> > Yonah - decoded loop buffer far earlier than others. Most x86 designs now include something like this,
> > including ones from AMD. Relatively few other ISA implementations seem to. ARM A15 was the first ARM
> > to have a similar feature, but that's not even a highly regarded CPU in terms of efficiency, as far as
> > ARM's track record goes. A12 and seemingly A17, which are the power-efficient range, do not.
> >
>
> Actually, Cortex-A9 had a loop buffer; it's called "fast loop mode" and it can buffer
> loops of up to 64 bytes (so 16 ARM instructions or 16-32 Thumb-2 instructions):
>
> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/Chddjech.html
>
> A12 and A17 are basically iterations of A9, so as you would expect they too have a loop buffer:
>
> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0535b/BABDAHCE.html
>
> But I can't find an exact size in there (and given that they went from
> A9 to A17, it looks like ARM is basically scrubbing the A12 altogether)
>
> There's some more mention of other implementations of loop buffers here:
>
> http://pharm.ece.wisc.edu/papers/mitchell_hayenga_thesis.pdf
>
> The loop buffers in Pentium-M and Core 2 were pre-decode (they stored x86 instructions);
> the change to post-decode only came with Nehalem. I think you'll find that loop
> buffers are actually pretty common in anything power sensitive these days.
Thanks for the correction. Clearly I had assumed they were post-decode, which is what A15's loop buffer is.
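
To make the distinction concrete, here's a rough sketch in C of what the two styles of buffer actually hold. Purely illustrative: the struct and field names are mine, and the 28-entry figure is just what I recall being quoted for Nehalem's loop stream detector.

#include <stdint.h>

/* Pre-decode style (Pentium-M / Core 2): raw instruction bytes are kept, so
 * fetch is skipped on replay but the decoders still run every iteration. */
struct predecode_loop_buffer {
    uint8_t  bytes[64];    /* raw instruction bytes                        */
    uint32_t start_pc;     /* address of the first instruction in the loop */
    uint32_t length;       /* number of valid bytes                        */
};

/* Post-decode style (Nehalem / Cortex-A15): already-decoded micro-ops are
 * kept, so both fetch and decode sit idle while the loop replays. */
struct decoded_uop {
    uint8_t opcode, dst, src1, src2;
    int32_t imm;
};

struct postdecode_loop_buffer {
    struct decoded_uop uops[28];   /* illustrative size only */
    uint32_t start_pc;
    uint32_t count;
};
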
>
> The reason why we started seeing them was because of an increased effort to save power, not really
> as a performance optimization.
Much the same thing, these days.
> It's more efficient to look up into a linear contiguous buffer than
> a cache, and since you want a prefetch buffer anyway you can already repurpose it as a pre-decode
> loop buffer. Or you can sort of repurpose a ROB or similar into a post-decode loop buffer (AFAIK
> Silvermont does this). But CPUs will spend a ton of time outside of even a large loop buffer, so
> you really can't rely on it delivering higher performance than the traditional fetch/decode path.
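
To picture that capture/replay/fall-back behaviour, here's a toy front-end model in C. The buffer size and the per-instruction "costs" are completely made up; it's only meant to show the control flow, not to describe any real design.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model of a loop buffer in a CPU front end. Everything here is
 * illustrative: the entry count and the per-instruction "costs" are made up,
 * and real hardware does the capture in the fetch/branch-prediction logic. */

#define LOOP_BUF_ENTRIES  32
#define FETCH_DECODE_COST 4   /* arbitrary energy units via the normal path */
#define REPLAY_COST       1   /* arbitrary energy units when replaying      */

struct frontend {
    uint32_t buf_start, buf_len;  /* PC range currently held in the buffer */
    bool     replaying;
};

/* Issue one instruction: replay it from the buffer if we're still inside the
 * captured loop, otherwise fall back to the normal fetch/decode path. */
static uint32_t issue(struct frontend *fe, uint32_t pc)
{
    if (fe->replaying && pc >= fe->buf_start && pc < fe->buf_start + fe->buf_len)
        return REPLAY_COST;        /* fetch and decode stay idle */
    fe->replaying = false;         /* left the loop: back to the normal path */
    return FETCH_DECODE_COST;
}

/* Called on a short backward branch: if the loop body fits, capture it and
 * start replaying from the buffer (assumes fixed 4-byte instructions). */
static void on_backward_branch(struct frontend *fe, uint32_t target, uint32_t branch_pc)
{
    uint32_t body_bytes = branch_pc - target + 4;
    if (body_bytes / 4 <= LOOP_BUF_ENTRIES) {
        fe->buf_start = target;
        fe->buf_len   = body_bytes;
        fe->replaying = true;
    }
}

int main(void)
{
    struct frontend fe = {0};
    uint32_t energy = 0;

    /* 1000 iterations of an 8-instruction loop starting at PC 0x100 */
    on_backward_branch(&fe, 0x100, 0x11c);
    for (int i = 0; i < 1000; i++)
        for (uint32_t pc = 0x100; pc <= 0x11c; pc += 4)
            energy += issue(&fe, pc);

    printf("front-end energy (arbitrary units): %u\n", (unsigned)energy);
    return 0;
}

The made-up numbers only illustrate that the savings scale with the fraction of instructions that actually replay from the buffer, which is why it's a power win rather than a reliable performance win.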