By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), February 4, 2013 12:20 pm
Room: Moderated Discussions
Patrick Chase (patrickjchase.delete@this.gmail.com) on February 4, 2013 10:05 am wrote:
[snip]
> You can mitigate such decode power penalties by using a first-level Icache that contains
> uops instead of instructions. It only works if you're using physical register files (as opposed
> to reservation stations), though, because otherwise the uop size is unmanageable.
Why would µop size be any different based on the form of renaming?
Unfortunately, while a decoded instruction cache reduces decode power from decoding instructions (assuming reuse of decoded instructions), it increases power use for storage. While decode energy use would generally dominate, if a sequence of instructions is used only once in its lifetime in the µop cache, then there may be a net increase in energy use. (It is not clear that current implementations support cache bypassing for such cases--or that the added complexity of providing such would be at all worthwhile.) The computation vs. storage/communication tradeoffs seem to increasingly favor computation, so the benefits of a sparse µop cache may decrease. (This change in tradeoffs would likewise seem to disadvantage classic fixed-size instruction RISCs. [I favor variable length encoding. A good VLE can be only modestly more complex than fixed length, can provide 20+% better code density {with bandwidth benefits and cache size/power/cost benefits--and storage benefits for some embedded systems}, and facilitates instruction set extension. Lower overhead for large constants may also be an advantage for VLE.])
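To make the single-use case concrete, here is a toy break-even calculation with invented per-operation energies (the real numbers are obviously implementation-specific and not from any published design); it only shows that the fill cost has to be amortized over enough re-executions:

#include <stdio.h>

int main(void) {
    /* Invented energy costs in arbitrary units--purely illustrative. */
    double e_decode    = 10.0; /* decode one x86 instruction into uops */
    double e_uop_write =  3.0; /* write those uops into the uop cache */
    double e_uop_read  =  2.0; /* read the uops back out of the uop cache */

    for (int execs = 1; execs <= 6; execs++) {
        /* Legacy path: decode on every execution. */
        double legacy = e_decode * execs;
        /* uop cache path: decode and fill once, then cheap reads for the
           remaining executions (assuming the line is not evicted). */
        double cached = e_decode + e_uop_write + e_uop_read * (execs - 1);
        printf("%d execution(s): legacy=%4.1f  uop cache=%4.1f  (%s)\n",
               execs, legacy, cached,
               cached < legacy ? "cache wins" : "cache loses");
    }
    return 0;
}

With these made-up numbers a line executed only once costs extra energy, but the crossover comes with the first reuse--which is why decode energy would "generally dominate" as stated above.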
Decoded instruction caches (especially µop caches) seem to have significant potential for simplifying some optimizations, so their benefit may increase at the high end.
> Interestingly enough, every Intel OoO CPU with PRFs has had some form of first-level uop cache (P4,
> Sandy Bridge, Ivy Bridge, Haswell). Do you think that perhaps they've figured this one out? :-)
The correlation of PRFs and µop caches might have more to do with increased emphasis on power efficiency than with requirements of PRFs for reasonable-sized µops.
(It has been speculated that eventually an x86 processor will use variable length µops. This is perhaps already somewhat the case in issue queues where longer immediates are stored separately--if one counts the immediate as part of the µop. Similar techniques might be applied to a µop cache. There might also be some opportunity for improved density with minimal further decode complexity beyond extracting longer immediates, but that is less clear to me.)
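As a purely hypothetical sketch of what "immediates stored separately" could look like (field names and widths are invented, not taken from any real implementation), a compact µop might carry either a short inline immediate or an index into a side table of long constants:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical compact uop: common cases carry a small inline immediate;
   a rare long constant is stored once in a side table and referenced by
   index, keeping the uop itself short. */
struct uop {
    uint16_t opcode;
    uint8_t  dst, src1, src2;   /* register identifiers */
    uint8_t  has_long_imm;      /* nonzero: imm is an index into imm_table */
    int16_t  imm;               /* short immediate or long-immediate index */
};

static uint64_t imm_table[64]; /* side storage for full-width constants */

static uint64_t uop_immediate(const struct uop *u)
{
    return u->has_long_imm ? imm_table[u->imm]
                           : (uint64_t)(int64_t)u->imm; /* sign-extend */
}

int main(void) {
    imm_table[0] = 0x0123456789abcdefULL;  /* one shared long constant */
    struct uop a = { .opcode = 1, .dst = 3, .src1 = 4, .src2 = 0,
                     .has_long_imm = 0, .imm = -8 }; /* inline immediate */
    struct uop b = { .opcode = 2, .dst = 5, .src1 = 6, .src2 = 0,
                     .has_long_imm = 1, .imm = 0 };  /* indexed immediate */
    printf("a: %llx  b: %llx\n",
           (unsigned long long)uop_immediate(&a),
           (unsigned long long)uop_immediate(&b));
    return 0;
}

The point of the split is that the µop array stays narrow for the common case while only the minority of long constants pay for full-width storage.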