By: Patrick Chase (patrickjchase.delete@this.gmail.com), February 4, 2013 12:48 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on February 4, 2013 12:20 pm wrote:
> Patrick Chase (patrickjchase.delete@this.gmail.com) on February 4, 2013 10:05 am wrote:
> [snip]
> > You can mitigate such decode power penalties by using a first-level Icache that contains
> > uops instead of instructions. It only works if you're using physical register files (as opposed
> > to reservation stations), though, because otherwise the uop size is unmanageable.
>
> Why would µop size be any different based on the form of renaming?
In a classic reservation-station-based OoO design, the uop (the thing that is sent from the issue unit to the RS after renaming) contains, for each input operand, either:
1. The literal value of the operand, if it is available at the time of issue
2. The identity of the reservation station that will produce the operand, if it is unavailable at the time of issue. The reservation station uses this information to "capture" the operand when it is subsequently broadcast on the common result bus (and that's why the result bus is a power-intensive part of a Tomasulo machine).
That makes for a rather large uop, since each source field has to be wide enough to hold a full operand value. In contrast, in the PRF style the uop contains only the ID of the physical register for each operand. I don't know the uop sizes for P4 or "the bridges", but I'd bet they're considerably smaller than the original P6's reported 118 bits.
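To put rough numbers on the size argument, here is a back-of-the-envelope sketch. Every field width in it (16 control bits, 2 sources, 32-bit operand values, 6-bit RS tags, 7-bit physical register IDs) is an assumption picked for illustration, not the real layout of P6 or any other core, and the function names are hypothetical.

```python
def rs_uop_bits(control_bits, num_sources, value_bits, tag_bits):
    """Rough size of a post-rename uop in a reservation-station design.

    Each source field holds either a literal value or a producer tag, so it
    must be as wide as the larger of the two, plus one bit saying which.
    A destination tag is added so the result can be identified on the bus.
    """
    per_source = max(value_bits, tag_bits) + 1
    return control_bits + num_sources * per_source + tag_bits


def prf_uop_bits(control_bits, num_sources, preg_bits):
    """Rough size of a post-rename uop in a physical-register-file design.

    Sources and destination are all just physical register IDs.
    """
    return control_bits + (num_sources + 1) * preg_bits


# Hypothetical widths, chosen only to show the trend.
rs_size = rs_uop_bits(control_bits=16, num_sources=2, value_bits=32, tag_bits=6)
prf_size = prf_uop_bits(control_bits=16, num_sources=2, preg_bits=7)
print(rs_size, prf_size)
```

The point is only the scaling: the RS-style uop grows with the operand width, while the PRF-style uop grows with the log of the register file size.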
Even in the PRF case there has to be some "fixup" of the uop coming out of the I-cache (for example, the physical register IDs must change based on the state of the IRAT), so it's not possible to cache "pure" uops. It would presumably be possible to define an intermediate "predecoded" format that could be used in an RS-based design. For reasons I don't know, Intel never did so, though. They've very consistently used uop caches in their PRF designs but not in their RS designs.
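A minimal sketch of that "fixup", assuming the cached uop carries architectural register numbers and a rename step maps them to physical IDs on every fetch. The class names, the simple free list, and the register counts are all hypothetical; this is not how any particular Intel design does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedUop:
    op: str
    dst: int        # architectural destination register
    srcs: tuple     # architectural source registers


@dataclass(frozen=True)
class RenamedUop:
    op: str
    dst: int        # physical destination register
    srcs: tuple     # physical source registers


class Renamer:
    """Toy rename stage: a RAT plus a free list (assumed structure)."""

    def __init__(self, num_arch, num_phys):
        # Initially architectural register i lives in physical register i.
        self.rat = list(range(num_arch))
        self.free = list(range(num_arch, num_phys))

    def rename(self, uop):
        # Sources are looked up before the destination mapping is updated.
        srcs = tuple(self.rat[a] for a in uop.srcs)
        # No register reclamation here, so this raises IndexError once the
        # free list runs out; a real design frees registers at retirement.
        dst = self.free.pop(0)
        self.rat[uop.dst] = dst
        return RenamedUop(uop.op, dst, srcs)


renamer = Renamer(num_arch=4, num_phys=8)
cached = CachedUop("add", dst=1, srcs=(1, 2))   # r1 = r1 + r2
first = renamer.rename(cached)
second = renamer.rename(cached)
print(first)
print(second)
```

The same cached entry comes out with different physical IDs each time it is fetched, which is why what sits in the cache can't be the final, fully renamed uop.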
-- Patrick