Shortening the Pipeline
The most striking difference between McKinley and Itanium/Merced is that the newer design can clock 50% faster despite having a shorter basic execution pipeline. The 10 stage Itanium pipeline is shown in Figure 5 along with a hypothetical apportionment of the 7 stage McKinley pipeline.
Figure 5 Itanium and Hypothetical McKinley Execution Pipeline
The first pipe stage of the Itanium is the IP generate, or IPG, stage. The next instruction pointer (IP) is selected from among a variety of branch predictors and branch address calculation logic that resolve in the 1st, 2nd, 3rd, and 4th pipeline stages. This ad hoc arrangement results in branch penalties of 0, 1, 2, or 3 clock cycles for correctly predicted branches. Branches that mispredict incur a 9 cycle penalty.
To shorten the front end of the execution pipeline and reduce branch overhead, the designers of McKinley could employ an autonomous fetch unit driven by extra fields in each instruction cache line that point to the next predicted cache line. This would also remove the overhead of IP selection from the pipeline in the most common case, a successful branch prediction. The Alpha EV6 uses this mechanism to avoid multi cycle penalties for correctly predicted branches. If employed in an IA64 processor, it could eliminate the need for the eight entry instruction bundle queue that occupies Itanium’s third pipeline stage.
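The idea of a fetch unit steered by next-line pointers can be illustrated with a small behavioral sketch. This is not a model of the EV6 or of any IA64 processor: the class names, the `next_line` field, and the "remember the last observed successor" training policy are all assumptions made for illustration.

```python
class LineEntry:
    """One instruction cache line plus its predicted successor (hypothetical layout)."""

    def __init__(self, index, num_lines):
        self.index = index
        # Default guess: fall through to the sequentially next line.
        self.next_line = (index + 1) % num_lines


class FetchUnit:
    """Follows per-line next pointers instead of selecting an IP every cycle."""

    def __init__(self, num_lines):
        self.lines = [LineEntry(i, num_lines) for i in range(num_lines)]
        self.redirects = 0  # how often the stored pointer turned out wrong

    def fetch(self, current, actual_next):
        """Fetch line `current`, follow its pointer, and retrain it if wrong.

        Returns the line index the fetch unit predicted would come next.
        """
        entry = self.lines[current]
        predicted = entry.next_line
        if predicted != actual_next:
            # Assumed policy: overwrite the pointer with the observed successor.
            self.redirects += 1
            entry.next_line = actual_next
        return predicted


def run_trace(unit, trace):
    """Feed a sequence of line indices; return the redirects it caused."""
    before = unit.redirects
    for current, actual_next in zip(trace, trace[1:]):
        unit.fetch(current, actual_next)
    return unit.redirects - before
```

In this toy model, a loop over lines 0, 1, 2 causes a redirect only the first time the backward branch is seen. After that the stored pointers steer fetch around the loop with no IP selection step, which is the common case the text describes.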
A likely large opportunity to improve on the Itanium microarchitecture lies in the implementation of the complicated register renaming scheme that IA64 employs. IA64 divides the integer register file, gr0 through gr127, into static registers, gr0 through gr31, and stacked registers, gr32 through gr127. The latter are accessed using an offset that is adjusted during subroutine entry and exit to provide an overlapping window for parameter passing. While SPARC uses a fixed size overlap increment and manages window overflow and underflow using software traps, the IA64 register stack scheme permits variable size region allocation, and register file overflow and underflow are handled by an autonomous piece of hardware called the register stack engine (RSE). IA64 also permits a variable size portion of the stacked registers to be rotated in response to certain types of branches. A direct implementation of integer GPR renaming can be built using a straightforward arrangement of 7 bit comparators, adders, and multiplexors. Given the design approach taken elsewhere within Itanium, it wouldn’t surprise me if it used a direct implementation. There are faster and more clever ways to implement IA64 register renaming, and it is likely that McKinley uses one of them.
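To make the "direct implementation" concrete, here is a minimal sketch of the compare, add, and select steps it implies. It assumes 96 physical stacked registers, and the parameter names `bof`, `sor`, and `rrb` are illustrative labels, not taken from this article. It is a software analogy of the arithmetic only, and says nothing about how Itanium or McKinley actually wires it.

```python
NUM_STATIC = 32    # gr0..gr31 are static and never renamed
NUM_STACKED = 96   # gr32..gr127 are stacked (assumed physical size)


def rename_gr(vreg, bof, sor, rrb):
    """Map a 7 bit virtual GPR number to a physical register number.

    vreg: virtual register number, 0..127
    bof:  offset of the current frame within the stacked registers (assumed name)
    sor:  number of registers in the rotating region, 0 if none (assumed name)
    rrb:  current rotation amount, 0..sor-1 when sor > 0 (assumed name)
    """
    if not 0 <= vreg < NUM_STATIC + NUM_STACKED:
        raise ValueError("register number must fit in 7 bits")

    # Comparator + multiplexor: static registers bypass renaming entirely.
    if vreg < NUM_STATIC:
        return vreg

    offset = vreg - NUM_STATIC

    # Comparator + adder: registers inside the rotating region are shifted
    # by the rotation amount, wrapping within that region.
    if offset < sor:
        offset = (offset + rrb) % sor

    # Adder: apply the frame offset, wrapping around the stacked registers.
    return NUM_STATIC + (offset + bof) % NUM_STACKED
```

Each register specifier in an instruction needs its own copy of this logic, and the two additions are performed in series. That serial dependence is one plausible reason a direct implementation would sit on the critical path of a short pipeline.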