Firing Well on Both Cylinders
The hypothetical execution model I propose for McKinley is shown in Figure 3. Like Itanium, this model is a dual-bundle-issue IA-64 processor. It differs from Itanium mainly in the addition of two extra M-units (M2 and M3) and two extra I-units (I2 and I3), and in the resulting simplification of the instruction dispersal network.
Figure 3 Hypothetical McKinley Execution Model
The most immediate impact of these changes is two-fold. The primary effect of the extra execution units is to reduce the occurrence of split issue of bundle pairs due to functional unit oversubscription. This is clearly shown in Table 4.
Table 4 Issue Capability of Hypothetical McKinley Execution Model
The fraction of bundle pairs that can be fully dual issued (indicated by an entry of “6”) rises from 16% to 27% for Itanium to 90% for the hypothetical McKinley execution model. (Keep in mind that the occurrence distribution of bundle pair combinations in actual code will be non-uniform and will vary tremendously from program to program.)
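The caveat about non-uniform distributions can be illustrated with a small sketch. The ten bundle pair formats, their names, and the occurrence weights below are placeholders invented for illustration, not values from Table 4; the only figure borrowed from the text is that 90% of formats can fully dual issue.

```python
def dual_issue_fraction(can_dual_issue, weights):
    """Occurrence-weighted fraction of bundle pairs that issue all six instructions."""
    total = sum(weights.values())
    dual = sum(w for pair, w in weights.items() if can_dual_issue[pair])
    return dual / total

# Ten placeholder bundle pair formats (hypothetical names); nine can fully
# dual issue, matching the 90% figure under a uniform distribution.
can_dual_issue = {f"pair_{i}": i != 9 for i in range(10)}

# Uniform occurrence: every format appears equally often.
uniform = {pair: 1 for pair in can_dual_issue}
# Hypothetical skew: the one split-issue format appears six times as often.
skewed = dict(uniform, pair_9=6)

uniform_fraction = dual_issue_fraction(can_dual_issue, uniform)
skewed_fraction = dual_issue_fraction(can_dual_issue, skewed)
print(f"uniform: {uniform_fraction:.0%}, skewed: {skewed_fraction:.0%}")
```

With these made-up weights the dual-issue fraction drops from 90% to 60%, which is why the uniform-distribution figure should be read as a rough guide rather than a prediction for any particular program.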
If the Itanium averages 25% dual issue and the hypothetical McKinley averages 90% dual issue when execution is not stalled, then the average issue IPC improves by about 48% (5.70 vs. 3.85), assuming a uniform distribution of bundle pair formats.
The greater instruction issue efficiency plus the improved cache hierarchy can explain McKinley’s dramatically improved performance on integer applications. But how does my model explain a doubling in FP performance from a 50% increase in clock rate when it still has only two F-units? The key is to remember that even though the Itanium offers competitive FP performance, it is a highly bandwidth-constrained design. The potential tripling of memory bandwidth supported by the McKinley system interface can easily explain the balance of the increase in FP performance beyond simple clock rate scaling.
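The arithmetic behind these two claims can be checked with a short sketch. The constants are the figures quoted in the text; the variable and function names are my own, and the "residual" factor simply divides the claimed FP speedup by the clock speedup, which assumes the two effects multiply.

```python
# Average instructions issued per unstalled cycle, as quoted in the text
# (uniform bundle pair format distribution assumed).
ITANIUM_ISSUE_IPC = 3.85
MCKINLEY_ISSUE_IPC = 5.70  # hypothetical McKinley execution model

def relative_gain(new, old):
    """Fractional improvement of `new` over `old`."""
    return new / old - 1.0

issue_gain = relative_gain(MCKINLEY_ISSUE_IPC, ITANIUM_ISSUE_IPC)

# FP claim from the text: performance doubles while the clock rises by 50%.
FP_SPEEDUP = 2.0
CLOCK_SPEEDUP = 1.5
# Portion of the FP speedup left to explain after clock scaling
# (assumes the contributions are multiplicative).
residual_fp_factor = FP_SPEEDUP / CLOCK_SPEEDUP

print(f"issue IPC gain: {issue_gain:.0%}")
print(f"FP speedup beyond clock scaling: {residual_fp_factor:.2f}x")
```

The residual factor of roughly 1.33x is the part of the FP improvement that the text attributes to the higher memory bandwidth rather than to additional F-units.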