More Functional Units and Faster Clock?
The central mystery surrounding McKinley is how Intel and HP could start with the Itanium design, add more functional units, shorten the basic execution pipeline by 30%, and yet achieve a 50% higher clock frequency in the same 0.18 um process. One important factor is that the additional functional units I propose in my McKinley model simplify the process of dispersing instructions from the instruction fetch unit to the functional units. Figure 1 shows that in Itanium an individual instruction slot channel can connect to up to five possible functional unit feed points, while an individual functional unit can be fed from up to four possible slot channels. That is to say, the Itanium instruction dispersal network has a maximum slot fan-out of five and a maximum functional unit fan-in of four. In contrast, the hypothetical McKinley model shown in Figure 3 has an instruction dispersal network with a maximum slot fan-out of four and a maximum functional unit fan-in of two (one for non-F-units). Given proper chip floor-planning, it may be possible to design the hypothetical McKinley instruction dispersal network to operate significantly faster than that of Itanium in the same process technology.
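Fan-out and fan-in figures like these can be tallied mechanically from a dispersal network's connectivity description. A minimal sketch of that bookkeeping; note that the small example network below is purely illustrative and is not the actual Itanium or McKinley topology:

```python
def dispersal_stats(connects):
    """Given a dict mapping each instruction slot channel to the set
    of functional unit feed points it can drive, return the maximum
    slot fan-out and the maximum functional-unit fan-in."""
    fan_out = max(len(units) for units in connects.values())
    fan_in = {}
    for units in connects.values():
        for unit in units:
            fan_in[unit] = fan_in.get(unit, 0) + 1
    return fan_out, max(fan_in.values())

# Illustrative network only: 3 slot channels feeding 4 functional
# units (NOT the real Itanium network, which is shown in Figure 1).
example = {
    "slot0": {"M0", "I0"},
    "slot1": {"M0", "I0", "I1"},
    "slot2": {"I1", "B0"},
}
print(dispersal_stats(example))  # (3, 2)
```

The same function applied to the real Figure 1 and Figure 3 connectivity would yield the (5, 4) and (4, 2) figures cited above.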
Where the extra functional units in McKinley could really hurt clock frequency is the associated requirement to increase the number of read and write ports in the IA64 integer general purpose register file. This is particularly true for the two extra M-units, each of which requires two read ports and two write ports (IA64 load and store instructions optionally modify the base register). The Itanium integer GPR file has 8 read and 6 write ports, and register access was spread over three of the ten stages in the basic execution pipeline. At first glance, it appears that supporting twice the number of integer and memory units would require an integer GPR file with 16 read and 12 write ports. This requirement can be reduced architecturally by the realization that the four I-units and four M-units in the hypothetical McKinley execution model can never all execute in parallel. The worst-case scenario is the dual issue and execution of two MMI bundles, which requires 12 read and 10 write ports. Achieving this savings requires that GPR ports be assigned to instruction slots rather than to functional units.
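The port arithmetic above can be checked directly from the per-unit requirements stated in the text (2 reads and 2 writes per M-unit, and the usual 2 reads and 1 write per I-unit). A quick sketch:

```python
# GPR ports per functional unit type, as argued above:
# M-unit: 2 reads, 2 writes (load target plus optional base update)
# I-unit: 2 reads (two source operands), 1 write (result)
PORTS = {"M": (2, 2), "I": (2, 1)}

def gpr_ports(units):
    """Sum the read/write port requirements for a mix of
    simultaneously executing functional units."""
    reads = sum(PORTS[u][0] for u in units)
    writes = sum(PORTS[u][1] for u in units)
    return reads, writes

# Naive sizing: all four M-units and all four I-units active at once.
print(gpr_ports("MMMMIIII"))  # (16, 12)

# Actual worst case: dual issue of two MMI bundles = 4 M-ops + 2 I-ops.
print(gpr_ports("MMMMII"))    # (12, 10)
```

This reproduces both the naive 16 read / 12 write estimate and the 12 read / 10 write worst case for dual MMI issue.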
A more radical design approach would be to move away from a non-blocking model for the GPR file and exploit the fact that it is quite rare for all GPR ports to be required for any given dual bundle issue. Having somewhat fewer ports, and a means for functional units to arbitrate for them, means that a simpler GPR design can be used at the cost of a small statistical loss in average performance from the infrequent cases where the ports are oversubscribed and extra stall cycles are incurred. But in general, the small performance benefit from this minor simplification of the register file is rarely worth the increased complexity in processor control logic and register file address paths.
Register file performance can also be improved through the realization that register file access time is usually dominated by the number of read ports (an individual register can be written from only one write port in a given cycle but may be read from all read ports simultaneously in the worst case). This fact can be exploited by realizing a logical 12 read and 10 write port GPR file as two separate physical 6 read and 10 write port GPR files that are always updated in lockstep. Although the duplication of a highly ported 128 x 65b register file sounds wasteful, it represents a relatively small amount of die area in a processor on the scale of McKinley. In fact, duplicated register files can actually be helpful in the layout of highly parallel data paths by permitting functional unit clustering, as was done in the Alpha EV6 processor core. Finally, it must be pointed out that the transistor level design of a processor element as basic as a register file is subject to continuous refinement and improvement as new process technologies present both new problems and new opportunities to circuit designers.
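The lockstep-duplication scheme can be sketched behaviorally: every write updates both physical copies, while the twelve logical read ports are split six-and-six between them so that neither copy ever sees more than six reads per cycle. A minimal model (the port counts are from the text; the rest is illustrative):

```python
class DuplicatedGPR:
    """Logical 12R/10W register file built from two physical 6R/10W
    copies kept in lockstep; 128 registers, per the IA64 integer GPR
    file described above (65b entries modeled as plain integers)."""
    def __init__(self):
        self.copy_a = [0] * 128
        self.copy_b = [0] * 128

    def write(self, writes):
        # Up to 10 (reg, value) pairs per cycle; every write updates
        # both copies, so they can never diverge.
        assert len(writes) <= 10
        for reg, val in writes:
            self.copy_a[reg] = val
            self.copy_b[reg] = val

    def read(self, regs):
        # Up to 12 reads per cycle; the first six are serviced by
        # copy A and the rest by copy B, so each physical file
        # needs only six read ports.
        assert len(regs) <= 12
        return [self.copy_a[r] if i < 6 else self.copy_b[r]
                for i, r in enumerate(regs)]

gpr = DuplicatedGPR()
gpr.write([(5, 0xDEAD), (6, 0xBEEF)])
print(gpr.read([5, 6] * 6))  # twelve reads, split across both copies
```

The point of the split is that read access time, the dominant term, now scales with six ports rather than twelve.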
The doubling of M-units to four in my hypothetical McKinley design also poses a problem for the data cache. Although SRAM can be multi-ported at the memory cell level, as with register file elements this causes performance problems as well as a significant decrease in bit density. Alternative methods of implementing multiple access ports in a data cache include double pumping the SRAM arrays (Alpha EV6) and duplicating the SRAM arrays (Alpha EV5). Pseudo multi-porting can be achieved through the use of multiple access paths into a multi-banked cache SRAM (AMD K7). The four ported data cache that my hypothetical McKinley model requires would likely be best accomplished either by quad porting at the cell level or with duplicated cache arrays dual ported at the cell level.
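The K7-style banked approach amounts to a per-cycle conflict check: simultaneous accesses succeed only when they land in different SRAM banks, and colliders must be replayed. A hedged sketch of that arbitration (the bank count and line size below are illustrative parameters, not disclosed McKinley values):

```python
def bank_conflicts(addresses, n_banks=8, line_bytes=64):
    """Pseudo multi-porting via banking (AMD K7 style): map each
    access to a bank by its cache-line index and report which
    accesses are served this cycle and which collide and stall."""
    served, stalled = [], []
    busy = set()
    for addr in addresses:
        bank = (addr // line_bytes) % n_banks
        if bank in busy:
            stalled.append(addr)   # bank conflict: replay next cycle
        else:
            busy.add(bank)
            served.append(addr)
    return served, stalled

# Four simultaneous M-unit accesses; the last one maps to the same
# bank as the first and must stall.
print(bank_conflicts([0x0000, 0x0040, 0x0080, 0x2000]))
```

The appeal is that each bank remains a simple single-ported array; the cost is the occasional conflict stall, which is why the text leans toward true quad porting or duplicated dual-ported arrays for McKinley.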
One rumor surrounding McKinley is that its data cache supports a one cycle load-use latency. If you consider the contortions the Pentium 4 designers had to go through to get even an 8 KB cache to operate with a two cycle load-use latency at GHz clock rates in a 0.18 um process, one cycle latency seems unlikely. Although the task is eased by the fact that McKinley's maximum clock rate is about 60% of the Pentium 4's, the main factor that makes this claim credible is courtesy of the IA64 ISA. I have criticized IA64 in the past for its lack of register offset addressing, an omission that increases average code size, especially for integer applications that make extensive use of composite data structures. In the case of pipeline design for fast data cache accesses, however, IA64's exclusive use of register indirect addressing allows the effective address generation (EAG) pipeline stage found in most CISC and RISC processor designs to be eliminated (an adder is still required to perform the optional base register update, it just isn't on the critical path). This idea is shown in Figure 4.
Figure 4 Elimination of EAG step in IA64 Data Cache Access
Keep in mind that in many data access code sequences an address offset calculation cannot be avoided, and on IA64 processors this requires a separate, explicit add immediate instruction to calculate the effective address. If McKinley does support one cycle data cache load-use latency, it would be an important factor in improving IA64 performance, especially on integer codes, because in-order MPUs generally do not tolerate data latency as well as out-of-order execution designs.
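The code-size cost of register-indirect-only addressing can be made concrete with a small instruction-count model: an access at a non-zero offset from a base register costs one extra add-immediate on an IA64-style ISA, while an ISA with register+offset addressing folds the offset into the load itself. A rough sketch:

```python
def instruction_count(accesses, has_offset_addressing):
    """Count instructions for a list of (base_reg, offset) loads.
    With register+offset addressing (most RISC and CISC ISAs) each
    access is a single load; with register-indirect-only addressing
    (IA64), a non-zero offset needs a separate add-immediate first."""
    count = 0
    for _base, offset in accesses:
        if offset != 0 and not has_offset_addressing:
            count += 1   # explicit add-immediate to form the address
        count += 1       # the load itself
    return count

# Field accesses into a composite data structure at offsets 0..24.
accesses = [("r32", off) for off in (0, 8, 16, 24)]
print(instruction_count(accesses, has_offset_addressing=True))   # 4
print(instruction_count(accesses, has_offset_addressing=False))  # 7
```

The trade the article describes is visible here: IA64 pays in extra integer instructions, and in exchange the load pipeline never needs an effective address generation stage.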
To keep the data cache as simple and fast as possible, McKinley will likely stick with the Itanium's choice of sending FP loads and stores directly to the L2 cache, bypassing the data cache completely. This also allows the data cache to be located in the most advantageous position possible for fast access from the integer and memory functional units, without concern for the length of the data path between the data cache and the FP units.
McKinley is noteworthy for being the first microprocessor to implement three levels of cache hierarchy on chip. The first, second, and third level caches are optimized for low latency, high bandwidth, and high capacity respectively. Although the sizes of the caches haven't been disclosed, there have been reports that the L3 cache is astoundingly large, 2.5 to 3.0 Mbytes in size. Both the overall transistor count attributed to McKinley and practical examples like the PA-8700 suggest that this is unlikely. One possible way McKinley could have such a massive L3 cache is through the use of a 4T SRAM cell. Unlike the classic 4T SRAM that used to dominate commodity SRAM devices, McKinley would use a modern variant that eliminates the need for internal pull-up resistors. The memory cell regenerates using current sourced from the precharged bitlines that is deliberately leaked through the access devices to retain the internal state. This cell is very small, but it requires careful circuit and process engineering.