Pentium 4 (part 2): How it Adds Up
The Pentium 4 may be best known for operating its ALUs at twice the frequency of its already very high processor clock rate. This capability has inspired much speculation, including my own, about how it was achieved (see What’s Up With Willamette – Part 2). The actual approach Intel took uses a remarkably straightforward logic design. The so-called fast ALUs and load address generation unit (AGU) operate by staggering operations across two 16-bit ALU slices and flag evaluation logic in three fast clock (FCLK) periods, as shown in figure 5. Each 16-bit slice achieves a very short propagation delay using globally reset, self-timed pre-charge domino logic. The ALU requires 8 inverting logic levels, 4 dynamic and 4 static, equivalent to 4 full domino stages. During the first phase of the FCLK period the first 6 logic stages of the ALU evaluate, while the last dynamic stage and output buffer are reset. During the second phase of the FCLK period the first 6 logic stages are reset in a ripple fashion, even as the final dynamic stage evaluates and drives the result out through the buffer. Hold time for the fast bypass path is guaranteed by the inclusion of a ‘jam latch’ on the final dynamic logic stage node.
Figure 5. Pentium 4 Staggered ALU Conceptual Design
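To make the staggering concrete, here is a minimal software sketch of a 32-bit add split across two 16-bit slices, with flags evaluated last. It models only the arithmetic, not the domino circuits or their timing. The function name `staggered_add` and the particular flags shown (zero, sign, carry) are my own illustrative choices, not something taken from Intel's design.

```python
MASK16 = 0xFFFF


def staggered_add(a, b):
    """Add two 32-bit values in three steps, one per FCLK period (illustrative)."""
    # FCLK 1: low 16-bit slice. Its carry out feeds the upper slice.
    low_sum = (a & MASK16) + (b & MASK16)
    low = low_sum & MASK16
    carry = low_sum >> 16

    # FCLK 2: upper 16-bit slice, consuming the carry from the low slice.
    high_sum = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    high = high_sum & MASK16
    carry_out = high_sum >> 16

    # FCLK 3: flag evaluation, once both halves are known.
    result = (high << 16) | low
    flags = {
        "zero": result == 0,
        "sign": bool(result >> 31),
        "carry": bool(carry_out),
    }
    return result, flags
```

The point of the split is that a dependent operation only needs the low half of this result to start its own low slice, so it can begin one FCLK period later rather than waiting for the full 32-bit value.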
Figure 5 shows a conceptual, rather than actual, representation of the fast ALU because the input latches on the upper 16-bit ALU slice and flag logic aren’t really present within the ALU, but instead are placed in a multi-stage bypass network. This network, shown in figure 6, performs the staggering of operands coming into the fast ALU section (from the register file or the output of long latency functional units like the shifter and multiplier), and the coalescing of results leaving the fast ALU cluster (to the register file or the inputs of a long latency functional unit). Arbitrarily long chains of data dependent additions, subtractions, bitwise logical, and sign-extend operations can be executed within the fast ALU cluster at a rate of one every FCLK cycle, or half a processor clock (MCLK) period.
Figure 6. Implementation of Fast ALU Cluster
The load operation exploits the fact that the 8 KB data cache can be read safely (albeit speculatively) using only the low 16 bits of the effective address. The calculation of the upper 16 bits of the address, the virtual to physical address mapping, and the cache tag checking are performed outside of the critical fast logic section, and complete after the load data has been consumed by another operation. The Pentium 4 handles load exceptions (TLB miss, cache miss, access violation, etc.) by being able to suppress and optionally rerun any load and dependent ALU operation that has been found to have consumed invalid data from the cache. Load-to-use latency for memory reads is 4 FCLK cycles, and loads can be initiated in any FCLK period if the cache was idle the previous period.
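The speculate-then-verify idea can be sketched in a few lines. Everything specific here is an assumption made for illustration: the 64-byte line size, the 4-way associativity, the use of the leftover low address bits as a partial tag to pick a way, and all class and function names. Address translation is ignored entirely, so the "full check" simply compares untranslated line addresses.

```python
LINE_BYTES = 64                              # assumed line size
WAYS = 4                                     # assumed associativity
SETS = 8 * 1024 // (LINE_BYTES * WAYS)       # 32 sets for an 8 KB cache


def split_low16(addr):
    """Split the low 16 address bits into (set index, partial tag, offset)."""
    low = addr & 0xFFFF
    offset = low % LINE_BYTES
    index = (low // LINE_BYTES) % SETS
    partial_tag = low // (LINE_BYTES * SETS)  # the 5 low-address bits left over
    return index, partial_tag, offset


class SpeculativeCache:
    def __init__(self):
        # Each set holds up to WAYS entries of (full line address, line data).
        self.sets = [[] for _ in range(SETS)]

    def fill(self, addr, line):
        index, _, _ = split_low16(addr)
        entries = self.sets[index]
        if len(entries) == WAYS:
            entries.pop(0)                    # simple FIFO replacement
        entries.append((addr // LINE_BYTES, line))

    def speculative_read(self, addr_low16):
        """Fast path: select data using only the low 16 address bits."""
        index, partial_tag, offset = split_low16(addr_low16)
        for line_addr, line in self.sets[index]:
            if split_low16(line_addr * LINE_BYTES)[1] == partial_tag:
                return line[offset], line_addr
        return None, None

    def load(self, addr):
        """Return (forwarded value, valid); valid=False means replay is needed."""
        value, line_addr = self.speculative_read(addr & 0xFFFF)
        # Slow path: the full comparison finishes after the value is forwarded.
        valid = line_addr == addr // LINE_BYTES
        return value, valid
```

A load whose upper address bits differ from the cached line still gets data back from the fast path, but `valid` comes back false. In the real processor that is the case where the load and any dependent operations are suppressed and rerun.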
The operation of the staggered ALU on a sequence of three dependent integer operations (A added to B, B OR-ed into C, and D subtracted from C) is shown in figure 7. The multi-stage bypass network takes the register operands A, B, C, and D from the register file and feeds them into the fast ALU cluster with the upper 16 bits of each operand delayed by one FCLK cycle. The updated values of B and C are bypassed directly into the inputs of the second and third operations respectively.
Figure 7. Staggered Operation of the Pentium 4 Fast ALU
The three operations take 3 FCLK cycles to perform from the viewpoint of feeding subsequent operations within the fast ALU cluster. An extra FCLK cycle is required to gather the lower and upper 16 bits of the final value of C for writing back to the register file or passing on to a long latency ALU operation like a shift or multiply. Obviously, for maximum performance on the Pentium 4 the compiler must use a code generation strategy that tries to keep the longest chains of dependent operations entirely within the fast ALU cluster, and avoids cross feeding results and operands back and forth between the fast ALU cluster and the long latency execution units. According to Intel, 60% to 70% of instructions executed contain operations that can be executed in the fast ALU cluster, while only about 1% to 2% of instructions perform long latency integer operations. The caveat here is that these figures are likely for the case of code compiled explicitly for the Pentium 4. In addition, the dynamic instruction frequency breakdown will still vary significantly from program to program. For example, many cryptographic applications make extensive use of shifts and/or multiplies, both long latency operations on the Pentium 4.
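The cycle accounting for a dependent chain can be checked with a small model. This is a sketch under my own simplifying assumptions: one operation issues per FCLK cycle, each low slice runs in the issue cycle, and each high slice runs one cycle later. The names `slice_op` and `run_chain` are hypothetical, and flag evaluation is left out.

```python
MASK16 = 0xFFFF


def slice_op(op, x, y, carry_in):
    """One 16-bit ALU slice: returns (16-bit result, carry or borrow out)."""
    if op == "add":
        total = x + y + carry_in
        return total & MASK16, total >> 16
    if op == "sub":
        total = x - y - carry_in
        return total & MASK16, int(total < 0)
    if op == "or":
        return x | y, 0
    raise ValueError(op)


def run_chain(regs, program):
    """Run (op, dest, src) steps, each meaning dest = dest op src."""
    lo = {r: v & MASK16 for r, v in regs.items()}
    hi = {r: (v >> 16) & MASK16 for r, v in regs.items()}
    trace = []
    for i, (op, dest, src) in enumerate(program):
        cycle = i + 1                          # FCLK cycle of the low slice
        lo[dest], carry = slice_op(op, lo[dest], lo[src], 0)
        trace.append((cycle, op, dest, "low"))
        hi[dest], _ = slice_op(op, hi[dest], hi[src], carry)
        trace.append((cycle + 1, op, dest, "high"))
    final = {r: (hi[r] << 16) | lo[r] for r in regs}
    chain_cycles = len(program)                # cost as seen by dependent fast ops
    writeback_cycles = len(program) + 1        # one extra cycle to coalesce halves
    return final, sorted(trace), chain_cycles, writeback_cycles


# The three-operation sequence from the text: B += A, C |= B, C -= D.
program = [("add", "B", "A"), ("or", "C", "B"), ("sub", "C", "D")]
regs = {"A": 0x0000FFFF, "B": 0x00000001, "C": 0x00F00000, "D": 0x00000001}
final, trace, chain_cycles, writeback_cycles = run_chain(regs, program)
```

In this model a chain of n dependent fast operations costs n FCLK cycles to any consumer inside the cluster and n + 1 to anything outside it, which is why interleaving shifts or multiplies into such a chain is costly.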