Starting Point: The Design Willamette Would Replace
To understand the significance of the Willamette, one must examine the design it is intended to replace. Five years ago the P6 core was first delivered in the form of the Pentium Pro, a high-end processor for technical workstations and servers. The P6 was a remarkable achievement: an out-of-order execution superscalar x86 processor whose integer performance briefly eclipsed that of the fastest RISC processors. Intel went on to sell about 7 million Pentium Pros, a huge figure that would represent the jackpot for most high-end processor families but is still quite modest by mainstream x86 standards.
But Intel had great plans for this versatile core. The P6 core rapidly proliferated into every market Intel targeted, from the low-end Celeron through the mainstream Pentium II and III to the high-end Xeons. The P6 core has been implemented in five different processes and has received two major instruction set extensions, as shown in Table 1.
| Year | Process | ISA Extensions | Core / Product |
|------|---------|----------------|----------------|
| 1995 | 0.5 um BiCMOS | – | "P6" Pentium Pro |
| 1995 | 0.35 um BiCMOS | – | "P6" Pentium Pro |
| 1997 | 0.35/0.28 um CMOS | MMX | "Klamath" Pentium II |
| 1998 | 0.25 um CMOS | MMX | "Deschutes" Pentium II / Xeon |
| 1999 | 0.25 um CMOS | MMX, KNI/SSE | "Katmai" Pentium III / Xeon |
| 1999 | 0.18 um CMOS | MMX, KNI/SSE | "Coppermine" Pentium III / Xeon |
For the last several years the P6 core has been at the heart of just about every processor Intel has sold, and it is responsible for over $20 billion in annual sales and billions in profit for the chip giant. The basic design of the P6 core is shown in Figure 1. It is shown in its latest incarnation, the Pentium III, with 16 KB L1 caches and MMX and KNI/SSE functional unit extensions (the original Pentium Pro P6 design had 8 KB L1 caches).
Figure 1 Organization of the P6 Processor Core
The primary characteristic of the P6 core is that it can decode up to one complex and two simple x86 instructions per clock cycle. The P6 instruction decoders effectively translate x86 instructions into one or more simpler operations encoded as control information parcels known as micro-ops, or uops. A uop is a fixed-length, 118-bit control word that encodes an operation, two operand sources, and a result destination. The source and destination fields are wide enough to hold a complete 32-bit operand value, such as an immediate value or destination address offset. Uops are fed into a reorder buffer, a functional unit that tracks the overlapped out-of-order (OOO) execution of up to 40 uops at once. Although the three decoders can theoretically generate up to six uops per clock cycle, the reorder buffer can only accept, process, and output three uops per cycle to the reservation station.
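The decode and reorder machinery described above can be sketched abstractly. The field names, types, and widths below are purely illustrative; only the 118-bit total, the operation/two-source/one-destination layout, the 40-entry capacity, and the 3 uop per cycle limit come from the text.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Uop:
    """Illustrative model of a P6 micro-op: one operation, two operand
    sources, and a result destination. The real uop is a 118-bit control
    word whose source and destination fields are wide enough to hold a
    full 32-bit immediate or address offset."""
    op: str    # operation, e.g. "add" or "load"
    src1: int  # source operand 1 (register id or 32-bit immediate)
    src2: int  # source operand 2
    dest: int  # result destination

class ReorderBuffer:
    """Tracks in-flight uops. The P6 reorder buffer holds up to 40 uops
    and can accept at most 3 per clock cycle from the decoders."""
    CAPACITY = 40
    WIDTH = 3

    def __init__(self):
        self.entries = deque()

    def accept(self, uops):
        """Take up to 3 uops this cycle, space permitting; return the rest."""
        taken = 0
        while uops and taken < self.WIDTH and len(self.entries) < self.CAPACITY:
            self.entries.append(uops.pop(0))
            taken += 1
        return uops  # uops left stalled behind the decoders

rob = ReorderBuffer()
pending = [Uop("add", 1, 2, 3) for _ in range(5)]
leftover = rob.accept(pending)
print(len(rob.entries), len(leftover))  # 3 accepted, 2 stalled
```

This also illustrates why the decoders' theoretical peak of six uops per cycle cannot be sustained: the reorder buffer's three-per-cycle intake is the bottleneck.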
The OOO execution engine can issue up to three uops per clock cycle, each to one of five execution resources: two execution units, two address generation units, and a store data unit. When MMX, and later the KNI/SSE extensions, were added to the x86 instruction set, it was primarily the two execution units that were expanded. The port 0 execution unit originally supported only integer and basic x87-style FP instructions; it was extended to support MMX instructions and the SIMD FP multiplication functions of KNI/SSE. The port 1 execution unit supported only integer instructions in the Pentium Pro but was later extended to support MMX, and ultimately the SIMD FP addition functions of KNI/SSE.
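The evolution of the two execution units can be summarized as a small capability table. This is a sketch paraphrasing the paragraph above; the instruction-class labels are informal, not Intel's terminology.

```python
# Execution port capabilities, before and after the MMX and KNI/SSE
# extensions, as described in the text.
PORTS = {
    "Pentium Pro": {
        "port 0": ["integer", "x87 FP"],
        "port 1": ["integer"],
    },
    "Pentium III": {
        "port 0": ["integer", "x87 FP", "MMX", "SSE FP multiply"],
        "port 1": ["integer", "MMX", "SSE FP add"],
    },
}

def can_issue(core, port, op_class):
    """Can the given core issue this class of uop to this port?"""
    return op_class in PORTS[core][port]

print(can_issue("Pentium III", "port 1", "SSE FP add"))  # True
print(can_issue("Pentium Pro", "port 0", "MMX"))         # False
```

Note the asymmetric split of SSE work: SIMD FP multiplies go to port 0 and SIMD FP adds to port 1, so a balanced mix of the two can occupy both ports at once.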
Figure 2 Basic P6 Execution Pipeline
The basic pipeline organization of the P6 design is shown in Figure 2. Simple instructions take a minimum of 12 clock cycles to flow through the pipeline. That is a minimum because of possible delays in the reservation station between pipe stages 8 and 9, or in the reorder buffer between pipe stages 10 and 11. Each x86 instruction executed on a P6 generates between 1.5 and 2.0 uops, with the preponderance of code closer to 1.5 uops per instruction when running typical PC applications. When running the SPEC95 benchmark suite, the uops per instruction figure ranges from 1.2 to 1.7, with an average around 1.35. This may indicate that the Intel reference compiler, typically used for benchmarking, uses a more RISC-like code generation strategy, favoring register-to-register instructions, than the compilers typically used by commercial application developers. The average uops per instruction and average clock cycles per instruction (CPI) for several applications and benchmarks are shown in Table 2.
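The relationship between uops per instruction, CPI, and the core's three-uop-per-cycle limit can be checked with a little arithmetic. The sample figures below are merely illustrative values drawn from the ranges quoted above, not measurements.

```python
def uop_throughput(uops_per_instr, cpi):
    """Sustained uops retired per clock cycle:
    (uops / instruction) / (cycles / instruction)."""
    return uops_per_instr / cpi

# Illustrative: typical PC code at ~1.5 uops/instruction and a CPI of 1.0
# needs 1.5 uops retired per cycle, well under the 3 uop/cycle ceiling.
demand = uop_throughput(1.5, 1.0)
print(demand, demand <= 3.0)  # 1.5 True

# Conversely, the 3 uop/cycle limit implies a best-case CPI floor of
# uops_per_instruction / 3, i.e. 0.5 CPI at 1.5 uops/instruction.
print(1.5 / 3.0)  # 0.5
```

In other words, the higher the uop expansion ratio of the compiled code, the higher the floor on achievable CPI, which is one reason the uops per instruction figures in Table 2 matter.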
| Benchmark | Uops/Instruction | Cycles per Instruction (CPI) |
|-----------|------------------|------------------------------|
Although uops can go through the P6 pipeline in as few as 9 to 10 cycles, about 90% of them take 12 to 14 clock cycles from decode to retirement. The uop lifetime distribution is actually bimodal, with a minor lobe way out at 50 to 120+ cycles (depending on the processor clock to bus clock frequency ratio and the latency of main memory) for memory uops, and uops dependent on them, that miss in both levels of cache. There is also a bimodal distribution of uop delay in passing through the reservation station: about half of the uops pass through in the minimum time, while the other half are delayed three extra cycles waiting for the result of a preceding ALU operation. According to Bob Colwell, the P6 Architecture Manager, when all of these second order effects are taken into account the so-called 12 pipe stage P6 has an effective length of 15 to 20 cycles for integer instructions and 30 or so for floating point instructions.
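Colwell's "effective length" figure can be read as a latency-weighted average over the bimodal uop lifetime distribution. The split below is an assumed illustration consistent with the percentages above (90% of uops in the 12 to 14 cycle band, a minor lobe for cache-missing uops), not measured data.

```python
def effective_pipeline_length(distribution):
    """Weighted-average uop lifetime in cycles.
    distribution: list of (fraction_of_uops, latency_in_cycles) pairs."""
    assert abs(sum(f for f, _ in distribution) - 1.0) < 1e-9
    return sum(f * lat for f, lat in distribution)

# Assumed illustrative split: 90% of uops at ~13 cycles (the 12-14 band),
# 10% out at ~60 cycles behind uops that miss in both levels of cache.
dist = [(0.90, 13), (0.10, 60)]
print(effective_pipeline_length(dist))  # ≈ 17.7, within Colwell's 15-20 range
```

The point of the exercise is that even a small tail of long-latency uops pulls the effective pipeline length well past the nominal 12 stages.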