Pentium 4 (Part 1): The Clock Factory
Intel presented two papers that revealed technical details of the Pentium 4’s clocking scheme and the design of its well-publicized double frequency integer arithmetic logic units (ALUs). The 0.18 µm Pentium 4, developed under the code name Willamette, has a rather intricate clocking scheme. It accepts a 100 MHz bus clock input, which is used as the reference input to two PLLs, one for the processor core, and one for chip input/output (I/O). The I/O PLL generates a 400 MHz clock, which is used to control the timing of outbound (write operation) data and the associated source synchronous strobe signals driven out over the ‘quad pumped’ system interface. The core PLL is used to generate the time base for three different clock frequencies used by the Pentium 4 processor core. In the case of a 1.5 GHz device the core PLL generates a 1.5 GHz ‘core clock’. This clock is distributed throughout the processor core using a triple 3-stage binary tree of clock repeaters as shown in Figure 3.
Figure 3. Pentium 4 Clock Generation and Distribution Scheme
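The relationship between the bus clock and the two PLL outputs is simple ratio arithmetic. The short sketch below works it out for the 1.5 GHz part described above; the function and constant names are illustrative only and do not come from Intel's papers.

```python
from fractions import Fraction

# Frequencies quoted in the article, in MHz.
BUS_CLOCK_MHZ = 100     # reference input to both PLLs
IO_CLOCK_MHZ = 400      # output of the I/O PLL
CORE_CLOCK_MHZ = 1500   # output of the core PLL on a 1.5 GHz device


def pll_multiplier(output_mhz, reference_mhz):
    """Return the effective multiplication ratio of a PLL as an exact fraction."""
    return Fraction(output_mhz, reference_mhz)


io_mult = pll_multiplier(IO_CLOCK_MHZ, BUS_CLOCK_MHZ)
core_mult = pll_multiplier(CORE_CLOCK_MHZ, BUS_CLOCK_MHZ)
print(f"I/O PLL multiplier:  {io_mult}x")
print(f"Core PLL multiplier: {core_mult}x")
```

For these numbers the I/O PLL multiplies the reference by 4 and the core PLL by 15.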
The clock tree drives 47 domain buffers distributed throughout the chip. The output of the domain buffers is a 1.5 GHz clock called GCLK. Each local GCLK drives a number of local clock macros that generate the actual clock signals used by flip-flops and latches. There are a variety of different local clock macro elements used in the Pentium 4. Besides incorporating the conventional gating and pulse stretching features used for dynamic power management and timing problem diagnosis, the clock macros are also capable of generating the 3.0 GHz pulsed fast clock (FCLK) used by the famous double frequency ALUs, pulsed and conventional versions of the 1.5 GHz medium clock (MCLK) used by the majority of the processor logic, and the 750 MHz slow clock (SCLK) used by the trace cache and bus interface unit. The derivation of the various processor clocks from GCLK by the local clock macros is shown in Figure 4. Pulsed versions of MCLK and SCLK are provided due to the extensive use of pulsed latches as flip-flops to save area and power, and reduce pipeline overhead.
Figure 4. Processor Clocks and Their Derivation from GCLK
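To put the three derived clocks in perspective, the following sketch computes their frequencies and cycle times from a 1.5 GHz GCLK. The ratios (2x, 1x, and 1/2x) follow from the frequencies quoted above; the names `CLOCK_RATIOS`, `derived_clocks`, and `period_ps` are hypothetical helpers, not anything from the clock macro design itself.

```python
GCLK_MHZ = 1500  # global clock delivered by the domain buffers (1.5 GHz part)

# Ratio of each local clock to GCLK, taken from the frequencies in the article.
CLOCK_RATIOS = {
    "FCLK": 2.0,   # fast clock for the double frequency ALUs
    "MCLK": 1.0,   # medium clock for most of the processor logic
    "SCLK": 0.5,   # slow clock for the trace cache and bus interface unit
}


def derived_clocks(gclk_mhz):
    """Return the frequency in MHz of each local clock derived from GCLK."""
    return {name: gclk_mhz * ratio for name, ratio in CLOCK_RATIOS.items()}


def period_ps(freq_mhz):
    """Convert a frequency in MHz to a cycle time in picoseconds."""
    return 1e6 / freq_mhz


for name, mhz in derived_clocks(GCLK_MHZ).items():
    print(f"{name}: {mhz:.0f} MHz, period {period_ps(mhz):.1f} ps")
```

At 3.0 GHz the FCLK cycle is only about 333 ps, which shows how little time the double frequency ALUs have to work with.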
The Pentium 4 also incorporates advanced features for custom deskewing each device during testing. Each of the 47 domain clock buffers incorporates delay adjustment capabilities using a programmable delay element controlled by a 5-bit register. The clock distribution system also includes 46 phase comparator circuits placed between adjacent domain clocks that are observable from a common test access port. This allows the 47 processor clock domains within the processor to be deskewed during test using a binary search algorithm. When this process is completed, the delay setting for each domain buffer is permanently programmed using fuse arrays. The inter-domain clock skew of a raw Pentium 4 device may exceed 60 ps, but after domain buffer deskewing that figure can be reduced to about 16 ps. The reduced level of clock skew can increase maximum operating frequency by up to 10%. This programmable domain clock buffer scheme also provides the capability of deliberately introducing controlled clock skew between various regions of the processor core. By passing timing slack from pipeline stages that have short logic delays to pipe stages with the longest logic delays, it is possible to further raise the maximum operating frequency. Intel reported that tests with early silicon samples showed that devices could be promoted by up to one full speed bin using this technique.
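The article does not describe the deskew procedure beyond calling it a binary search, so the following is only a rough sketch of how such a search might set one domain's 5-bit delay register against a single neighboring domain. Everything beyond the register width is an assumption made for illustration: the 2 ps step size (`STEP_PS`), the idealized comparator that reports only lead or lag, the MSB-first search order, and all function names.

```python
BITS = 5        # delay register width stated in the article
STEP_PS = 2.0   # ASSUMED delay added per register step (not from the article)


def arrival_ps(raw_ps, code):
    """Clock arrival time of a domain: its raw arrival plus the programmed delay."""
    return raw_ps + code * STEP_PS


def comparator_lags(domain_ps, neighbor_ps):
    """Idealized phase comparator: True if the domain clock arrives after its neighbor."""
    return domain_ps > neighbor_ps


def find_delay_code(raw_ps, neighbor_ps):
    """Binary search (MSB first) for the largest code that does not make the domain lag."""
    code = 0
    for bit in reversed(range(BITS)):
        trial = code | (1 << bit)
        # Keep the bit only if the domain still arrives no later than its neighbor.
        if not comparator_lags(arrival_ps(raw_ps, trial), neighbor_ps):
            code = trial
    return code


# Hypothetical example: the domain clock arrives 27 ps ahead of its neighbor.
neighbor = 40.0
raw = 13.0
code = find_delay_code(raw, neighbor)
residual = neighbor - arrival_ps(raw, code)
print(f"delay code = {code}, residual skew = {residual:.1f} ps")
```

The search settles after five comparator readings, one per register bit, rather than stepping through all 32 settings. When the required delay is within the register's range, the residual skew is smaller than one delay step. Extending this to all 47 domains and 46 comparators, and avoiding the accumulation of residual error from one domain to the next, is beyond what this sketch attempts.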