How can FO4 be used?
So far we have proposed FO4 as a generic, process-neutral metric that can serve in architectural discussions as a first-order approximation of expected circuit delays, in lieu of exact, process-specific SPICE simulation results. In figure 2, we apply the metric to a generic logic block to examine the benefits and consequences of pipelining it. Assume a logic block with a delay of 100 FO4. This block may compute pi to the 10th digit in a single cycle, but it is a slow circuit. A processor with this block in its critical timing path will have a maximum operating frequency proportional to the inverse of the sum of the logic block delay and the latch overhead, both measured in FO4. In case 2 of figure 2, we divide the logic block into two roughly equal stages and insert a bank of latches to hold temporary values at the interface between the sub-blocks. The clock frequency of the processor may then be raised to a level proportional to the inverse of the sum of the longer of the two sub-block delays and the latch overhead. Finally, in case 3 of figure 2, we repeat the exercise and divide the logic block into 4 stages, and the limit on the cycle time drops to 29 FO4 plus latch overhead.
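The relationship between stage depth, latch overhead, and clock rate can be sketched numerically. The Python snippet below is a minimal illustration and is not taken from figure 2: the stage splits for cases 2 and 3 and the 3 FO4 latch overhead are assumptions chosen for the example (only the 100 FO4 total and the 29 FO4 longest stage in case 3 come from the text), and the function names are hypothetical.

```python
def cycle_time_fo4(stage_depths_fo4, latch_overhead_fo4):
    """Cycle time in FO4 delays: the slowest stage plus the latch overhead."""
    return max(stage_depths_fo4) + latch_overhead_fo4


def speedup_over_unpipelined(stage_depths_fo4, latch_overhead_fo4):
    """Clock-rate gain relative to the same logic kept in a single stage."""
    total_logic = sum(stage_depths_fo4)
    unpipelined = cycle_time_fo4([total_logic], latch_overhead_fo4)
    return unpipelined / cycle_time_fo4(stage_depths_fo4, latch_overhead_fo4)


# Assumption: latch overhead of 3 FO4 per stage (illustrative value).
LATCH_OVERHEAD_FO4 = 3

# Assumption: hypothetical stage splits that each sum to 100 FO4.
cases = {
    "case 1 (1 stage)": [100],
    "case 2 (2 stages)": [52, 48],
    "case 3 (4 stages)": [29, 25, 23, 23],
}

for name, stages in cases.items():
    cycle = cycle_time_fo4(stages, LATCH_OVERHEAD_FO4)
    gain = speedup_over_unpipelined(stages, LATCH_OVERHEAD_FO4)
    print(f"{name}: cycle = {cycle} FO4, clock-rate gain = {gain:.2f}x")
```

Under these assumptions, the four-stage version is limited by its 29 FO4 stage and pays the latch overhead in every cycle, so the clock rate improves by roughly 3.2x rather than 4x.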
Several points are worth noting: increased pipeline depth does not yield a proportional increase in clock rate, because the latch overhead is paid in every stage; pipe stage balance is important for frequency scalability; and throughput, measured in computations per second, can in theory keep increasing until the sub-block logic depth equals one. However, as Sprangle et al. and Hrishikesh et al. discussed in their respective papers in the proceedings of ISCA 29, ultimate processor performance is limited by factors such as cache miss rates and branch prediction accuracy. These papers argue that there is an “optimal” point up to which a processor can be aggressively scaled to achieve maximum performance, and beyond which the over-scaled processor loses performance. In fact, Hrishikesh et al. report the optimum as 6 to 8 FO4 inverter delays, and incorporate that claim into the title of their paper.
What are the logic depths of current state of the art processors, as measured with the FO4 metric?
Several sources discuss current state-of-the-art (SOA) processors in terms of the FO4 metric:
- Horowitz, Page 38: “Current” SOA is approximately 16 FO4.
- Hrishikesh et al.: Current Intel processors are ~12 FO4.
- Chinnery et al.: Alpha 21264 has 15 FO4.
- Chinnery et al.: Custom IBM PPC test chip, 1 GHz @ 0.25um, FO4 of 13.
As an aside, Motorola’s 7450 RISC Family Technical Summary also includes a section on the increased pipeline length, the reduced pipe stage logic depth, and the resulting increases in latency. Although Motorola does not explicitly state that its “Logic Inversions per Cycle” figure is the FO4 metric, the effects of reducing logic depth and the associated increase in clock rate are universal (pages 46 through 48).
Furthermore, Sprangle et al. provided some hints as to the FO4 depth of the Willamette processor in their recent ISCA 29 paper. Citing personal communications with Rajesh Kumar, they reported that “in a standard 0.18um process, a typical flop equates to about 3 FO4 delays, with the FO4 delay being about 25ps”. Sprangle et al. further report that the Willamette processor design assumes an average, nominal latch overhead of 90 ps per cycle. Given that the Willamette processor was publicly offered for sale at a frequency of 2 GHz as manufactured on Intel’s 0.18um process technology, we can perhaps assume a cycle time of 500 ps for our computations at the 0.18um node. With a cycle time of 500 ps and a nominal overhead of 90 ps, simple arithmetic leaves 410 ps of timing budget in which logic can propagate through a single pipe stage. Assuming an FO4 delay of 25 ps, as many as 16.4 FO4 (equivalent) logic delays may fit into the useful computation time of a single cycle. That said, 16.4 FO4 is unlikely to be the gate depth actually used in the Willamette processor, because the arithmetic completely ignores wiring delays and the other operating margins that must be maintained to guarantee the functional correctness of the processor. It is, however, useful as an upper bound on the possible FO4 depth of the Willamette processor.
(Author’s note: Agrawal et al. claim that Intel’s Pentium III processor has a per-stage FO4 delay of 15. This number appears to deviate substantially from the other numbers reported here, and may reflect either an error or a different manner of FO4 computation. One of the stated goals of Willamette (P4) was to obtain a higher clock frequency through the use of logic with a lower FO4 depth than its predecessor. If the pipeline stages in the Pentium III processor have an FO4 depth of 15, the Willamette (P4) must have an FO4 depth in the neighborhood of 8 using an equivalent metric. If this number is correct, this alternative methodology may treat wire delay as a component separate from logic delay in the FO4 delay computation.)
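The upper-bound arithmetic for Willamette can be written out as a short calculation. The sketch below uses only the figures quoted in the text (2 GHz, 90 ps of latch overhead per cycle, 25 ps per FO4). The function name is hypothetical, and the calculation deliberately ignores wire delay and operating margins, so the result is an upper bound rather than an actual design value.

```python
def logic_budget_fo4(freq_hz, latch_overhead_ps, fo4_delay_ps):
    """Upper bound on per-stage logic depth, in FO4 delays.

    Ignores wire delay and operating margins, so the real depth is lower.
    """
    cycle_ps = 1e12 / freq_hz                  # cycle time in picoseconds
    logic_ps = cycle_ps - latch_overhead_ps    # time left for logic per cycle
    return logic_ps / fo4_delay_ps


# Figures quoted in the text for Willamette on a 0.18um process.
budget = logic_budget_fo4(freq_hz=2e9, latch_overhead_ps=90, fo4_delay_ps=25)
print(f"Upper bound on logic depth: {budget:.1f} FO4 per stage")
```

Changing `freq_hz` or `fo4_delay_ps` shows how sensitive the bound is to the assumed shipping frequency and to the per-process FO4 delay.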