Prelude
Recently, there have been some spirited discussions on the Real World Technologies forum regarding various architectural choices. Specifically, the tradeoffs made by the designers of the Pentium 4 (Willamette) processor were vigorously debated. The author of this article, being guilty by association, freely participated in the brouhaha. However, from this author’s perspective, the discussion ended on a rather unsatisfactory note, as it remained highly theoretical, with abstraction piled upon abstraction in attempts to convey difficult concepts. It is the personal opinion of this author that such purely theoretical discussions can leave different readers with a wide array of interpretations, including bewilderment. It is the author’s preference that where abstract concepts are discussed, concrete examples should be given as illustration to the extent possible. This article has been written with that principle as its central theme: wherever possible, concrete values will be cited and used.
Introduction
Discussions regarding the pipeline depth of a processor, and how that depth affects the design and implementation of its various functional units, have from time to time erupted onto Usenet. Software programmers are enjoying the improved performance that increasing processor frequencies have brought them, but some have become alarmed by the increasing latencies of various functions that have accompanied those higher clock frequencies. Some hardware designers have claimed, in the abstract, that some of these increasing functional unit latencies are inevitable consequences of more advanced processor microarchitectures. This article will carry on that tradition and attempt to justify some of those design decisions by citing research from industry and academia that supports the rationale of decreasing pipeline stage logic depth in pursuit of higher clock frequencies. The justification is that a processor may attain higher performance by sacrificing IPC (instructions per cycle), provided that the increase in frequency, as measured with the CPS metric (cycles per second), yields a net increase in performance as measured with the IPS metric (instructions per second), since IPS = IPC × CPS.
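To keep with the theme of concrete values, the short Python sketch below shows how this tradeoff is evaluated. The IPC and frequency figures are made-up, illustrative numbers (they are not measurements of the Pentium 4 or of any other processor); the only point is the arithmetic of IPS = IPC × CPS.

def instructions_per_second(ipc, clock_hz):
    # Sustained performance: instructions per cycle times cycles per second.
    return ipc * clock_hz

# Hypothetical "shallow pipeline" design: higher IPC, lower frequency.
shallow = instructions_per_second(ipc=1.2, clock_hz=1.0e9)   # 1.2e9 IPS

# Hypothetical "deep pipeline" design: lower IPC, higher frequency.
deep = instructions_per_second(ipc=0.9, clock_hz=2.0e9)      # 1.8e9 IPS

print("shallow pipeline: %.2f GIPS" % (shallow / 1e9))
print("deep pipeline:    %.2f GIPS" % (deep / 1e9))

In this made-up example the deeper pipeline comes out ahead because the doubling of frequency more than offsets the 25% loss in IPC; whether real designs land on the right side of that tradeoff is precisely the question the cited research addresses.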
As part of the discussion of increasing processor frequencies through the use of advanced microarchitectures, it is taken for granted that the cycle time of an advanced, state-of-the-art processor is fundamentally limited by the worst case pipe stage logic delay [1][3][4][5][8]. In the context of discussions of logic delays, processor architects use the FO4 metric as a process-neutral measure of delay that can be applied to abstract design and architectural discussions. In this article, we will explore some of the common uses of the FO4 metric and apply it, in an abstract manner, to an example circuit.

In figure 1, we show that the FO4 metric is simply the delay through an inverter that must provide enough output drive current to drive four identical copies of itself (a fanout of four). In his presentation slides, Horowitz claims that the FO4 metric is fairly stable across process, temperature and voltage variations [1]. Also shown in figure 1 are several other general assumptions, such as: longer wires are slower, and smaller (static logic: W/L ratio) devices are slower. Finally, not shown in figure 1, but contained in the presentation slides by Horowitz, are the following concepts that we will take for granted as being true: FO4-equivalent gate depth can be decreased by faster circuit types (dynamic logic), by more advanced logic design, such as designs that perform a computation in O(N log N) gates instead of O(N²), and by more advanced microarchitectures. We will illustrate some of these concepts in our implementation of the barrel shifter example later in the article.
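As a concrete, if simplified, illustration of how the FO4 metric is used, the following Python sketch converts a per-stage logic depth (expressed in FO4) into a cycle time and clock frequency. The FO4 delay of 30 ps and the 3 FO4 of latch/skew overhead are assumed values chosen purely for illustration; they are not taken from [1] and are not tied to any particular process.

FO4_DELAY_PS = 30.0   # assumed FO4 inverter delay for a hypothetical process, in picoseconds
OVERHEAD_FO4 = 3.0    # assumed per-stage latch, skew and jitter overhead, in FO4

def cycle_time_ps(logic_depth_fo4):
    # Cycle time = (useful logic depth + fixed per-stage overhead) * FO4 delay.
    return (logic_depth_fo4 + OVERHEAD_FO4) * FO4_DELAY_PS

def frequency_ghz(logic_depth_fo4):
    return 1.0e3 / cycle_time_ps(logic_depth_fo4)   # period in ps -> frequency in GHz

# A "shallow" stage with ~20 FO4 of logic versus a "deep" stage with ~10 FO4 of logic.
for depth in (20.0, 10.0):
    print("%4.0f FO4/stage -> %.2f GHz" % (depth, frequency_ghz(depth)))

Note that halving the per-stage logic depth does not double the frequency in this sketch, because the fixed latch and skew overhead becomes a larger fraction of the shorter cycle; this is one of the well-known costs of deeper pipelining.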