What about FO4 delays in actual circuits?
Numerous articles have been published describing specific implementations of common logic blocks. A search through any popular search engine for “FO4” together with “multiplier” or “adder” should quickly yield a large number of potential resources on the topic. We will cite just one paper here, by Dhanesha et al., who reported on their design and implementation of an IEEE 754 compliant floating point multiplier. That multiplier is reported to have a multiplication latency of 23.3 FO4 (equivalent) gate delays. In his presentation slides, Horowitz also cited claims of ~7 FO4 for a fast 64 bit adder, and Agarwal et al. cite a 5.5 FO4 64 bit adder as presented by Naffziger at ISSCC96. By comparison, the fast ALUs in the Willamette processor are said to be able to compute a 16 bit result in half of a cycle. Using the previously given FO4 estimates of 12 to 16, and accounting for possible latch overhead at the half cycle mark, we believe that the fast ALU in the Willamette processor would have to perform the 16 bit arithmetic with a timing budget of no more than the 5 FO4 delays described. Coincidentally, 5~6 FO4 delays is also the predicted lower bound of pipeline stage scalability.
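The half cycle budget above can be sketched as simple arithmetic. This is only an illustration: it reads the 12 to 16 FO4 estimates as full cycle times, and the latch overhead values of 1 to 3 FO4 are assumptions chosen for the example, not figures from any of the cited sources.

```python
def half_cycle_budget(cycle_fo4, latch_overhead_fo4):
    """FO4 delays left for logic in half a cycle after latch overhead.

    Both arguments are in FO4 units. The latch overhead is an assumed
    value for illustration, not a published number.
    """
    return cycle_fo4 / 2 - latch_overhead_fo4


if __name__ == "__main__":
    for cycle in (12, 16):              # cycle time estimates, in FO4
        for overhead in (1, 2, 3):      # assumed latch overhead, in FO4
            budget = half_cycle_budget(cycle, overhead)
            print(f"cycle={cycle} FO4, overhead={overhead} FO4 -> {budget} FO4 for logic")
```

Under these assumptions the logic budget ranges from 3 to 7 FO4, which brackets the roughly 5 FO4 figure discussed above.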
What about a Barrel Shifter? Do Shifts really have to have a 4 cycle latency?
Quick Answer: Short of someone from Intel actually posting an official response to this question, we can only engage in an academic exercise: examine the structure of the shifter, and decide whether a “better circuit” with lower (cycle count) latency is indeed possible. The discussion and the use of the barrel shifter are motivated simply by a desire to understand why the latency of a shift instruction increased from 1 cycle to 4 cycles between the Pentium III processor generation and the Pentium 4 processor generation.
Disclaimer: This author has never seen the shifter circuit as implemented on the Willamette (P4) processor. The shifters implemented here are simply classical barrel shifters with some variations as drawn from the Weste and Eshraghian text.
In figure A1, we present a fairly classical, albeit redundant, left shifting barrel shifter. Some of the muxes are in reality not needed, since there are no shifted input bits to be selected from at these nodes. These muxes are retained for simplicity of the logic diagram. The design presented here is a simple log2 left shifter. In figure A1, we have carefully marked separate elements of the barrel shifter for further discussion. Specifically, the 2 input mux block will be examined in figure A2, and the control signals will be examined further in figure A5. Fundamentally, this logic diagram shows the input to this circuit as a 32 bit wide bit vector marked a_m, where 0 <= m <= 31. The bits of the input vector pass through successive logic levels, with each level controlled by a single signal, ctrl_n. The function of the ctrl_n signal is to determine whether or not, at the nth level, the input bits for that level should be shifted by 2^n bits to the left.
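The behavior of such a log2 shifter can be sketched in a few lines of Python. This is a functional model only, written under the assumption that vacated low order bits are filled with zeros and that bits shifted past the top are discarded; the function name and the integer representation are illustrative and say nothing about the actual circuit or its timing.

```python
def barrel_shift_left(value, shift, width=32):
    """Behavioral model of a log2 left barrel shifter.

    Level n either passes its input through unchanged or shifts it
    left by 2**n bits, depending on ctrl_n, which is simply bit n of
    the shift amount.
    """
    mask = (1 << width) - 1
    levels = (width - 1).bit_length()    # 5 levels for a 32 bit shifter
    word = value & mask
    for n in range(levels):
        ctrl_n = (shift >> n) & 1        # one control signal per level
        if ctrl_n:
            # Every 2 input mux at this level selects the shifted input.
            word = (word << (1 << n)) & mask
        # Otherwise every mux selects the not-shifted input.
    return word


if __name__ == "__main__":
    print(hex(barrel_shift_left(0x00000001, 5)))
    print(hex(barrel_shift_left(0x80000001, 1)))
```

A shift by 5, for example, asserts ctrl_0 and ctrl_2, so the word is shifted by 1 bit at level 0 and by 4 bits at level 2, and passes through the other three levels unchanged.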
From the logic diagram alone, it appears clear that this circuit will have a significant area as well as delay component attributed to the module interconnects. Furthermore, the control signal at each level is abstractly shown in the logic diagram as being bypassed from mux to mux. While this serialized bypass chain would serve to limit the extent of the wiring for control signals, this organization would in fact result in a delay chain that is proportional to n, the width of the shifter. One possible alternative is to buffer and distribute the control signal at each level with a FO4 inverter tree. This is the assumption that we take in the more detailed discussion on the distribution of the control signal in figure A5 below.
Conservative Implementations of a Mux
In figure A2, we show the generic structure of a simple 2 input multiplexer (MUX). The function of the 2 input mux is to use a shift control signal to determine whether the value of the shifted input or the value of the not-shifted input is asserted on the output of the logic circuit. The mux itself may be implemented with many different types of circuits, but we have selected two conservative implementations of the 2 input MUX and illustrated them as figures A3 and A4 respectively. Each of the alternative designs is problematic in its own way. However, since the goal here is simply to obtain a first order delay approximation for the circuit, and short of an extensive implementation and simulations to obtain a second order approximation, we believe that these conservative implementations can be used as placeholders until more aggressive and faster implementations are investigated. With the caveats as given, we see that in either implementation, a 2 input mux would incur the cost of 2~3 FO4 gate depth per node. (Author’s note: No, you can’t use 5 levels of pass gates and count each level as FO4 of 1.)
Control Signal Buffering
In figure A5, we examine the control signal in greater detail. As stated previously, the function of each control signal ctrl_n is to control the shift/no shift decision at the nth level. To do so, the same signal has to be distributed to all 2 input muxes at the nth level. The number of muxes at a given level depends on the number of bits that must be shifted for that level. Since we are interested in a timing analysis of the circuit, we will take the worst case of ctrl_0, where the control signal must be distributed to all 32 muxes at the 0th level. Furthermore, we assume a loading factor of 2 for each module that the control signal must be sent to. Altogether, the control signal would have to drive a load of 64 gates. In figure A5, we show that if we use simple inverters to buffer the control signal, with each inverter supporting a fan out of 4, then the simple computation of log4(64) = 3 reveals that 3 levels of FO4 inverters are needed to properly buffer the control signal. Although 3 levels of inverter buffering would leave the control signal inverted, we would only have to reverse the connections for the mux for the circuit to operate in a functionally correct manner.
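The buffering arithmetic above can be checked with a short sketch. The function name and parameters are illustrative, and the model treats a single inverter input as the unit load seen by whatever drives the control signal, which is a simplifying assumption.

```python
def buffer_levels(num_muxes, load_per_mux=2, fanout=4):
    """Count the inverter levels needed to fan a control signal out.

    Returns (total_load, levels, inverted). Each added level of
    inverters multiplies the load that can be driven by `fanout`.
    """
    total_load = num_muxes * load_per_mux
    levels = 0
    drive = 1                      # load the unbuffered signal can drive
    while drive < total_load:
        drive *= fanout            # one more level of FO4 inverters
        levels += 1
    inverted = levels % 2 == 1     # an odd number of inverters flips polarity
    return total_load, levels, inverted


if __name__ == "__main__":
    # Worst case ctrl_0: 32 muxes, loading factor of 2 each.
    print(buffer_levels(32))
```

For the ctrl_0 case this gives a total load of 64, 3 inverter levels, and an inverted control signal, matching the log4(64) = 3 computation and the note about reversing the mux connections.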