#### What about FO4 delays in actual circuits?

Numerous articles have been published describing specific implementations of common logic blocks. A search for “FO4” together with “multiplier” or “adder” in any popular search engine should quickly yield a large number of resources on the topic. We will cite just one paper here: Dhanesha et al. reported on the design and implementation of an IEEE 754 compliant floating point multiplier [2], with a reported multiplication latency of 23.3 FO4 (equivalent) gate delays. In his presentation slides, Horowitz also cited claims of ~7 FO4 for a fast 64 bit adder, and Agarwal et al. [8] cite a 5.5 FO4 64 bit adder as presented by Naffziger at ISSCC96. By comparison, the fast ALUs in the Willamette processor are said to be able to compute a 16 bit result in half of a cycle [4]. Using the previously given FO4 estimates of between 12 and 16, and accounting for possible latch overhead at the half cycle mark, we believe that the fast ALU in the Willamette processor would have to perform the 16 bit arithmetic within a timing budget of no more than 5 FO4 delays. Coincidentally, 5~6 FO4 delays is also the predicted lower bound of pipeline stage scalability [8].
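The half cycle budget can be worked through as simple arithmetic. The sketch below is only illustrative: it assumes the 12 to 16 FO4 estimates refer to a full clock cycle, and the latch overhead of 1 FO4 is a hypothetical placeholder, not a figure taken from any of the cited sources.

```python
def half_cycle_budget(cycle_fo4, latch_overhead_fo4):
    """FO4 delays left for logic in half a cycle, after latch overhead."""
    return cycle_fo4 / 2 - latch_overhead_fo4


# Hypothetical latch overhead, chosen only to illustrate the calculation.
ASSUMED_LATCH_OVERHEAD_FO4 = 1.0

# The two ends of the previously given cycle time estimate.
for cycle_fo4 in (12, 16):
    budget = half_cycle_budget(cycle_fo4, ASSUMED_LATCH_OVERHEAD_FO4)
    print(f"{cycle_fo4} FO4 cycle: {budget:.1f} FO4 left for the 16 bit ALU")
```

Under these assumptions, the aggressive 12 FO4 end of the range leaves 5 FO4 for the arithmetic itself. A larger latch overhead would shrink the budget further.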

#### What about a Barrel Shifter? Do Shifts really have to have a 4 cycle latency?

**Quick Answer:** Short of someone from Intel actually posting an
official response to this question, we can only engage in an academic exercise:
examine the structure of the shifter and decide whether a “better circuit” with
lower (cycle count) latency is indeed possible. The discussion and the use of
the barrel shifter are simply motivated by a desire to understand why the
latency of a shift instruction was increased from 1 cycle to 4 cycles between
the Pentium III processor generation and the Pentium 4 processor generation.

**Disclaimer:** This author has never seen the shifter circuit as
implemented on the Willamette (P4) processor. The shifters implemented here are
simply classical barrel shifters with some variations as drawn from the Weste
and Eshraghian text[7].

In figure A_{1}, we present a fairly classical, albeit redundant, left shifting
barrel shifter. Some of the muxes are in reality not needed, since there are no
shifted input bits to be selected from at these nodes; they are retained for
simplicity of the logic diagram. The design presented here is a simple log_{2}
left shifter. In figure A_{1}, we have marked separate elements of the barrel
shifter for further discussion. Specifically, the 2 input mux block will be
examined in figure A_{2}, and the control signals in figure A_{5}.
Fundamentally, the logic diagram shows a single input to the circuit: a 32 bit
wide bit vector marked a_{m}, where 0 <= m <= 31. The bits in the input vector
are shifted at each logic level, with each level controlled by a single signal,
ctrl_{n}. The ctrl_{n} signal determines whether, at the n^{th} level, the
input bits for that level are shifted by 2^{n} bits to the left.
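The level-by-level behavior described above can be sketched as a small behavioral model. This is not a description of any real circuit: the function names are hypothetical, index 0 is assumed to be the least significant bit, and vacated positions are assumed to be filled with zeros.

```python
WIDTH = 32   # bits in the input vector a_0 .. a_31
LEVELS = 5   # log2(32) mux levels, controlled by ctrl_0 .. ctrl_4


def mux2(ctrl, shifted, not_shifted):
    """2 input mux: pass the shifted input when ctrl is set."""
    return shifted if ctrl else not_shifted


def barrel_shift_left(bits, ctrl):
    """Shift a list of WIDTH bits left, one mux level per control signal.

    bits[m] models a_m; ctrl[n] models ctrl_n.
    """
    assert len(bits) == WIDTH and len(ctrl) == LEVELS
    for n in range(LEVELS):
        step = 1 << n  # level n shifts by 2^n positions
        bits = [
            # Assumption: positions with no shifted input receive a 0.
            mux2(ctrl[n], bits[m - step] if m >= step else 0, bits[m])
            for m in range(WIDTH)
        ]
    return bits


def shift_left(value, amount):
    """Convenience wrapper: integer in, integer out."""
    bits = [(value >> m) & 1 for m in range(WIDTH)]
    ctrl = [(amount >> n) & 1 for n in range(LEVELS)]
    out = barrel_shift_left(bits, ctrl)
    return sum(bit << m for m, bit in enumerate(out))


print(hex(shift_left(0x0000000F, 4)))
```

Each bit of the 5 bit shift amount drives exactly one level, so any shift from 0 to 31 is composed from at most five power-of-two shifts.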

From the logic diagram alone, it appears clear that a significant component of
both the area and the delay of this circuit will be attributable to the module
interconnects. Furthermore, the control signal at each level is abstractly
shown in the logic diagram as being bypassed from mux to mux. While this
serialized bypass chain would serve to limit the extent of the wiring for
control signals, this organization would in fact result in a delay chain that
is proportional to the width of the shifter. One possible alternative is to
buffer and distribute the control signal at each level with an FO4 inverter
tree. This is the assumption that we make in the more detailed discussion on the
distribution of the control signal in figure A_{5} below.

#### Conservative Implementations of a Mux

In figure A_{2}, we show the generic structure of a simple 2 input multiplexer
(MUX). The 2 input mux uses a shift control signal to determine whether the
value of the shifted input or the value of the not-shifted input is asserted
on the output of the logic circuit. The mux itself may be implemented with
numerous different types of circuits, but we have selected two conservative
implementations of the 2 input MUX and illustrated them as figures A_{3} and
A_{4} respectively. Each of the alternative designs is problematic in its own
way. However, since the goal here is simply to obtain a first order delay
approximation for the circuit, and short of an extensive implementation and
simulations to obtain a second order approximation, we believe that these
conservative implementations can be used as placeholders until more aggressive
and faster implementations are investigated. With the caveats as given, we see
that in either implementation, a 2 input mux would incur a gate depth of 2~3
FO4 per node. (Author’s note: No, you can’t use 5 levels of pass gates and
count each level as 1 FO4.)
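As a rough illustration of what the per-node figure implies, the sketch below multiplies the 2~3 FO4 mux depth by the five mux levels of a 32 bit log_{2} shifter. It counts only the mux depth along the data path and deliberately ignores interconnect delay and control signal buffering, so the result is a first order placeholder and not a figure stated in the text. The function name is hypothetical.

```python
WIDTH = 32
LEVELS = WIDTH.bit_length() - 1  # log2(32) = 5 mux levels


def mux_chain_delay(levels, fo4_per_mux):
    """First order data path delay: mux depth only, no wires or buffers."""
    return levels * fo4_per_mux


low = mux_chain_delay(LEVELS, 2)   # optimistic end of the 2~3 FO4 range
high = mux_chain_delay(LEVELS, 3)  # pessimistic end of the 2~3 FO4 range
print(f"{LEVELS} mux levels: {low} to {high} FO4 through the muxes alone")
```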

#### Control Signal Buffering

In figure A_{5}, we examine the control signal in greater detail. As stated
previously, the function of each control signal ctrl_{n} is to control the
shift/no shift decision at the n^{th} level. To do so, the same signal has to
be distributed to all 2 input muxes at the n^{th} level. The number of muxes at
a given level depends on the number of bits that must be shifted for that
level. Since we are interested in a timing analysis of the circuit, we will
take the worst case of ctrl_{0}, where the control signal must be distributed
to all 32 muxes at the 0^{th} level. Furthermore, we assume a loading factor of
2 for each module that the control signal must be sent to. Altogether, the
control signal would have to drive a load equivalent to 64 gates. In figure
A_{5}, we show that if we use simple inverters to buffer the control signal,
with each inverter supporting a fan out of 4, then the simple computation
log_{4}(64) = 3 tells us that 3 levels of FO4 inverters are needed to properly
buffer the control signal. Although 3 levels of inverter buffering would leave
the control signal inverted, we would only have to reverse the connections at
the mux for the circuit to operate in a functionally correct manner.
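The buffer depth calculation can be reproduced with a few lines of integer arithmetic. This is a minimal sketch using the numbers given above (32 muxes, a loading factor of 2, a fan out of 4). The function name is hypothetical, and integer multiplication is used instead of a floating point logarithm to avoid rounding surprises.

```python
def buffer_levels(load, fanout=4):
    """Number of inverter levels needed so that fanout**levels >= load."""
    levels = 0
    drive = 1
    while drive < load:
        drive *= fanout
        levels += 1
    return levels


MUXES_AT_LEVEL_0 = 32   # worst case: ctrl_0 reaches every mux
LOAD_PER_MUX = 2        # loading factor assumed in the text

total_load = MUXES_AT_LEVEL_0 * LOAD_PER_MUX   # 64 gate loads
levels = buffer_levels(total_load)             # levels of FO4 inverters
inverted = levels % 2 == 1                     # odd depth flips polarity

print(f"load={total_load}, levels={levels}, inverted={inverted}")
```

An odd number of inverter levels is why the mux connections have to be swapped: the polarity of the control signal arriving at the muxes is the opposite of the polarity at the source.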
