How the Heck do They Do That?
The obvious conclusion is that Intel is has rolled out some serious innovation in integer datapath microarchitecture for Willamette and possibly some new circuit techniques too. I will put forward one possible explanation of how the Willamette ALUs are implemented. It may not be exactly what Intel has done but the odds are it is not too far off. First of all, the double clocking of the ALU has a lot of people amazed. There are three basic techniques for making a small section of circuitry operate at twice the global clock rate. The first is to divide the unit into two sections and clock one normally on the rising edge of the clock and the second section on the falling edge of the clock and then interleave and combine the results. The second is to employ flip-flops (memory circuits) that operate on both edges of the clock The third method is to simply provide a special clock signal to the unit that toggles at twice the frequency as the processor clock.
Intel has apparently chosen the third option. In U.S. patent 6,023,182 Intel discloses a clock buffering circuit that, among other things, generates an out put clock at twice the frequency of the input clock using one shot pulse generators triggered on both the rising and falling edges of the input clock. This circuit can be dropped down anywhere on a microprocessor to drive a functional block such as an ALU at twice the rate of the globally distributed processor clock. The advantage of this scheme is that it avoids the formidable challenge of generating and distributing both a regular clock and a double frequency clock over the entire device. The operation and likely use of this clock frequency doubling buffer in Willamette is shown in Figure 6.
Figure 6. Hypothetical Willamette ALU Clocking scheme
So, we know how Intel can clock its ALUs at 3.0 GHz while surrounded by a sea of logic operating at 1.5 GHz., but I still haven’t explained how an addition can apparently be performed in the 0.20 ns or so needed to fit into a single stage of a pipeline operating at 3.0 GHz. Could Intel have discovered a new way of performing addition that only takes 40% of the time needed by standard adder circuits used today? This possibility is extremely remote. Engineers and mathematicians having been studying binary math and arithmetic circuit design since the days computers were built using electromechanical relays 55 years ago and all the possibilities have been pretty well gone over. There are known techniques, such as logarithmic adders, that can add faster than the carry lookahead design typically used in microprocessors, but the speed up for a 32 bit addition is not that great and comes at a fairly large cost in area.
As I said before, the really time consuming part of addition occurs when a carry is generated at the least significant digit and needs to be propagated up 32 bits to the most significant bit. This is analogous to manually adding 1 to 999999 using pencil and paper arithmetic. However there is a trick often used in multiplier arrays called carry-save addition which gets around this. A carry-save adder basically puts off the task of carry propagation by generating two results, a partial sum value and a bit vector of carries. A decimal version of a carry-save adder, given our previous example of 1 plus 999999, would say the answer is 999990 (partial sum) plus 000010 (carry vector). As we see here, the drawback of the carry-save adder is that we eventually need a true adder circuit to combine the partial sum and carry. Although it appears practically useless, carry-save addition has the advantage of being able to perform a long series of additions very quickly, and requiring a single regular addition at the end to generate the true answer.
So, I think the secret of the Willamette adder is that it cannot perform a complete addition in half a processor clock cycle. Instead, it performs a carry-save addition (which is very fast – basically the delay of a single bit full adder circuit which is likely 0.1 ns or less in 0.18 um CMOS) in the first adder pipeline stage, and feeds back the partial sum and carry vector to two of the three adder inputs through bypass MUXes. This allows a second instruction which depends on an addition result for one of its inputs to be started an ALU clock period (half a processor clock period) after the addition enters the adder. This still leaves us with the problem of performing a true full 32-bit addition in one or more subsequent pipeline stages within the adder. One solution is to place the first part of binary look-ahead carry (BLC) adder in the back end of the first adder pipeline stage, and the remainder of the BLC adder in the second and third stage. The BLC adder is a logarithmic class adder that has the advantage of being easily pipelined because it has the same number of logic levels from its inputs to its outputs. Spreading a BLC addition across two and half ALU clock periods gives about 0.5 ns to perform the operation after accounting for pipelining overhead and should be quite readily accomplished in Intel’s P858 0.18 um process. This hypothetical Willamette adder scheme is shown in Figure 7. The drawback of this adder design as shown is that it cannot issue a dependent addition two ALU clock cycles after the first addition (although this would be possible if the partial sum and carry vector were retained through one more pipe stage and brought back to the bypass MUXes.)
Figure 7 Hypothetical Willamette ALU Adder
Be the first to discuss this article!