Willamette’s Arithmetic Logic Units (ALUs)
One of the biggest surprises disclosed about Willamette concerned its arithmetic logic units, or ALUs. An ALU is the functional block within a processor that actually computes. The ALU performs operations such as add, subtract, and compare, as well as bitwise logical instructions such as ANDing a register value with a bitmask constant. In most processors the ALU is set up to perform a new computation every clock cycle. This is accomplished by surrounding the ALU with input and output registers and inserting bypass multiplexers (MUXes) at the ALU inputs, as shown in Figure 5.
Figure 5. Conventional Single Cycle ALU Configuration
In this design, values from the register file, memory, or instruction immediate field are latched into the ALU operand registers on rising clock edge “N”. These values are then passed through a set of bypass MUXes into the ALU. The result of the selected operation on the input values is latched temporarily into the result register on the following rising clock edge “N+1” before being sent back to the register file (and also back around to the ALU if the result is an input to an instruction executing a clock cycle later).
To see how the speed of the ALU affects the maximum clock rate of the processor we need to examine the path that signals must propagate through in one clock cycle. In this design, the input data takes some time from the rising clock edge “N” to appear at the output of the operand registers (Delay 1), time to pass through the bypass MUXes (Delay 2), and time for the ALU to calculate the result (Delay 3). In order to reliably capture the operation result, this value must appear at the ALU output a little ahead of the second rising clock edge “N+1” (Setup time). We can express the maximum clock rate for this ALU pipeline organization as:

Maximum clock rate = 1 / (Delay 1 + Delay 2 + Delay 3 + Setup time)
In most processors the ALU propagation delay dominates this speed path, but the other delays cannot be ignored. A simple rule of thumb for high clock rate MPUs employing this type of microarchitecture is that the ALU delay can comprise at most about 65 to 70% of the minimum clock period. For example, a 1 GHz processor has a minimum clock period of 1.00 ns, so we would expect the worst case delay through the ALU to be no more than 0.65 to 0.70 ns.
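The clock-period budget described above can be expressed as a small calculation. This is only an illustrative sketch with made-up delay values; the function name and figures are not from Intel's disclosure.

```python
# Sketch of the Figure 5 clock-period budget. The delay names follow the
# article's definitions; the numeric values are purely illustrative.
def max_clock_rate_ghz(reg_delay_ns, mux_delay_ns, alu_delay_ns, setup_ns):
    """Maximum clock rate = 1 / (Delay 1 + Delay 2 + Delay 3 + Setup time)."""
    period_ns = reg_delay_ns + mux_delay_ns + alu_delay_ns + setup_ns
    return 1.0 / period_ns

# Rule of thumb: the ALU may consume ~65-70% of the minimum clock period.
period_ns = 1.0  # a 1 GHz processor
alu_budget_ns = (0.65 * period_ns, 0.70 * period_ns)
print(max_clock_rate_ghz(0.10, 0.10, 0.65, 0.15))  # example: 1 GHz total path
print(alu_budget_ns)
```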
ALU delay is generally dominated by the add/subtract circuit. Addition and subtraction are difficult operations to perform quickly because in the worst case a carry has to be propagated across all 32 bits. There are a variety of ways to build fast adders (I will talk only about adders from now on because addition and subtraction are nearly identical problems). In general, the faster an adder design, the more logic gates and chip area are required to implement it. Table 1 shows some representative adder circuit delay values for processors from the past fourteen years.
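The worst-case carry propagation can be made concrete with a bit-serial ripple-carry model. This is a behavioral sketch for illustration only; real adders use faster parallel structures (carry-lookahead, carry-select, and so on), which is exactly the area/speed trade-off the paragraph describes.

```python
# Sketch: add two 32 bit values one bit position at a time, counting how
# many positions produce a carry out -- a proxy for how far the carry
# chain ripples in the worst case.
def ripple_add32(a, b):
    total, carry, carry_bits = 0, 0, 0
    for i in range(32):
        abit, bbit = (a >> i) & 1, (b >> i) & 1
        total |= (abit ^ bbit ^ carry) << i          # sum bit
        carry = (abit & bbit) | (carry & (abit | bbit))  # carry out
        carry_bits += carry
    return total, carry_bits

# Worst case: 0xFFFFFFFF + 1 ripples a carry through all 32 positions.
print(ripple_add32(0xFFFFFFFF, 1))  # (0, 32)
print(ripple_add32(1, 2))           # (3, 0) -- no carries at all
```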
| Year | Processor | Technology | Clock Rate (MHz) | Delay |
|------|-----------|------------|------------------|-------|
| 1986 | Intel 386 | 1.5 um CMOS | 16 | 9 ns |
| 1989 | Intel 486 | 1.0 um CMOS | 25 | 5 ns |
| 1990 | HP PA-RISC | 1.0 um CMOS | 90 | 3.5 ns |
| 1994 | PowerPC 603 | 0.50 um CMOS | 80 | 3 ns |
| 1997 | PowerPC G3 | 0.30 um CMOS | 250 | 1 ns |
| 1999 | PowerPC (64 bit) | 0.22 um CMOS | | |
When Intel demonstrated the Willamette processor at their developer’s forum last month, not only were these 0.18 um aluminum interconnect devices running as fast as 1.5 GHz, but Intel also disclosed that the ALUs were in fact operating at twice the processor frequency, or 3.0 GHz! This is incredible when one considers that extrapolating the data in Table 1 makes it hard to see how Intel could build a 32 bit adder circuit with a delay shorter than about 0.50 ns in their P858 0.18 um process. Applying the rule of thumb previously described would seem to limit ALU operation to about 1.3 GHz, which is well short of 3.0 GHz. So it is quite obvious that Intel has implemented an ALU pipeline completely different from that shown in Figure 5, one unlike any yet seen in a publicly disclosed microprocessor or digital signal processor.
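One way to do the extrapolation mentioned above is a straight-line fit of log(delay) against year using the completed rows of Table 1. This is a rough, assumption-laden sketch (a simple exponential scaling model, not anything Intel published), but it lands in the same sub-nanosecond neighborhood as the article's ~0.50 ns figure.

```python
# Sketch: least-squares fit of log(adder delay) vs. year for Table 1's
# completed rows, then extrapolate to circa 2000 (the 0.18 um generation).
# The exponential trend model is an assumption for illustration.
import math

data = [(1986, 9.0), (1989, 5.0), (1990, 3.5), (1994, 3.0), (1997, 1.0)]
n = len(data)
sx = sum(year for year, _ in data)
sy = sum(math.log(d) for _, d in data)
sxx = sum(year * year for year, _ in data)
sxy = sum(year * math.log(d) for year, d in data)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
delay_2000 = math.exp(intercept + slope * 2000)
print(f"extrapolated adder delay in 2000: ~{delay_2000:.2f} ns")
```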
The Willamette’s 3.0 GHz ALUs are surprising, but existing techniques such as superpipelining (letting the ALU propagation delay stretch over two or more clock periods) are available that could easily explain such high clock rates. But Intel actually packed a double whammy into their Willamette ALU disclosure. Normal application of superpipelining provides high clock rates but at the cost of extra latency. If one stretches an addition across two pipe stages, then one wouldn’t expect an instruction that uses the result of that addition to be able to execute until two or more clock cycles after the instruction performing it. Yet Intel has clearly indicated that the Willamette can cascade the result of an add instruction into a subtraction and then through a move instruction and finally feed an xor instruction in just two clock cycles. This is as shocking as if the Olympic track and field 10 km champion also won the 100 m dash.
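The latency arithmetic behind that claim can be sketched with simple cycle accounting. The half-clock dependent-op latency below is an assumption consistent with an ALU running at twice the core clock; the two-clock figure models a conventionally superpipelined adder. Neither number is an official Intel specification.

```python
# Sketch: clocks needed to finish a serially dependent chain of simple
# ALU ops, given the dependent-op latency each design imposes.
# 0.5 clocks assumes a double-speed ALU; 2.0 assumes superpipelining.
def chain_clocks(num_ops, dependent_latency_clocks):
    return num_ops * dependent_latency_clocks

chain = ["add", "sub", "mov", "xor"]  # the cascade from the disclosure
print(chain_clocks(len(chain), 0.5))  # double-speed ALU: 2.0 clocks
print(chain_clocks(len(chain), 2.0))  # superpipelined:   8.0 clocks
```

The contrast is the point: naive superpipelining buys clock rate by adding dependent-op latency, yet Willamette appears to get the clock rate without paying that price.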