The Out-of-Order Engines – Execution Units
Once operations enter the schedulers, they wait until their source operands are ready. The scheduler then dispatches the oldest ready instruction, along with its operands, to the appropriate functional unit. The integer functional units in Barcelona are mostly unchanged from the K8. The three integer ALUs in both K8 and Barcelona can execute most instructions and are largely symmetric. The two exceptions are that only the first ALU has an integer multiplier, and only the third handles POPCNT (population count) and similar instructions. Note that the forwarding network for Barcelona has been omitted from Figure 4 because it is far too complex to display in an organized manner.
Figure 4 – Comparison of Execution Units
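The oldest-first dispatch policy described above can be illustrated with a small sketch. This is a toy model rather than AMD's actual scheduler logic: the Op class, the pick_oldest_ready function and the register names are all hypothetical, and the model ignores the per-lane restrictions (multiplier on the first ALU, POPCNT on the third) for brevity.

```python
from dataclasses import dataclass


@dataclass
class Op:
    age: int              # lower value = older in program order (hypothetical field)
    name: str
    sources: frozenset    # names of the registers this op reads


def pick_oldest_ready(waiting, ready_regs):
    """Return the oldest op whose source operands are all ready, or None."""
    candidates = [op for op in waiting if op.sources <= ready_regs]
    return min(candidates, key=lambda op: op.age, default=None)


# Three ops waiting in one scheduler; only r3 and r4 have been produced so far.
waiting = [
    Op(age=0, name="add", sources=frozenset({"r1", "r2"})),
    Op(age=1, name="sub", sources=frozenset({"r3"})),
    Op(age=2, name="xor", sources=frozenset({"r4"})),
]
ready = {"r3", "r4"}

# "add" is oldest but still waiting on r1/r2, so "sub" is dispatched first.
first = pick_oldest_ready(waiting, ready)
```

In this toy model the oldest op stalls without blocking younger ops whose operands have arrived, which is the essence of out-of-order dispatch.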
The first substantial change in Barcelona’s integer units is that integer division now has variable latency, depending on the operands. IDIV instructions are handled through an iterative algorithm. In the K8, each IDIV went through a fixed number of iterations, regardless of how many were required to reach the final result: 32-bit divides took 42 cycles, while a full 64-bit divide required 74 cycles. In contrast, Barcelona iterates only the minimum number of times needed to produce an accurate answer. The latency for Barcelona is generally 23 cycles, plus the number of significant bits in the absolute value of the dividend (unsigned divides are roughly 10 cycles faster). Additionally, the third ALU pipeline now handles the new LZCNT and POPCNT instructions.
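To make the latency rule concrete, here is a rough sketch that applies the figures quoted above. The function and constant names are made up for illustration, and the result is only the rule-of-thumb estimate from the text, not a cycle-accurate model of the hardware.

```python
# Fixed K8 IDIV latencies quoted in the text, keyed by operand width in bits.
K8_IDIV_CYCLES = {32: 42, 64: 74}

# Rule-of-thumb constants for Barcelona, taken from the text.
BARCELONA_BASE_CYCLES = 23
UNSIGNED_SAVING_CYCLES = 10   # "roughly 10 cycles faster", so approximate


def k8_idiv_latency(width_bits):
    """K8: fixed latency that depends only on the operand width."""
    return K8_IDIV_CYCLES[width_bits]


def barcelona_div_latency(dividend, signed=True):
    """Barcelona estimate: 23 cycles plus significant bits of |dividend|."""
    significant_bits = abs(dividend).bit_length()
    cycles = BARCELONA_BASE_CYCLES + significant_bits
    if not signed:
        cycles -= UNSIGNED_SAVING_CYCLES
    return cycles


# A dividend of 1000 has 10 significant bits (512 <= 1000 < 1024).
signed_estimate = barcelona_div_latency(1000)                   # 23 + 10 = 33
unsigned_estimate = barcelona_div_latency(1000, signed=False)   # 33 - 10 = 23
k8_cycles = k8_idiv_latency(32)                                 # fixed at 42
```

For a small dividend such as 1000, the estimate comes out to 33 cycles for a signed divide against the K8's fixed 42 cycles for any 32-bit divide.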
The FPUs in Barcelona, on the other hand, did change somewhat. They were widened to 128 bits so that 128-bit SSE instructions can execute in a single pass (previously they went through the 64-bit FPU twice, just as in Intel’s Pentium M). Similarly, the load-store units and the FMISC unit now load 128-bit wide data to improve SSE performance.
One important difference between AMD’s and Intel’s microarchitectures is that AMD keeps its address generation units (AGUs) separate from the load-store units (LSUs). This is because, as we noted earlier, AMD’s micro-ops can contain a load, an operation and a store, so there must be at least as many AGUs as ALUs. In contrast, Intel’s uops totally decouple calculations from memory accesses, so the AGUs are integrated into the load and store pipelines. The difference between Intel’s uops and AMD’s micro-ops results in the different AGU arrangements.
Another distinction between the Barcelona and Core microarchitectures is that AMD’s ALUs are symmetric and can execute almost any integer instruction, while the ALUs in Core 2 are not symmetric and are slightly more restrictive. Each of the lanes must be nearly identical for AMD’s distributed schedulers and instruction grouping to work optimally. This is a clear architectural trade-off: performance and decreased control complexity on one side, versus power and increased execution complexity on the other. Replicating three full-featured ALUs uses more die area and power, but provides higher performance in certain corner cases and enables a simpler design for the ROB and schedulers.