Integer Execution Units
The Istanbul microarchitecture was conceptually organized around a set of three lanes. Each lane had a dedicated scheduler, tied to a largely identical group of integer functional units. Bulldozer abandons the idea of lanes with dedicated schedulers in favor of a more flexible, unified 40-entry scheduler that can issue to any of the execution units. As shown in Figure 5, Bulldozer features a substantially different assortment of functional units. In Istanbul, each of the three lanes had a full ALU and an AGU to simplify scheduling by making each lane identical. With Bulldozer's unified scheduler, this is no longer necessary. Four ALUs and AGUs would be very power and area inefficient, while providing little additional benefit. Thus, to improve aggregate throughput by decreasing core size, AMD reduced the number of integer execution units.
Bulldozer’s integer execution units can be thought of as two mostly identical groups (0 and 1), each composed of an AGU and an ALU. The two AGUs (AGU 0 and AGU 1) are identical and perform the address calculations that feed the load-store unit and cache hierarchy. The integer execution units (EX 0 and EX 1) are each a fully featured ALU capable of executing the vast majority of integer operations. There are some slight differences between the two pipelines, though. EX 0 is responsible for POPCNT and LZCNT operations and also contains a variable-latency, unpipelined integer divider. EX 1 has a pipelined integer multiplier and also handles any fused branches (which must be placed into the last slot of a dispatch group).
Figure 5 – Bulldozer Execution Units and Comparison
Shared Floating Point Execution Units
While Bulldozer can execute the new AVX instructions, all of its execution units are 128 bits wide. Thus it is extremely likely that any 256-bit instruction is decoded into two macro-ops and executed as two uops (since a macro-op cannot contain two execution uops). This would be consistent with AMD’s conservative approach to embracing instruction set extensions from Intel.
For example, while Intel introduced SSE2 with the P4 in 2000, it took until the K8 in 2003 for AMD to support the new instructions. Similarly, Intel moved to 128-bit execution units in 2006, and it wasn’t until Barcelona in 2007 that AMD caught up. The rationale for AMD’s slower uptake is straightforward: new instructions are not immediately put to use by most software vendors, so even though Sandy Bridge will arrive in late 2010 or early 2011, most software will not use AVX for some time. Consequently, it does not make sense for AMD to dedicate the die area, design effort and power until the software really does catch up.
Like the integer cores, Bulldozer’s floating point cluster does away with the notion of dedicated schedulers and lanes and uses the more flexible unified approach. The four pipelines (P0-P3) are fed from a shared 60-entry scheduler. This is roughly 40% larger than Istanbul’s reservation stations (42 entries) and two-thirds larger than Barcelona’s (36 entries). The heart of the cluster is a pair of 128-bit wide floating point multiply-accumulate (FMAC) units on P0 and P1. Each FMAC unit also handles division and square root operations with variable latency.
The two FMAC units also execute standalone FADD and FMUL instructions, although running unfused operations through an FMAC obviously leaves performance on the table compared to a fused multiply-accumulate. The first pipeline includes a 128-bit integer multiply-accumulate unit, primarily used for instructions in AMD’s XOP extension. Additionally, the hardware for converting between integer and floating point data types is tied to pipeline 0. Pipeline 1 also serves double duty and contains the crossbar hardware (XBAR) used for permutations, shifts, shuffles, packing and unpacking.
Another question regarding Bulldozer is how 256-bit AVX instructions are handled by the execution units. One option is to treat each half as a totally independent macro-op, as the K8 did for 128-bit SSE, and let the schedulers sort everything out. However, it is possible that Bulldozer’s two symmetric FMAC units could be ganged together to execute both halves of an AVX instruction simultaneously to reduce latency.
The other half of the floating point cluster’s execution units actually has little to do with floating point data at all. Bulldozer has a pair of largely symmetric 128-bit integer SIMD ALUs (P2 and P3) that execute arithmetic and logical operations. P3 also includes the store unit (STO) for the floating point cluster (this was called the FMISC unit in Istanbul). Despite the name, it does not actually perform stores; rather, it passes the store data to the load-store unit, thus acting as a conduit to the actual store pipeline.
In a similar fashion, there is a small floating point load buffer (not shown above) which acts as an analogous conduit for loads between the load-store units and the FP cluster. The FP cluster can receive two 128-bit loads per cycle, and one of the purposes of the FP load buffer is to smooth out the bandwidth demands of the two cores. For example, if the two cores simultaneously send data for four 128-bit loads to the FP cluster, the buffer would release 256 bits of data in the first cycle and the remaining 256 bits in the next cycle.