AMD’s Bulldozer Microarchitecture


Integer Cores

Like all previous designs from AMD (and in contrast to Intel), Bulldozer separates the integer and floating point schedulers, register files and execution units. As proof that Sutherland’s Wheel of Reincarnation applies to more than just graphics, Bulldozer employs a co-processor model for floating point and SIMD execution that is shared by both cores in a module – reminiscent of the days when x87 floating point co-processors would reside on a separate chip altogether. One advantage of this more formalized separation is that the floating point cluster might eventually be replaced or supplemented by a GPU shader array, an evolution of Bulldozer to fit the ‘Fusion’ mold. The co-processor model is a substantial change, yet it is also familiar from previous AMD CPUs; the resemblance is clear from Figure 4 below.

Each cycle, a group of up to four macro-ops is dispatched to one of the dedicated cores. The macro-ops are allocated into the 128-entry retirement queue, which is responsible for maintaining the bookkeeping state for each macro-op in flight. Memory operations must also allocate an entry in the appropriate load or store queue, to maintain x86 memory consistency. Within each dispatch group, any integer and memory macro-ops are renamed into the 96-entry physical register file (which contains both architectural and speculative registers). However, any FP or SIMD macro-ops will be sent to the floating point cluster to continue execution, although their retirement status is tracked in the integer core. Note that an FP or SIMD memory access will dispatch both a memory macro-operation to the integer core and an execution macro-operation to the FP cluster.


Figure 4 – Bulldozer Out-of-Order Engine and Comparison

Bulldozer’s physical register file (PRF) based renaming is a fundamental change to the out-of-order microarchitecture. In some previous out-of-order designs (e.g. integer renaming in K7 derivatives up to Istanbul, P6 derivatives up to Nehalem) state was tracked in two separate structures – a re-order buffer (ROB) or future file and an architectural register file. The ROB is implemented as a single structure that contains both the data value for each renamed register and also status information for each in-flight macro-op. The architectural register file contains the data value for each architectural register (e.g. RCX or XMM0). Once a macro-op retires, its renamed register value is written into the appropriate architectural register. So in a ROB based design, the state is held in two structures, one for speculative state and one for architectural state.

In microarchitectures using a physical register file (such as Bulldozer, IBM’s Power4, Power5 and Power7, DEC’s 21264 and the Pentium 4), there are also two structures to hold the state, but they are divided based on function. One structure is the physical register file, which holds all the data values – both for speculative renamed registers and architectural registers. The other structure is what AMD calls the retirement queue, which holds a pointer to the appropriate entry in the PRF and also status information about each in-flight macro-op and the associated register (e.g. is the register speculative or architectural). This arrangement enables lower power retirement and rollback, since both are done by manipulating the retirement queue pointers that map into the PRF, rather than by copying register values. This is the fundamental difference between the two approaches: ROBs keep data with the status information, while a PRF separates the data from the status information. In Bulldozer, up to 4 macro-ops can be retired each cycle, matching the throughput in the rest of the CPU. Branch mispredicts are handled by a flash clear, which likely reverts the retirement queue to a prior known good state.
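The pointer-based retirement and rollback described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not AMD's implementation: the class `PRFRenamer`, its two mapping tables and its free list are hypothetical names, the result value is written at rename time to keep the example short (real hardware writes it at execution), and the `flush` method is one plausible reading of the "flash clear" rather than a disclosed mechanism.

```python
from collections import deque


class PRFRenamer:
    """Toy model of PRF-based renaming (illustrative, not Bulldozer's design)."""

    def __init__(self, arch_regs, num_phys):
        self.prf = [0] * num_phys  # the only place data values live
        # Initially, architectural register i maps to physical register i.
        self.spec_map = {r: i for i, r in enumerate(arch_regs)}  # speculative view
        self.arch_map = dict(self.spec_map)                      # committed view
        self.free = deque(range(len(arch_regs), num_phys))       # unused physical regs
        # Each entry holds pointers and status only: (arch reg, new phys, old phys).
        self.retire_queue = deque()

    def rename(self, dest, value):
        """Allocate a fresh physical register for a macro-op writing `dest`."""
        new = self.free.popleft()
        self.prf[new] = value  # simplification: result written immediately
        self.retire_queue.append((dest, new, self.spec_map[dest]))
        self.spec_map[dest] = new
        return new

    def retire(self):
        """Commit the oldest macro-op by moving a pointer; no data is copied."""
        dest, new, old = self.retire_queue.popleft()
        self.arch_map[dest] = new
        self.free.append(old)  # the previous committed copy is now dead

    def flush(self):
        """Assumed rollback: discard speculative mappings, keep committed ones."""
        while self.retire_queue:
            _, new, _ = self.retire_queue.pop()
            self.free.append(new)
        self.spec_map = dict(self.arch_map)

    def read_committed(self, reg):
        return self.prf[self.arch_map[reg]]
```

Note that neither `retire` nor `flush` touches `prf`; only the small mapping tables and the free list change. In a ROB based design, the equivalent of `retire` would copy the data value from the ROB entry into the architectural register file.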

Once renamed, macro-operations are placed into the 40-entry unified scheduler, where they are held until all the necessary resources, such as source operands, are available. When uops are ready, the scheduler will issue up to four of the oldest uops to the appropriate execution units.
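As a rough illustration of oldest-first selection, the sketch below picks up to four ready uops per cycle from a unified scheduler. The `Uop` type and `issue` function are hypothetical, and the model deliberately ignores which execution unit each uop needs, so it shows only the age-ordered, readiness-gated selection and not Bulldozer's actual picker logic.

```python
from typing import NamedTuple


class Uop(NamedTuple):
    seq: int              # program-order age; lower means older
    sources: frozenset    # physical registers this uop reads


def issue(scheduler, ready_regs, width=4):
    """Remove and return up to `width` of the oldest uops whose sources are ready."""
    ready = [u for u in scheduler if u.sources <= ready_regs]
    picked = sorted(ready, key=lambda u: u.seq)[:width]
    for u in picked:
        scheduler.remove(u)
    return picked


# Usage: six uops are waiting; physical registers 1 and 2 hold valid results.
scheduler = [
    Uop(5, frozenset({1})),
    Uop(1, frozenset({9})),      # oldest, but still waiting on register 9
    Uop(2, frozenset()),
    Uop(3, frozenset({1, 2})),
    Uop(4, frozenset({2})),
    Uop(6, frozenset({1})),
]
issued = issue(scheduler, {1, 2})
```

Here the oldest uop stays in the scheduler because its operand is not ready, while four younger, ready uops issue ahead of it. That out-of-order issue is exactly what the scheduler exists to provide.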

Shared Floating Point Cluster

The floating point cluster for Bulldozer has its own four-wide out-of-order execution facilities including renaming, scheduling and register files. As previously noted though, it relies upon the integer cores for handling any loads and stores and also retiring macro-ops.

When a dispatch group is sent to a core, any FP or SIMD macro-ops are allocated an entry in the retirement queue, just like the integer macro-ops. However, the floating point or SIMD macro-ops are then sent to the FP cluster for renaming, scheduling and execution.

The incoming FP or SIMD macro-operations follow a very similar flow to the integer side. The FP and SIMD macro-ops are renamed into a physical register file (a similar design to the integer PRF). The PRF is dynamically shared between the two cores, and contains 160 entries. AMD has not disclosed the width of these PRF entries. If the PRF entries are 128 bits wide, then each 256-bit instruction will probably decode as two macro-ops and use two entries in the PRF, scheduler and the retirement queue, reducing the effective instruction window. This seems to be the most probable scenario, although there is a chance that the FP cluster has 256-bit registers.
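To make the window reduction concrete, the arithmetic can be sketched as follows. The entry counts come from this article (a 128-entry retirement queue per core and a 60-entry FP scheduler); the two-macro-ops-per-instruction split is the speculative 128-bit scenario described above, not a disclosed fact, and `effective_window` is a hypothetical helper name.

```python
def effective_window(entries, macro_ops_per_instruction):
    """Number of instructions a structure can hold if each one
    occupies `macro_ops_per_instruction` entries."""
    return entries // macro_ops_per_instruction


# Entry counts as stated in the article.
structures = {"retirement queue": 128, "FP scheduler": 60}

# Assumed scenario: a 256-bit instruction splits into two 128-bit macro-ops.
for name, entries in structures.items():
    whole = effective_window(entries, 1)
    split = effective_window(entries, 2)
    print(f"{name}: {whole} unsplit instructions vs {split} split 256-bit instructions")
```

Under that assumption, a stream consisting purely of 256-bit instructions would see each structure's capacity halved. The PRF is left out of the table because some of its 160 entries hold architectural state for both cores, so its speculative capacity is not a simple division.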

Once macro-ops are renamed, they are moved into the scheduler. The unified FP scheduler holds up to 60 macro-operations. When the inputs for a uop are ready (either present in the scheduler or imminently available through the forwarding network), the uop is issued to the appropriate execution pipeline, which is described in the next section. Once a macro-operation is completed, it will signal to the appropriate core for retirement.

The first part of the floating point pipeline is driven by the dispatch logic, which is multithreaded, with switching at a single cycle granularity. Once macro-ops have been placed into the scheduler, there is no longer a distinction between macro-ops from the two cores. Thus the scheduler and execution units are effectively simultaneously multithreaded.
