Shared Instruction Decode
Before delving further into decoding, it is useful to establish some terminology. Intel refers to the variable-length x86 instructions as macro-operations. These can be quite complex, with multiple memory and arithmetic operations. Intel refers to its own simpler internal, fixed-length operations as micro-ops or uops. AMD’s terminology is subtly (and confusingly) different. In AMD parlance, x86 instructions are referred to as AMD64 instructions – again, variable-length and potentially quite complex. An AMD macro-operation is an internal, fixed-length operation that may include both an arithmetic operation and a memory operation (e.g. a single macro-op may be a read-modify-write). In some cases, AMD also refers to these macro-ops as complex ops or cops. An AMD uop is an even simpler, fixed-length operation that performs a single task: arithmetic, load or store – but only one, never a combination. So in theory, an AMD macro-op could translate into 3 AMD uops. The best way to think about AMD’s arrangement is that macro-ops are the unit in which the out-of-order microarchitecture tracks work, while uops are the unit executed by the execution units. For the rest of this article, we will endeavor to use AMD’s terminology.
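As a concrete illustration of the terminology above, the sketch below models a read-modify-write x86 instruction becoming one AMD macro-op that cracks into three uops. The class names and the cracking rule are purely illustrative assumptions, not AMD's actual internal representation.

```python
# Illustrative model (not AMD's real encoding) of how one AMD macro-op
# for a read-modify-write instruction like "add [mem], reg" breaks down
# into up to three simple uops: a load, an ALU op, and a store.
from dataclasses import dataclass, field

@dataclass
class MacroOp:
    """The unit tracked by the out-of-order machinery."""
    mnemonic: str
    uops: list = field(default_factory=list)  # the units actually executed

def crack(macro: MacroOp) -> MacroOp:
    # A read-modify-write macro-op needs load + arithmetic + store.
    if macro.mnemonic == "add [mem], reg":
        macro.uops = ["load", "alu", "store"]
    else:
        macro.uops = ["alu"]                  # simple register-only op
    return macro

rmw = crack(MacroOp("add [mem], reg"))
print(rmw.uops)  # ['load', 'alu', 'store']
```

This makes the tracking/execution split explicit: the scheduler sees one `MacroOp`, while the execution units see its three constituent uops.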
As with Istanbul, Bulldozer classifies instructions into three categories: FastPath Singles, which emit a single macro-op; FastPath Doubles, which emit two macro-ops; and Microcode or VectorPath (everything else). Since AMD’s macro-ops are fairly powerful and complex, most instructions decode to a single macro-op. However, that is likely not the case for 256-bit AVX instructions, since Bulldozer’s execution hardware is 128 bits wide. A 256-bit AVX instruction could generate two 128-bit loads and two 128-bit floating point operations. Facing a similar situation with 128-bit SSE on the K8, AMD took the approach of mapping each 128-bit SSE instruction to two macro-ops. It is a good guess (although not certain) that Bulldozer likewise decodes 256-bit AVX instructions into two macro-ops, which has implications for the resources used in the pipeline from renaming to execution.
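A minimal sketch of the three-way classification, assuming (as argued above) that 256-bit AVX instructions decode as FastPath Doubles. The instruction names and the exact category assignments are illustrative guesses, not AMD's published decode tables.

```python
# Hypothetical decode classifier; the three categories follow the text,
# but which instructions land in which path is an assumption.
def classify(inst: str) -> tuple[str, int]:
    """Return (decode path, number of macro-ops emitted)."""
    if inst.startswith("vaddps ymm"):        # 256-bit AVX: split into two
        return ("FastPath Double", 2)        # 128-bit macro-ops
    if inst.startswith("rep "):              # complex string ops go to the
        return ("VectorPath", 3)             # microcode ROM (3+ macro-ops)
    return ("FastPath Single", 1)            # the common case

print(classify("add eax, ebx"))              # ('FastPath Single', 1)
print(classify("vaddps ymm0, ymm1, ymm2"))   # ('FastPath Double', 2)
```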
Figure 3 – Bulldozer Decode and Comparison
The decode phase for Bulldozer, shown in Figure 3, has been improved, but the changes are far less dramatic than for fetching. Decoding begins by inspecting the first two 16B windows in the IBB for a single core. In many circumstances, instructions can be taken from both windows, but alignment, the number of loads, stores and branches, and other factors can restrict decoding to a single 16B window.
To accommodate both cores, Bulldozer’s decode stage has been widened and can decode up to 4 instructions per cycle. After examining the instruction windows, the decoders translate each x86 instruction into 1 or 2 macro-operations and place them into a queue for dispatch. Microcoded instructions (i.e. those requiring more than 2 macro-operations) are handled by the microcode ROM and probably cannot proceed in parallel with FastPath instructions. While AMD did not disclose how the microcode works, they did imply that it will at least maintain the same performance as in prior generations (i.e. emitting at least 3 macro-operations per cycle). Given that it is shared between two four-issue cores, AMD may have modestly improved it to emit 4 macro-ops per cycle.
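The per-cycle behavior described above can be modeled as follows; collapsing the various window restrictions into a single boolean flag is a deliberate simplification of the alignment and load/store/branch rules.

```python
# Toy model of one decode cycle: pull up to 4 instructions from the two
# 16B windows at the head of a core's IBB; fall back to a single window
# when a restriction (alignment, loads/stores, branches) applies.
def decode_cycle(windows, restricted=False):
    usable = windows[:1] if restricted else windows[:2]
    insts = [inst for window in usable for inst in window]
    return insts[:4]  # decode width is four instructions per cycle

print(decode_cycle([["i0", "i1", "i2"], ["i3", "i4"]]))  # ['i0', 'i1', 'i2', 'i3']
```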
In addition to having an extra decoder, Bulldozer is the first AMD CPU with branch fusion, whereby an adjacent arithmetic test or compare instruction and a conditional jump are decoded into a single macro-op. There are some restrictions on branch fusion, but they are likely to be relaxed over time. For example, Intel was the first to introduce this feature, in the Core 2 Duo, but only for 32-bit mode; subsequent iterations generalized it to a greater degree (64-bit mode and more combinations of arithmetic and control flow). Bulldozer also has the sideband stack optimizer, which renames the stack pointer and thus breaks dependencies between instructions that implicitly reference it.
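The fusion described above can be sketched as a decode-time peephole over the instruction stream. The set of fusible first instructions and the pairing rule here are assumptions, since AMD has not published the exact restrictions.

```python
# Sketch of branch fusion at decode: a compare/test followed by a
# conditional jump collapses into one macro-op. The fusible set is an
# illustrative assumption, not AMD's documented rules.
FUSIBLE_FIRST = {"cmp", "test"}

def fuse(stream):
    out, i = [], 0
    while i < len(stream):
        op = stream[i].split()[0]
        nxt = stream[i + 1].split()[0] if i + 1 < len(stream) else ""
        if op in FUSIBLE_FIRST and nxt.startswith("j"):
            out.append(stream[i] + " + " + stream[i + 1])  # one fused macro-op
            i += 2
        else:
            out.append(stream[i])                          # unfused macro-op
            i += 1
    return out

print(fuse(["cmp eax, 10", "jl loop", "add ebx, 1"]))
# ['cmp eax, 10 + jl loop', 'add ebx, 1']
```

The payoff is that the fused pair occupies one slot through the four-wide decode and dispatch path instead of two.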
Once the instructions have been decoded into macro-ops, they are placed into a queue, formed into dispatch groups of up to four macro-ops, and sent to one of the two cores.
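Grouping and steering can be sketched as below. The round-robin assignment of dispatch groups to the module's two cores is an assumption; the text only states that each group goes to one of the two cores.

```python
# Form dispatch groups of up to four macro-ops and steer each group to a
# core; the alternating (round-robin) policy is a hypothetical choice.
def dispatch(queue, group_size=4):
    groups = [queue[i:i + group_size] for i in range(0, len(queue), group_size)]
    return [(f"core{n % 2}", group) for n, group in enumerate(groups)]

print(dispatch(["m1", "m2", "m3", "m4", "m5"]))
# [('core0', ['m1', 'm2', 'm3', 'm4']), ('core1', ['m5'])]
```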