Instruction Set Extensions and uop Format
Roughly speaking, Bobcat was designed to be compatible with AMD’s Barcelona, which is a slightly awkward target. While virtualization and SSSE3 are fully supported, Barcelona and Bobcat have the SSE4A instructions, which comprise a subset of SSE4.1 as found in Intel’s Penryn plus several additional instructions for handling misaligned data accesses.
In comparison, Jaguar is much more modern and presents a cleaner software interface. Jaguar has full compatibility with SSE4.1, SSE4.2, the 256-bit AVX extensions, and 16-bit FP conversion. AVX also adds 3-operand register addressing for both 128-bit and 256-bit AVX instructions. In addition, Jaguar offers AES, carry-less multiplication (CLMUL) and bit-manipulation (BMI) instructions, along with big-endian move (MOVBE), leading-zero count and pop-count instructions. That being said, Jaguar is not fully software compatible with Bulldozer, as that is not necessary for the target markets. For example, the new FMA instructions are useful for HPC workloads, whereas Jaguar is primarily targeted at low-power consumer notebooks and consoles.
Conceptually, the instruction decoding for Jaguar and Bobcat resembles the K8, with some modifications. Jaguar and Bobcat can decode two x86 instructions per cycle. Like AMD’s earlier Barcelona and Bulldozer designs, the variable length x86 instructions are decoded into fixed length but relatively complex micro-operations (COPs) that are tracked by the control logic. The actual COPs can map to multiple µops executed by the hardware. A single COP is capable of performing a read from memory, an ALU operation, and a write to memory, but most COPs map to one or two µops.
Not only is the x86 ISA variable length, but the extensive use of prefixes makes it impossible to determine the location of the opcode ahead of time. Barcelona and K8 derivatives pre-decoded instructions, marked instruction lengths, and cached this information in the L1I and L2. This approach improves frequency, but also adds complexity by requiring additional storage, a separate slower pre-decode pipeline, and checking logic. Jaguar and Bobcat are narrower cores, with lower frequency targets than the K8 and derivatives, which simplifies decoding considerably. So AMD’s architects chose to eliminate marker bits and handle the variable-length instructions directly, eliminating complexity in the front-end.
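To see why prefixes complicate finding the opcode, here is a minimal Python sketch that scans past prefix bytes in a 64-bit-mode instruction. It is a toy model and not a description of AMD's hardware: the function name opcode_offset is made up for illustration, and it handles only legacy prefixes and REX, ignoring VEX and XOP encodings.

```python
# Toy model: locate the first opcode byte of a 64-bit-mode x86 instruction.
# Assumption: only legacy prefixes and a single REX prefix are considered;
# VEX/XOP-encoded instructions are out of scope for this sketch.

LEGACY_PREFIXES = {
    0xF0, 0xF2, 0xF3,        # LOCK, REPNE, REP
    0x2E, 0x36, 0x3E, 0x26,  # CS, SS, DS, ES segment overrides
    0x64, 0x65,              # FS, GS segment overrides
    0x66, 0x67,              # operand-size and address-size overrides
}


def opcode_offset(code: bytes) -> int:
    """Return the index of the first non-prefix byte in `code`."""
    i = 0
    # Any number of legacy prefixes may precede the opcode.
    while i < len(code) and code[i] in LEGACY_PREFIXES:
        i += 1
    # In 64-bit mode a REX prefix (0x40-0x4F) may follow the legacy prefixes.
    if i < len(code) and 0x40 <= code[i] <= 0x4F:
        i += 1
    return i


if __name__ == "__main__":
    for text in ("01c8", "4801c8", "660f58c1", "f3480fb8c1"):
        print(text, "-> opcode at byte", opcode_offset(bytes.fromhex(text)))
```

The four sample encodings are add eax, ecx; add rax, rcx; addpd xmm0, xmm1; and popcnt rax, rcx. The opcode lands at a different byte offset in each, so a decoder has to examine the leading bytes before it knows where the instruction proper begins.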
Instruction decoding in Bobcat and Jaguar starts in the fourth cycle and overlaps with the latter part of the fetch pipeline. Decoding was four stages in Bobcat, but the Jaguar team added an extra stage to achieve higher frequencies.
The decoders access the oldest 16B IBB entry, and the first 6B of the subsequent IBB entry. The first two stages handle pre-decoding and instruction length decoding. Bobcat uses a single stage for the actual instruction to COP decoding, while Jaguar uses two to boost frequency. Pairs of instructions are typically decoded from the oldest IBB entry. However, in some cases the first instruction will end the oldest 16B IBB entry; in that scenario, reading the extra 6B from the next entry eliminates most pipeline bubbles.
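The effect of the extra 6B can be illustrated with a small Python sketch. The 16B entry size, the 6B peek and the two-per-cycle limit come from the text; the rule that an instruction is decodable only when all of its bytes fall inside that 22B window is an assumption made for illustration, and the function name decodable is hypothetical.

```python
ENTRY_BYTES = 16  # size of one IBB entry, from the text
PEEK_BYTES = 6    # bytes visible from the following entry, from the text


def decodable(start, lengths, peek=PEEK_BYTES):
    """Count how many of the next (at most two) instructions fit the window.

    start   -- byte offset of the first instruction inside the oldest entry
    lengths -- instruction lengths in bytes, oldest first
    peek    -- bytes readable from the next entry (0 models having no peek)

    Assumption: an instruction decodes this cycle only if every one of its
    bytes lies within the oldest entry plus the peek bytes.
    """
    window_end = ENTRY_BYTES + peek
    pos = start
    count = 0
    for length in lengths[:2]:  # the decoders handle two instructions per cycle
        if pos + length > window_end:
            break
        pos += length
        count += 1
    return count


if __name__ == "__main__":
    # A 4-byte instruction ends the oldest entry; a 5-byte one follows it.
    print(decodable(12, [4, 5]))          # with the 6B peek
    print(decodable(12, [4, 5], peek=0))  # without it
```

In this model, the second instruction in the example is only reachable because of the peek; without it the decoders would handle a single instruction that cycle.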
Like the K8 and derivatives, the vast majority of x86 instructions decode into a single COP (sometimes described as a fastpath single) given the powerful format. However, some instructions are cracked into two COPs (a fastpath double). One of the biggest differences between Jaguar and Bobcat is the classes of instructions that are fastpath singles versus doubles.
Both Jaguar and Bobcat can execute vector instructions that are wider than the underlying vector execution units. Bobcat was designed with 64-bit vector units, but supports a variety of 128-bit SSE instructions. Jaguar is a significant step forward and has 128-bit vector units, but supports 256-bit AVX.
Mirroring AMD’s approach to SSE2 with the early versions of the K8, both designs handle this mismatch by cracking the longer vector instructions into two COPs. The key difference is that Jaguar offers full performance for SSE instructions, which are the de facto standard in x86-64.
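As a rough illustration of cracking, the Python sketch below splits a vector operation into pieces that match the width of the execution units. The widths are taken from the text; the function name crack and the representation of each piece as a bit range are assumptions for illustration only, and the text describes only the two-piece case.

```python
def crack(op_bits, unit_bits):
    """Split an op_bits-wide vector operation into unit_bits-wide pieces.

    Returns a list of (low_bit, high_bit) ranges, one per piece. Each piece
    stands in for one COP in this toy model.
    """
    if op_bits <= unit_bits:
        return [(0, op_bits)]  # fits the hardware: a single piece
    return [(lo, lo + unit_bits) for lo in range(0, op_bits, unit_bits)]


if __name__ == "__main__":
    print("Bobcat, 128-bit SSE:", crack(128, 64))
    print("Jaguar, 128-bit SSE:", crack(128, 128))
    print("Jaguar, 256-bit AVX:", crack(256, 128))
```

Under these assumptions a 128-bit SSE operation becomes two pieces on Bobcat's 64-bit units and one on Jaguar's 128-bit units, while a 256-bit AVX operation becomes two pieces on Jaguar.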
Microcoded instructions are fairly rare in most programs – a typical case is complex system instructions for power management and state management. The other use of microcode is handling classes of instructions where dedicated hardware is simply not available. For example, some of the SSE4.2 string instructions and cross-lane AVX instructions are microcoded in Jaguar.
The decoding stages for Jaguar and Bobcat also incorporate a side-band stack optimizer, which AMD first introduced with Barcelona. Instructions such as CALL, RET, PUSH, and POP implicitly modify the stack pointer, creating long dependency chains. The stack optimizer splits the stack pointer into two dedicated registers, a base register and a delta register. The stack pointer manipulations are all performed on the delta register by a dedicated ALU in the front-end and the two registers are synchronized as needed. This eliminates µops from the out-of-order machine and improves IPC.
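The base and delta split can be sketched in a few lines of Python. This is a behavioral toy, not AMD's implementation: the class and method names are hypothetical, and the synchronization policy (fold the delta into the base only when an instruction explicitly reads the stack pointer) is an assumption, since the text says only that the registers are synchronized as needed.

```python
class StackOptimizer:
    """Toy model of a stack pointer split into a base and a delta register."""

    def __init__(self, rsp):
        self.base = rsp     # value visible to the out-of-order machine
        self.delta = 0      # offset tracked by the front-end ALU
        self.sync_uops = 0  # synchronizations that had to be injected

    def push(self, size=8):
        """Model PUSH: adjust only the delta and return the store address."""
        self.delta -= size
        return self.base + self.delta

    def pop(self, size=8):
        """Model POP: return the load address, then adjust only the delta."""
        addr = self.base + self.delta
        self.delta += size
        return addr

    def read_rsp(self):
        """Model an explicit read of the stack pointer (assumed sync point)."""
        if self.delta != 0:
            self.base += self.delta
            self.delta = 0
            self.sync_uops += 1
        return self.base


if __name__ == "__main__":
    s = StackOptimizer(0x1000)
    s.push()
    s.push()
    s.pop()
    print(hex(s.read_rsp()), "after", s.sync_uops, "synchronization")
```

In this model, two pushes and a pop cost no stack-pointer updates in the out-of-order machine; a single synchronization occurs only when the architectural value is explicitly read.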
The last decoding pipeline stage for both Jaguar and Bobcat packs two COPs, assigns them to lanes based on execution unit restrictions, and writes them into an instruction queue. The instruction queue can hold ~6-8 entries and absorbs any delays due to stalls later in the pipeline.