Inside Barcelona: AMD’s Next Generation

The Decode Phase

x86 instructions are fairly complicated to decode: they are variable length and, because of prefixes, the position of the opcode cannot be known ahead of time. To simplify decoding, the K8 and Barcelona both use pre-decode information that marks the end of each instruction (and hence the start of the next). However, the first time an instruction is fetched into the cache there is no pre-decode information. The instruction cache contains a pre-decoder which scans 4B of the instruction stream each cycle and inserts pre-decode information, which is stored in the ECC bits of the L1I, L2 and L3 caches along with each line of instructions. Since there are almost no writes to the instruction stream, parity and refetching from memory are sufficient protection from errors in the instruction cache, and ECC is not really required for code. As noted previously, this pre-decode information also includes branch selection and other related information.
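To make the idea of end-of-instruction marks concrete, here is a minimal sketch in Python. It does not model real x86 encoding or AMD's actual pre-decode format: the toy encoding (the low two bits of an instruction's first byte give its length minus one) and the names `toy_length`, `predecode` and `instruction_starts` are assumptions invented for illustration. The point is only that a slow sequential pass can be done once, after which instruction boundaries can be read off the stored marks.

```python
def toy_length(first_byte: int) -> int:
    """Hypothetical toy encoding (not x86): low 2 bits hold length - 1."""
    return (first_byte & 0b11) + 1


def predecode(code: bytes) -> list[bool]:
    """Slow sequential pass: mark the last byte of every instruction."""
    end_bits = [False] * len(code)
    pos = 0
    while pos < len(code):
        last = pos + toy_length(code[pos]) - 1
        if last >= len(code):
            break  # instruction runs past this line; leave it unmarked
        end_bits[last] = True
        pos = last + 1
    return end_bits


def instruction_starts(end_bits: list[bool]) -> list[int]:
    """Fast path: a start is byte 0 or any byte right after an end mark.

    Assumes the line begins on an instruction boundary.
    """
    starts = [0] if end_bits else []
    starts += [i + 1 for i, end in enumerate(end_bits[:-1]) if end]
    return starts


# Three toy instructions of 3, 1 and 2 bytes.
line = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xCC])
marks = predecode(line)
print(marks)
print(instruction_starts(marks))
```

Once the marks are stored alongside the cache line, later fetches of the same line can locate every instruction start without re-walking the bytes one instruction at a time.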

Like the Pentium Pro, the K7/8 has an internal instruction set which is fairly RISC-like, composed of micro-ops. Each micro-op is fairly complex, and can include one load, a computation and a store. Any instruction which decodes into 3 or more micro-ops (called a VectorPath instruction) is sent from the pick buffer to the microcode engine. For example, any string manipulation instruction is likely to be micro-coded. The microcode unit can emit 3 micro-ops a cycle until it has fully decoded the x86 instruction. While the microcode engine is decoding, the regular decoders will idle; the two cannot operate simultaneously. The vast majority of x86 instructions decode into 1-2 micro-ops and are referred to as DirectPath instructions (singles or doubles).
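The classification rule above can be written down in a few lines of Python. This is a simplified sketch, not a model of the real decoders: the function names `classify` and `microcode_cycles` are made up for illustration, and the cycle count assumes the microcode unit emits a full 3 micro-ops every cycle with no other stalls.

```python
def classify(micro_ops: int) -> str:
    """Map a micro-op count to the decode path described in the text."""
    if micro_ops < 1:
        raise ValueError("an instruction decodes into at least one micro-op")
    if micro_ops == 1:
        return "DirectPath single"
    if micro_ops == 2:
        return "DirectPath double"
    return "VectorPath"  # 3 or more micro-ops go to the microcode engine


def microcode_cycles(micro_ops: int, per_cycle: int = 3) -> int:
    """Cycles the microcode unit needs, assuming it emits per_cycle micro-ops
    every cycle (ceiling division). The regular decoders idle for this long."""
    return -(-micro_ops // per_cycle)


for n in (1, 2, 3, 7):
    print(n, classify(n))
print(microcode_cycles(7))
```

Under these assumptions, a hypothetical 7 micro-op instruction would occupy the microcode engine, and stall the regular decoders, for 3 cycles.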

In Barcelona, 128 bit SSE computations now decode into a single micro-op, rather than two; this makes the rest of the out-of-order machinery, such as the re-order buffer and the reservation stations, more effective. The same goes for integer and FP conversions, and for 128 bit load instructions, which are needed to complement the new SIMD capabilities. Note that 128 bit stores still create 2 micro-ops. Another tweak that AMD added is support for unaligned SSE memory accesses, which helps fetch instructions more efficiently by allowing code to be packed more densely.
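As a rough illustration of what the micro-op counts above mean for the out-of-order machinery, the following Python sketch tallies micro-ops for a small instruction mix. The per-instruction counts are taken from the paragraph above (with the K8 store count inferred from the word "still"); the table name `UOPS`, the function `micro_ops` and the example loop body are hypothetical.

```python
# Micro-ops per 128 bit instruction, as described in the text.
UOPS = {
    "K8":        {"sse_compute": 2, "load128": 2, "store128": 2},
    "Barcelona": {"sse_compute": 1, "load128": 1, "store128": 2},
}


def micro_ops(core: str, mix: dict[str, int]) -> int:
    """Total micro-ops a core generates for an instruction mix."""
    return sum(UOPS[core][kind] * count for kind, count in mix.items())


# Hypothetical loop body: two 128 bit loads, one computation, one store.
loop_body = {"load128": 2, "sse_compute": 1, "store128": 1}

for core in UOPS:
    print(core, micro_ops(core, loop_body))
```

For this made-up mix the count drops from 8 micro-ops to 5, so the same re-order buffer and reservation station entries cover more x86 instructions.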

At some point during the decode stages, instructions are passed through a new piece of hardware in Barcelona, the sideband stack optimizer. The x86 instruction set supports stacks in hardware, and can directly manipulate the stack of each thread using the PUSH, POP, CALL and RET instructions. These instructions modify the stack pointer (ESP), which in the K8 would generate a micro-op; worse yet, these instructions usually come in long dependent chains, which is a pain for the out-of-order machine.

AMD introduced a sideband stack optimizer to remove these stack manipulations from the instruction stream, similar to the dedicated stack engine in the Pentium M. Both MPUs use two registers, ESPO and ESPD (this is Intel’s terminology). ESPO is the original value of the stack pointer and is held in a register in the out-of-order machine, while ESPD, the delta register, tracks changes made to ESP and is in the front-end. Since ESP is an architectural register, a special micro-op is provided to recover ESP from ESPO and ESPD, although the use of this ‘fix up’ operation is minimized in Barcelona. When a stack-modifying instruction is detected, its stack pointer update is removed and resolved by a dedicated ALU which modifies ESPD. This means that many stack operations can be processed in parallel, and frees up the reservation stations, re-order buffer and regular ALUs for other work. The benefits of this technique are highly workload dependent, but AMD and Intel agree that usually about 5% of the micro-ops can be eliminated.
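The ESPO/ESPD split can be sketched as a small Python model. This is an illustration under stated assumptions, not a description of the real hardware: the class `StackOptimizer`, its method names and its counters are invented here, the deltas assume 32 bit operand size (4 bytes per PUSH, POP, CALL or RET), and the model does not say when a real core decides a fix-up is needed.

```python
# ESP change per instruction, assuming 32 bit operand size.
STACK_DELTA = {"PUSH": -4, "POP": +4, "CALL": -4, "RET": +4}
MASK32 = 0xFFFFFFFF


class StackOptimizer:
    """Toy model: ESPO lives in the out-of-order core, ESPD in the front-end."""

    def __init__(self, esp: int) -> None:
        self.espo = esp      # original stack pointer value
        self.espd = 0        # delta accumulated by the front-end
        self.eliminated = 0  # ESP-update micro-ops that were never issued
        self.fixups = 0      # 'fix up' micro-ops that had to be issued

    def track(self, mnemonic: str) -> None:
        """Fold a stack instruction's ESP update into ESPD."""
        self.espd += STACK_DELTA[mnemonic]
        self.eliminated += 1

    def esp(self) -> int:
        """Architectural ESP implied by the two registers."""
        return (self.espo + self.espd) & MASK32

    def sync(self) -> int:
        """Model of the fix-up micro-op: write ESPO + ESPD back, clear ESPD."""
        self.espo = self.esp()
        self.espd = 0
        self.fixups += 1
        return self.espo


opt = StackOptimizer(0x1000)
for insn in ["PUSH", "PUSH", "CALL", "RET", "POP"]:
    opt.track(insn)
print(opt.espd, hex(opt.esp()), opt.eliminated)
print(hex(opt.sync()), opt.fixups)
```

In this model the five stack operations leave a dependent chain of zero ESP-update micro-ops in the out-of-order core; only the single fix-up, if one is required, consumes a regular execution slot.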

The last part of decoding is the pack buffer (which is probably still 6 entries, as in the K8). However, to understand the purpose of the pack buffer, it is necessary to see in detail how out-of-order execution actually works in Barcelona and the K8, so this discussion will wait until the next page.
