As promised, Merom is far wider than the P6 or P4. Figure 4 below shows a detailed comparison of the fetch and decode sections for Intel’s microarchitectures.
Figure 4 – Front-end Microarchitecture Compared
Intel did not initially disclose the fetch bandwidth, but the average x86 instruction is ~32 bits, and it was later revealed that instruction fetch remained at 128 bits, or roughly four instructions per cycle on average. These instructions go into the pre-decode and fetch buffer, which also stores information about instruction length and decode boundaries. The pre-decode and fetch buffer is 32 bytes, as with the original P6, but feeds into an 18 entry instruction queue (IQ – not depicted above). When a loop is entirely contained within the IQ, the instruction fetcher will shut down to save power, using the IQ as a fetch cache.
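The IQ-as-loop-cache behavior can be sketched in a few lines; this is an illustrative model only (the 18-entry size comes from the text, but the detection logic and the `fetch_source` helper are invented for illustration):

```python
# Sketch of the IQ acting as a loop cache: if a hot loop's body fits in
# the 18-entry instruction queue, decode can replay instructions from
# the queue while the fetch/pre-decode stages are gated off for power.
IQ_ENTRIES = 18  # size from the text

def fetch_source(loop_body_len):
    """Decide where decode pulls instructions from for a hot loop."""
    if loop_body_len <= IQ_ENTRIES:
        return "instruction queue (fetch gated off)"
    return "instruction fetch unit"

print(fetch_source(12))   # small loop replays from the IQ
print(fetch_source(40))   # larger loop still needs the fetch unit
```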
The somewhat disappointing trace cache is gone, replaced by four x86 decoders, which pull instructions from the IQ. Each of the three simple decoders deals with x86 instructions that map to a single uop, while the complex decoder handles instructions that produce 1-4 uops (so the decode pattern is 4-1-1-1). The microcode sequencer is responsible for decoding or assisting instructions that produce more than 4 uops, as with prior designs. As with Yonah, all SSE instructions can be handled by the simple decoders, effectively producing a single uop.
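The 4-1-1-1 decode pattern can be modeled as a simple slot-allocation loop. This is a sketch under the constraints stated above (three single-uop simple decoders, one 1-4 uop complex decoder, microcode for longer instructions); the real hardware's slot-assignment and microcode handoff rules are certainly more involved:

```python
# Illustrative model of Merom's 4-1-1-1 decode pattern. Instructions
# are represented only by their uop counts; decode is strictly in order.
def decode_cycle(queue):
    """Pop up to four instructions (given as uop counts) from the IQ
    and return the list of uops issued this cycle."""
    issued = []
    complex_used = False
    simple_used = 0
    while queue:
        uops = queue[0]
        if uops > 4:
            # Microcode sequencer takes over; assume it blocks normal
            # decode for the cycle (a simplification).
            if not issued:
                issued.extend(["ucode"] * uops)
                queue.pop(0)
            break
        if uops == 1 and simple_used < 3:
            simple_used += 1            # 1-uop instruction -> simple decoder
        elif not complex_used:
            complex_used = True         # anything up to 4 uops -> complex decoder
        else:
            break                       # no suitable decoder free this cycle
        issued.extend(["uop"] * uops)
        queue.pop(0)
    return issued

stream = [1, 1, 3, 1, 1, 2]             # uops per x86 instruction
print(len(decode_cycle(stream)))        # first cycle: 1+1+3+1 = 6 uops
```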
Additionally, as Figure 4 indicates, the Merom front-end introduces a new feature referred to as macro-op fusion. Within Intel, x86 instructions are called macro-ops, while the internal instructions are called uops. Macro-op fusion lets the decoders combine two macro-ops into a single uop. Specifically, x86 compare or test instructions are fused with x86 jumps to produce a single uop, and any decoder can perform this optimization. Only one macro-op fusion can be performed each cycle, so the maximum decode bandwidth is really 4+1 x86 instructions per cycle. Macro-op fusion maps particularly well to the familiar if-then-else statement, which is a very common programming construct. Although Intel declined to comment, some estimates indicate that macro-op fusion can reduce the number of uops by 10%.
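The fusion rule described above can be sketched as a pass over a decode group. The CMP/TEST + Jcc pairing and the one-fusion-per-cycle limit come from the text; the mnemonic spellings and the `fuse` helper are illustrative only:

```python
# Sketch of macro-op fusion: a CMP or TEST immediately followed by a
# conditional jump is emitted as one fused uop, at most once per cycle.
FUSIBLE_FIRST = {"cmp", "test"}

def fuse(window):
    """Fuse at most one CMP/TEST + Jcc pair in one decode group."""
    out, i, fused = [], 0, False
    while i < len(window):
        op = window[i]
        nxt = window[i + 1] if i + 1 < len(window) else None
        if (not fused and op in FUSIBLE_FIRST
                and nxt is not None and nxt.startswith("j")):
            out.append(op + "+" + nxt)   # single fused uop
            fused = True
            i += 2
        else:
            out.append(op)
            i += 1
    return out

# Five x86 instructions decode into four uops: the 4+1 case.
print(fuse(["cmp", "jne", "add", "mov", "cmp"]))
# → ['cmp+jne', 'add', 'mov', 'cmp']
```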
The benefits of macro-op fusion are readily apparent. Reducing the number of uops improves performance in two ways. First, fewer uops are executed, which directly increases performance. Second, out-of-order execution becomes more effective, since the out-of-order scheduling window can effectively examine more of the program at once and find more instruction level parallelism (ILP). Of course, these benefits are very similar to those from uop fusion, but apply to a different class of instructions. Perhaps the most ironic part is that in some ways, macro-op and uop fusion are really making x86 MPUs internally more CISC-like, and less RISC-like.
Branch prediction occurs inside the Instruction Fetch Unit using many familiar predictors. Pentium M based designs featured the traditional Branch Target Buffer (BTB), a Branch Address Calculator (BAC) and the Return Address Stack (RAS), but also added two new predictors. The Loop Detector (LD) correctly predicts loop exits, and the Indirect Branch Predictor (IBP) picks targets based on global history, which helps for branches to a calculated address. Merom uses all of these predictors, and adds a new feature. In prior designs, taken branches always introduced a single-cycle bubble into the pipeline. By adding a queue between the branch target predictors and instruction fetch, most of these bubbles can be hidden.
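The idea behind the Loop Detector can be illustrated with a toy predictor: once a loop branch exhibits a stable trip count, the predictor replays that count and predicts the exit correctly. This is a minimal sketch of the concept, not Intel's implementation; the class name, training rule, and taken-fallback are all assumptions:

```python
# Toy loop-exit predictor: learn how many consecutive times a loop
# branch is taken before it falls through, then predict not-taken
# exactly at the learned exit point on later trips.
class LoopDetector:
    def __init__(self):
        self.trained_streak = None  # learned taken-streak length per trip
        self.seen = 0               # taken streak observed so far this trip

    def predict(self):
        if self.trained_streak is None:
            return True             # untrained: assume loop branches are taken
        return self.seen < self.trained_streak

    def update(self, taken):
        if taken:
            self.seen += 1
        else:
            self.trained_streak = self.seen  # loop exited: record the streak
            self.seen = 0

ld = LoopDetector()
correct = 0
for actual in [True, True, True, False] * 2:  # two trips of a 3-taken loop
    correct += (ld.predict() == actual)
    ld.update(actual)
print(correct)  # only the first trip's exit is mispredicted: 7 of 8
```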