The Front-end: Decode Phase
Once x86 macro-instructions have been fetched into the instruction queue, they are ready to be decoded into micro-ops (uops), the internal RISC-like instructions used by Intel’s microarchitectures. Core 2 and Nehalem both have four decoders: one complex and three simple. A simple decoder can handle any x86 instruction that decodes into a single uop, which now includes most SSE instructions. The complex decoder handles instructions that map to 1-4 uops, and the microcode sequencer handles anything more complicated than that.
Figure 2 – Front-end Microarchitecture Comparison
Nehalem refines and improves the macro-op fusion already found in the previous generation. In 32 bit mode, the Core 2 could decode a comparison (CMP) or test (TEST) and a conditional branch (Jcc) into a single CMP+Jcc uop. This increased the decode bandwidth of the Core 2 and reduced the uop count, making the machine effectively wider. Macro-op fusion in Nehalem works with a wider variety of branch conditions, including JL/JNGE, JGE/JNL, JLE/JNG and JG/JNLE, so any of those, in addition to the previously handled cases, will decode into a single CMP+Jcc uop. Best of all, Nehalem’s macro-op fusion operates in both 32 bit and 64 bit mode. This is essential, since the majority of servers and workstations run 64 bit operating systems. Even modern desktops are approaching the point where 64 bits makes a lot of sense, given the memory requirements of modern operating systems and current DIMM capacities and DRAM density. In addition to fusing x86 macro-instructions, the decoding logic can also fuse uops, a technique first demonstrated with the Pentium M.
Once x86 instructions are decoded into uops, they go into a 28 entry uop buffer. As alluded to earlier, the Core 2 had a Loop Stream Detector (LSD) in the 18 entry instruction queue that acted as a cache for the instruction fetch unit. Nehalem improves on this concept by moving the LSD further down the pipeline, past the decode stage, into the new 28 entry uop buffer.
If a loop fits within 28 uops, then Nehalem can cache it in the LSD and issue it into the out-of-order engine without using the instruction fetch unit or the decoders. This saves even more power than the Core 2’s LSD, since the decoders are bypassed as well, and larger loops can be cached. Nehalem’s 28 entry uop buffer can hold the equivalent of roughly 21-23 x86 macro-instructions, based on our measurements from several games. The macro-op/uop ratio depends heavily on the workload, but in general Nehalem’s buffer is ‘larger’ than the one found in the Core 2.
One of the most interesting things to note about Nehalem is that the LSD is conceptually very similar to a trace cache. The goal of the trace cache was to store decoded uops in dynamic program order, instead of the static compiler-ordered x86 instructions stored in the instruction cache, thereby removing the decoder and branch predictor from the critical path and enabling multiple basic blocks to be fetched at once. The problem with the trace cache in the P4 was that it was extremely fragile; when the trace cache missed, it would decode instructions one by one. The hit rate for a normal instruction cache is well above 90%. The trace cache hit rate was extraordinarily low by those standards, rarely exceeding 80% and easily getting as low as 50-60%. In other words, 40-50% of the time the P4 was behaving exactly like a single issue microprocessor, rather than taking full advantage of its execution resources. The LSD buffer achieves almost all of the same goals as a trace cache, and when it doesn’t work (i.e. the loop is too big) there are none of the extremely painful downsides that came with the P4’s trace cache.
After the LSD buffer, the last step in decode is the dedicated stack engine, which removes the stack pointer modifying uops, such as those generated implicitly by PUSH, POP, CALL and RET. These uops are all executed by a dedicated adder that writes to a speculative delta register in the front end, which is occasionally synchronized with a renamed architectural register containing the non-speculative value of the stack pointer. After the stack manipulating uops have been excised, the remaining uops head down into the out-of-order machine to be renamed, issued, dispatched and executed.