Instruction Decode and uop Cache
Decoding fetched instructions in Sandy Bridge has not changed much, as shown below in Figure 3. The instruction queue can send 4 pre-decoded instructions (or 5 with macro-fusion on a compare and branch) to the x86 decoders. The decoders read in the x86 instructions and emit regular, fixed-length uops which are natively processed by the underlying hardware. The four decoders are still asymmetric; the first decoder can handle complex instructions that emit up to 4 uops, while the other three decoders handle simpler, single-uop instructions. For microcoded instructions (i.e. >4 uops), the microcode sequencer will emit 4 uops/cycle until the instruction is finished – although this blocks regular decoding. In both Sandy Bridge and Nehalem, the decoders can emit at most 4 uops per cycle – no matter what mix of instructions is being decoded.
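These steering rules can be illustrated with a toy Python model of one decode cycle. This is a deliberately simplified sketch, not Intel's actual steering logic: decoder 0 accepts instructions of up to 4 uops, the other three decoders accept only single-uop instructions, and total output is capped at 4 uops per cycle.

```python
# Toy model of one decode cycle with an asymmetric 4-decoder arrangement.
# Each instruction is represented only by its uop count; real steering
# has many more constraints than this sketch captures.

def decode_cycle(queue):
    """queue: uop counts of pending instructions, oldest first.
    Returns (uops_emitted, instructions_consumed) for one cycle."""
    uops, consumed = 0, 0
    for i, n in enumerate(queue[:4]):
        if i == 0 and n > 4:
            break  # microcoded: handed off to the microcode sequencer
        if i > 0 and n > 1:
            break  # complex instruction must wait for decoder 0 next cycle
        if uops + n > 4:
            break  # at most 4 uops per cycle in total
        uops += n
        consumed += 1
    return uops, consumed
```

For example, a 3-uop instruction followed by simple instructions fills the cycle after only two instructions, while a complex instruction in any slot other than the first stalls until it reaches decoder 0.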
Most of the new 256-bit AVX instructions in Sandy Bridge are treated as simple instructions by the decoders, thanks to some rather clever implementation work in the execution units and memory pipeline. As with previous generations, the decoders all support uop micro-fusion so that memory-register instructions can decode as one uop. The dedicated stack pointer tracker is also present in Sandy Bridge and renames the stack pointer, eliminating serial dependencies and removing a number of uops. Once instructions are decoded, the uops are sent to the uop cache and to the back-end for allocation, renaming and out-of-order execution. Subsequent instruction fetches to the same address should hit in the uop cache, rather than relying on the normal instruction fetch and decode path.
Figure 3 – Sandy Bridge Instruction Decode and Comparison
The most interesting part of Sandy Bridge’s front-end is the new uop cache, which promises to improve performance and power consumption. Sandy Bridge’s uop cache is conceptually a subset of the instruction cache that is cached in decoded form, rather than a trace cache or basic block cache. The uop cache shares many of the goals of the P4’s trace cache. The key aims are to improve the bandwidth from the front-end of the CPU and remove decoding from the critical path. However, a uop cache is much simpler than a trace cache and does not require a dedicated trace BTB or complicated trace building logic. In typical Intel fashion, the idea first appeared as an instruction loop buffer in Merom, then a uop loop buffer in Nehalem and finally a full-blown uop cache in Sandy Bridge – a consistent trend of refinement.
In Sandy Bridge, the instruction stream is divided into a series of windows. Each window is 32B of instructions, possibly cut short by any taken branches within the window. The mapping between the instruction cache and uop cache occurs at the granularity of a full 32B window. After an entire window has been decoded and sent to the back-end, it is also inserted into the uop cache. Critically, this build process occurs in parallel with the normal operation of the front-end and does not impose any delays.
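The windowing scheme can be sketched in a few lines of Python. This is an illustrative model only; the `(length, taken)` encoding of instructions is invented for the example, and real fetch redirection is more involved.

```python
# Toy sketch of carving a fetched instruction stream into aligned 32B
# windows, with a taken branch terminating a window early.

WINDOW_BYTES = 32

def split_into_windows(start_ip, instructions):
    """instructions: list of (byte_length, is_taken_branch) pairs.
    Returns a list of windows, each a list of instruction IPs."""
    windows, current = [], []
    ip = start_ip
    window_base = ip - (ip % WINDOW_BYTES)
    for length, taken in instructions:
        if ip - window_base >= WINDOW_BYTES:
            # Crossed into the next aligned 32B window.
            windows.append(current)
            current = []
            window_base = ip - (ip % WINDOW_BYTES)
        current.append(ip)
        ip += length
        if taken:
            break  # fetch redirects to the branch target
    if current:
        windows.append(current)
    return windows
```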
Lines in the uop cache are addressed by the IP of the first decoded x86 instruction stored in the line, so the cache is virtually addressed. The entries are tagged so that two threads can statically partition the uop cache. As an added benefit, context or VM switches do not cause a flush (as they did for the trace cache). Since the uop cache works at the granularity of decoded 32B windows, the pseudo-LRU eviction policy must evict all the uop lines corresponding to a given 32B window at once. Self modifying code may also cause evictions, but does not invalidate the whole uop cache.
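A toy model makes the all-or-nothing window mapping concrete. The data structures here are invented for illustration; the point is that lines are tagged per thread and grouped by 32B window, so an eviction removes every line belonging to that window at once.

```python
# Illustrative model of window-granularity uop cache bookkeeping.

WINDOW_BYTES = 32

class WindowMappedUopCache:
    def __init__(self):
        # (thread_id, window_base_ip) -> list of uop lines for that window
        self.windows = {}

    def insert_window(self, thread_id, window_base, lines):
        self.windows[(thread_id, window_base)] = lines

    def evict_window(self, thread_id, window_base):
        # Triggered by pseudo-LRU victim selection or self-modifying code;
        # either way, all lines of the window disappear together.
        self.windows.pop((thread_id, window_base), None)

    def lookup(self, thread_id, ip):
        base = ip - (ip % WINDOW_BYTES)
        return self.windows.get((thread_id, base))
```

Note that the per-thread tag means one thread's entries are invisible to the other, matching the static partitioning described above.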
Sandy Bridge’s uop cache is organized into 32 sets and 8 ways, with 6 uops per line, for a total of 1.5K uops capacity. The uop cache is strictly included in the L1 instruction cache. Each line also holds metadata including the number of valid uops in the line and the length of the x86 instructions corresponding to the uop cache line. Each 32B window that is mapped into the uop cache can span 3 of the 8 ways in a set, for a maximum of 18 uops – roughly 1.8B/uop. If a 32B window has more than 18 uops, it cannot fit in the uop cache and must use the traditional front-end. Microcoded instructions are not held in the uop cache, and are instead represented by a pointer to the microcode ROM and optionally the first few uops.
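The geometry above can be worked through numerically. The constants come straight from the text; the ceiling-division helper is just arithmetic.

```python
# uop cache geometry from the text: 32 sets x 8 ways x 6 uops per line,
# with each 32B window allowed to span at most 3 ways in a set.

SETS, WAYS, UOPS_PER_LINE = 32, 8, 6
MAX_LINES_PER_WINDOW = 3

total_uops = SETS * WAYS * UOPS_PER_LINE                    # 1536, i.e. "1.5K uops"
max_uops_per_window = MAX_LINES_PER_WINDOW * UOPS_PER_LINE  # 18 uops per 32B window

def lines_needed(num_uops):
    """uop cache lines a decoded 32B window needs, or None if it cannot fit."""
    lines = -(-num_uops // UOPS_PER_LINE)  # ceiling division
    return lines if lines <= MAX_LINES_PER_WINDOW else None
```

A window of 18 uops exactly fills 3 lines (32B / 18 uops is the roughly 1.8B/uop figure above); a 19th uop pushes the window out of the uop cache entirely.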
According to Intel, the uop cache performs like a 6KB instruction cache and has a roughly 80% hit rate. By comparison, the 12K uop trace cache of the P4 was supposed to have similar performance to an 8KB-16KB instruction cache. The 4-8X difference in effective storage between the two designs is driven by Sandy Bridge’s more powerful uops and the elimination of the duplicate entries that plagued the P4’s trace cache.
Going back a bit to the instruction fetch, the predicted IP is used to probe the tags for the uop cache in parallel with the regular instruction cache. A hit in the uop cache tags yields set and way identifiers that are put into a match queue and then used to access the uop cache data arrays. When a hit occurs, the uop cache will retrieve all of the uops in a 32B window (i.e. up to 3 lines) into a nearby queue or buffer. It is quite likely that the uop cache data arrays only read out a single line each cycle, so a hit may take multiple cycles to finish. Like the decoders, the uop cache can send up to 4 uops per cycle to the Decoder Queue. However, a uop cache hit can span a full 32B window of instructions. This doubles the bandwidth of the traditional front-end, which is limited to 16B instruction fetches. The extra bandwidth is particularly helpful where the average instruction length is over 4 bytes; for example, AVX instructions use a 2 or 3 byte prefix. Additionally, since each uop cache line can hold more uops than the back-end can rename and allocate (6 vs. 4), the queuing within the uop cache can hide the pipeline bubbles introduced by taken branches and effectively ‘stitch across taken branches’.
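A back-of-the-envelope comparison shows the bandwidth advantage. The 16B and 32B figures come from the text; the 5-byte average instruction length is a hypothetical stand-in for prefix-heavy AVX code.

```python
# Instructions supplied per cycle by each front-end path, assuming a
# uniform (hypothetical) average instruction length.

LEGACY_FETCH_BYTES = 16       # traditional 16B instruction fetch
UOP_CACHE_WINDOW_BYTES = 32   # a uop cache hit covers a full 32B window

def insns_per_fetch(avg_insn_bytes, fetch_bytes):
    return fetch_bytes // avg_insn_bytes

legacy_rate = insns_per_fetch(5, LEGACY_FETCH_BYTES)         # 3 instructions
uop_cache_rate = insns_per_fetch(5, UOP_CACHE_WINDOW_BYTES)  # 6 instructions
```

With 5-byte instructions, the legacy fetch cannot even supply the 4 instructions per cycle that the decoders and renamer can consume, while a uop cache hit comfortably can.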
A hit in the uop cache will completely bypass and clock-gate the instruction fetch and decode hardware – saving power and improving the front-end uop supply. After a hit, the uop cache calculates the next sequential IP using the instruction length fields that are stored in each uop cache line. As mentioned above, taken branches are handled by the traditional branch predictor. In the case of a uop cache miss, then the instruction fetch and decode proceed as previously described and the uop cache can be clock gated to save power.
The Sandy Bridge architects decided that handling partial hits in the uop cache is too complex and power inefficient. Resolving partial hits would require activating both the traditional front-end and the uop cache, plus additional logic for synchronizing the two partial outputs. The uop cache line placement and eviction policies were designed so that each 32B window is fully cached (or not at all). Thus when a hit occurs, the traditional front-end is not involved.
All the uops from the traditional front-end and the uop cache are ultimately delivered into the 28 uop Decoder Queue. The Decoder Queue still acts as a cache for small loops, as in Nehalem, and can actually bypass both the instruction cache and uop cache.
One of the critical differences between the uop cache in Sandy Bridge and the P4’s trace cache is that the uop cache is fundamentally meant to augment a traditional front-end. In contrast, the Pentium 4 attempted to use the trace cache to replace the front-end and relied on vastly slower fetch and decode mechanisms for any workloads which did not fit in the trace cache. The uop cache was carefully designed so that it is an enhancement for Sandy Bridge and does not penalize any workloads.
The uop cache is one of the most promising features in Sandy Bridge because it both decreases power and improves performance. It avoids power-hungry x86 decoding, which spans several pipeline stages and requires fairly expensive hardware to handle the irregular instruction set. For a hit in the uop cache, Sandy Bridge’s pipeline (as measured by the mispredict penalty) is several cycles shorter than Nehalem’s, although in the case of a uop cache miss, the pipeline is about 2 stages longer. The uop cache increases performance by more consistently delivering uops to the back-end and eliminating various bubbles in the fetch and decode process. For example, the 16B fetch, length changing prefixes and the decoding restrictions all limit the traditional front-end, whereas the uop cache avoids those issues and can achieve higher performance. Generally, the uop cache seems to avoid the problems of a trace cache, while delivering most of the benefits in a much more power efficient manner.