The job of the front-end of Sandy Bridge is to consistently deliver enough uops from the instruction stream to keep the back-end occupied. Even the highest performance out-of-order execution CPU will deliver poor results without a capable front-end. For any modern x86 CPU, this is quite challenging. The delivery of the instruction stream is frequently interrupted by branches, and a taken branch may introduce a bubble into the pipeline as instruction fetching is redirected to a new address. Decoding from x86 instruction bytes into uops is complicated by the variable length nature of x86 instructions, the multitude of prefixes, and exceedingly complex microcoded instructions. The Sandy Bridge architects spent tremendous effort to improve all these facets of the front-end. One of the most novel features of the Sandy Bridge microarchitecture is the uop cache, which contains fixed length decoded uops, rather than raw bytes of variable length instructions. A hit in the uop cache bypasses substantial portions of the front-end and improves the delivery of uops to the back-end. The uop cache is conceptually akin to the trace cache from the Pentium 4, but differs in the details – it has been substantially refined and modified, as we will explore on the next page.
Figure 2 – Sandy Bridge Instruction Fetch and Comparison
One of the areas that Intel’s microarchitects concentrate on most keenly is branch prediction. It seems like hardly a generation goes by without Intel improving the branch predictors in one fashion or another. The rationale is fairly straightforward. Many improvements that increase performance also increase the energy used; to maintain efficiency, microarchitects must ensure that a new feature gains more performance than it costs in energy or power. In contrast, branch prediction is one of the few areas where improvements generally increase performance and decrease energy usage. Each mispredicted branch will flush the entire pipeline, losing the work of up to a hundred or so in-flight instructions and wasting all the energy expended on those instructions. Consequently, avoiding expensive mispredictions with better branch predictors is highly desirable and a prime focus for Intel.
The branch prediction in Sandy Bridge was totally rebuilt for better performance and efficiency, while using the same amount of resources. Sandy Bridge retains the four branch predictors found in Nehalem: the branch target buffer (BTB), indirect branch target array, loop detector and renamed return stack buffer. Sandy Bridge has a single BTB that holds twice as many branch targets as the L1 and L2 BTBs in Nehalem, yielding better branch prediction coverage. The single level design was accomplished by representing branches more efficiently and essentially compressing the number of bits required per branch. For example, any taken branch in the predictor must include the displacement from the current IP; branches with a large displacement can be held in a separate table so that most branches (which have a short displacement) do not require as many bits. While Intel did not disclose the number of targets for Nehalem, the P4 BTB had 4K targets, and it seems reasonable that Sandy Bridge has 8K-16K. Just as importantly, the global branch history, which tracks the most recently predicted (and also executed) branches, increased in size to capture a longer pattern history. Again, the number of bits used did not increase – instead, Intel omits certain branches from the pattern history that do not help to make predictions.
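Intel has not published the internals of the Sandy Bridge predictors, but the idea of a global branch history driving predictions can be illustrated with the classic two-level "gshare" scheme sketched below. This is a textbook construction, not Sandy Bridge's actual design; the history length, table size, and hash are arbitrary placeholders.

```python
# Illustrative two-level branch predictor with a global history register
# (the classic "gshare" scheme) -- a textbook sketch, NOT Intel's
# undisclosed Sandy Bridge predictor. All sizes are placeholders.

HISTORY_BITS = 12                  # length of the global branch history
TABLE_SIZE = 1 << HISTORY_BITS     # pattern table of 2-bit saturating counters

class GsharePredictor:
    def __init__(self):
        self.ghr = 0                     # global history register
        self.pht = [1] * TABLE_SIZE      # counters start weakly not-taken

    def _index(self, pc):
        # XOR the branch address with the history, so the same branch can
        # be predicted differently depending on the recent branch pattern.
        return (pc ^ self.ghr) & (TABLE_SIZE - 1)

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2    # >= 2 means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        # Shift the actual outcome into the global history.
        self.ghr = ((self.ghr << 1) | int(taken)) & (TABLE_SIZE - 1)

# A loop branch taken three times, then not taken, repeatedly: once the
# history captures the pattern, even the loop exit is predicted correctly.
p = GsharePredictor()
outcomes = [True, True, True, False] * 100
correct = 0
for n, taken in enumerate(outcomes):
    if n >= 50 and p.predict(0x400) == taken:
        correct += 1
    p.update(0x400, taken)
print(f"accuracy after warm-up: {correct}/{len(outcomes) - 50}")
```

The sketch also shows why history length matters, as described above: a longer history register can distinguish longer repeating patterns, at the cost of more pattern-table entries touched per branch.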
Nehalem enhanced the recovery from branch mispredictions, which has been carried over into Sandy Bridge. Once a branch misprediction is discovered, the core is able to restart decoding as soon as the correct path is known, at the same time that the out-of-order machine is clearing out uops from the wrongly speculated path. Previously, the decoding would not resume until the pipeline was fully flushed.
The instruction fetch for Sandy Bridge is shown above in Figure 2. Branch predictions are queued slightly ahead of instruction fetch so that the stalls for a taken branch are usually hidden, a technique used earlier in Merom and Nehalem. Predictions cover 32B of instructions at a time, while instructions are fetched 16B at a time from the L1 instruction cache.
Once the next address is known, Sandy Bridge will probe both the uop cache (which we will discuss on the next page) and the L1 instruction cache. The L1 instruction cache is 32KB with 64B lines, and the associativity has increased to 8-way; since each 4KB way now matches the page size, the set index comes entirely from the untranslated page-offset bits and the cache can be virtually indexed and physically tagged without aliasing. The L1 ITLB is partitioned between threads for small pages, with dedicated large-page entries per thread. Sandy Bridge added 2 entries for large pages, bringing the total to 128 entries for 4KB pages (for both threads) and 16 fully associative entries for large pages (for each thread).
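The arithmetic behind the virtually indexed, physically tagged organization can be checked directly from the parameters in the text: 32KB across 8 ways with 64B lines gives 64 sets, and the 6 set-index bits plus 6 line-offset bits exactly fill the 12-bit offset of a 4KB page.

```python
import math

# L1 instruction cache parameters from the text.
CACHE_SIZE = 32 * 1024   # 32KB
LINE_SIZE = 64           # 64B lines
WAYS = 8
PAGE_SIZE = 4 * 1024     # 4KB small page

sets = CACHE_SIZE // (LINE_SIZE * WAYS)        # number of sets
offset_bits = int(math.log2(LINE_SIZE))        # bits selecting the byte in a line
index_bits = int(math.log2(sets))              # bits selecting the set
page_offset_bits = int(math.log2(PAGE_SIZE))   # untranslated address bits

# The set index plus line offset fit entirely within the page offset, so the
# virtual and physical set indices are identical: the cache lookup can start
# in parallel with the ITLB access, with the physical tag checked afterward.
print(sets, offset_bits + index_bits, page_offset_bits)
assert offset_bits + index_bits <= page_offset_bits
```

Had Intel kept 4-way associativity at 32KB, each way would be 8KB and one index bit would require translation, forcing either a physically indexed lookup or extra aliasing hardware.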
The instruction fetcher will retrieve 16B from the instruction cache into the pre-decode buffer. The pre-decoder will find and mark the instruction boundaries, decode any prefixes and check for certain properties (e.g. branches). The pre-decoder throughput is limited to 6 instructions per cycle, until the 16B instruction fetch is consumed and the next one can begin. Since the pre-decoding is done in 16B chunks, average throughput can suffer at the end of a chunk. For instance, the first cycle could pre-decode four instructions occupying 15B of the fetch, leaving only 1B of a fifth instruction that spills into the next 16B fetch; that instruction is handled in the second cycle, for an overall throughput of 2.5 instructions per cycle. Large immediates can have a similar impact on throughput as well. Once pre-decoded, the instructions are placed into the instruction queue for decode. The instruction queue in Merom held 18 entries; it has almost certainly grown for Nehalem and Sandy Bridge, but the precise size is unknown.
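The chunk-boundary effect can be illustrated with a toy model of the pre-decoder. The model is an assumed simplification, not the exact hardware behavior: up to 6 instructions are marked per cycle, but only those ending within the current 16B fetch, and an instruction straddling the boundary waits for the next fetch. The instruction lengths below are hypothetical.

```python
FETCH_BYTES = 16      # one fetch from the L1 instruction cache
PREDECODE_WIDTH = 6   # at most 6 instructions marked per cycle

def predecode_cycles(lengths):
    """Cycles to pre-decode instructions with the given byte lengths.

    Simplified model: each cycle marks up to 6 instructions, but only those
    ending within the current 16B fetch; an instruction straddling the
    boundary is charged to the next cycle, once its remaining bytes arrive.
    """
    cycles = 0
    pos = 0   # byte offset in the instruction stream
    i = 0
    while i < len(lengths):
        cycles += 1
        block_end = (pos // FETCH_BYTES + 1) * FETCH_BYTES
        issued = 0
        while (i < len(lengths) and issued < PREDECODE_WIDTH
               and pos + lengths[i] <= block_end):
            pos += lengths[i]
            issued += 1
            i += 1
        if i < len(lengths) and pos + lengths[i] > block_end:
            pos = block_end   # the straddling instruction resumes next cycle
    return cycles

# The example from the text: four instructions fill 15B of the fetch, and a
# fifth straddles the 16B boundary, so five instructions take two cycles
# (2.5 instructions per cycle on average).
print(predecode_cycles([4, 4, 4, 3, 2]))   # boundary-straddling case
print(predecode_cycles([1] * 7))           # 7 short instructions: 6, then 1
print(predecode_cycles([4, 4, 4, 4]))      # a perfectly packed 16B chunk
```

The second case shows the other limit from the text: a dense chunk of 7 instructions still needs two cycles because only 6 can be pre-decoded per cycle.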