The Front-end: Fetch Phase
Nehalem’s fetch phase has been substantially modified, although Figure 2 does not show some of these details. The instruction fetch unit in the diagram below contains the relative instruction pointer (RIP), which is replicated so that each thread context has its own copy.
Figure 2 – Front-end Microarchitecture Comparison
The instruction fetch unit also contains the branch predictor, which is responsible for predicting the RIP of the next instructions to be fetched. Nehalem’s branch predictors are not shown in detail, partially because some of the details are unknown – Intel simply states that they use “best in class” branch predictors and that their predictors are tuned to work with SMT. Intel did confirm that Nehalem continues to use all of the special predictors from the previous generations, such as the loop detector, indirect predictor, etc.
Once the branch predictor has determined that a branch is taken, the branch target buffer (BTB) is responsible for predicting the target address. Nehalem augments the previous generation of branch prediction by using a two level BTB scheme. For reference, Barcelona uses a 2K entry BTB for direct branches, and a 512 entry indirect branch target array.
Nehalem’s two level BTB is designed to increase performance and power efficiency by improving branch prediction accuracy for workloads with larger instruction footprints (such as databases, ERP and other commercial applications). At this time, Intel is not describing the exact internal arrangement, but it is very possible to make an educated guess or two.
At a high level, there are two possibilities. The first possibility is that the two BTBs could use the same predictor algorithm, but one accesses a smaller history file that contains the most recently used branch RIPs and target RIPs. If that were the case, then the relationship between the BTBs would be the same as the relationship of an L1 cache to an L2 cache (remember that branch targets have fairly good locality). For example, the first level BTB could hold 256-512 entries, while a larger second level BTB could hold 2-8K entries. If a branch RIP is not in the first level BTB, then it would check the second level to make a target prediction. This approach has the benefit of using relatively little power, except when the instruction footprint is very big (i.e. does not fit in the L1 BTB). However, in that situation the extra power saved by fewer branch mispredictions will more than compensate for the extra power used by the L2 BTB.
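To make the cache-like arrangement concrete, here is a minimal Python sketch of a two level BTB in which the second level is only consulted on a first level miss. It illustrates the guess above and is not Intel's design: the class names, the fully associative LRU replacement, the promotion on an L2 hit, and the default sizes (512 and 4096 entries, picked from the ranges mentioned above) are all assumptions.

```python
from collections import OrderedDict


class LruTable:
    """Fixed-capacity table with least-recently-used replacement (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def insert(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry


class TwoLevelBTB:
    """Hypothetical L1/L2 BTB: the larger table is probed only on an L1 miss."""

    def __init__(self, l1_entries=512, l2_entries=4096):
        self.l1 = LruTable(l1_entries)
        self.l2 = LruTable(l2_entries)
        self.l2_lookups = 0  # rough proxy for the extra power spent in the L2 BTB

    def predict(self, branch_rip):
        target = self.l1.lookup(branch_rip)
        if target is not None:
            return target  # common case: the L2 BTB stays idle
        self.l2_lookups += 1
        target = self.l2.lookup(branch_rip)
        if target is not None:
            self.l1.insert(branch_rip, target)  # promote, as a cache fill would
        return target  # None means no target prediction is available

    def update(self, branch_rip, target_rip):
        """Record the resolved target of a taken branch in both levels."""
        self.l1.insert(branch_rip, target_rip)
        self.l2.insert(branch_rip, target_rip)


# Usage: a tiny L1 so that the third branch pushes the first one out of it.
btb = TwoLevelBTB(l1_entries=2, l2_entries=8)
for rip, target in [(0x100, 0x200), (0x110, 0x210), (0x120, 0x220)]:
    btb.update(rip, target)
print(hex(btb.predict(0x120)), btb.l2_lookups)  # L1 hit, no L2 access
print(hex(btb.predict(0x100)), btb.l2_lookups)  # L1 miss, served by the L2
```

The `l2_lookups` counter captures the power argument: as long as the working set of branches fits in the first level, the second level is never touched.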
The second alternative (which is much less likely) is that the first and second level BTBs use different prediction algorithms and different history files. For example, the first level BTB could use a very simple and fast algorithm with a relatively small history table, while the second level would use a slower, more accurate algorithm and act as an over-riding predictor. If the second level predictor disagrees with the first, it overrides the first level prediction and has to fix up the pipeline by discarding any erroneously fetched instructions and restarting fetch from the newly predicted RIP. This sort of scheme is relatively unlikely since it is very power inefficient. In a two level over-riding predictor, the common case is that the L1 BTB and L2 BTB both independently come up with the correct branch target, which means that most of the time the L2 BTB is just wasting power. The over-riding predictor only pays for its energy when the L1 BTB is incorrect and the L2 BTB is correct, which is a very small percentage of the time. While we do not believe Intel used this approach, it is worth mentioning since it is a possible option.
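The energy argument against an over-riding scheme can be illustrated by classifying each prediction according to what the second level contributed. The sketch below is purely illustrative: the function names and outcome labels are hypothetical, and the sample predictions are made up for the example rather than measured.

```python
from collections import Counter


def classify_override(l1_target, l2_target, actual_target):
    """Label what the second level predictor contributed for one branch."""
    if l1_target == l2_target:
        return "agree"  # no pipeline fix-up; the L2 lookup changed nothing
    if l2_target == actual_target:
        return "useful_override"  # L1 was wrong and L2 corrected it
    if l1_target == actual_target:
        return "harmful_override"  # L2 replaced a correct prediction
    return "useless_override"  # both levels were wrong


def tally(predictions):
    """Count outcomes over (l1_target, l2_target, actual_target) tuples."""
    return Counter(classify_override(*p) for p in predictions)


# Made-up sample: most branches are predicted identically by both levels.
sample = [
    (0x200, 0x200, 0x200),
    (0x210, 0x210, 0x210),
    (0x220, 0x220, 0x220),
    (0x230, 0x238, 0x238),  # the only case where the L2 earns its power
]
print(tally(sample))
```

Every prediction pays for an L2 lookup, but only the `useful_override` cases get anything back for it.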
Another improved branch target prediction mechanism in Nehalem is the return stack buffer (RSB). When a function is called, the RSB records the return address, so that when the function returns, fetch can resume where the caller left off instead of ending up at the wrong address. An RSB can overflow if too many functions are called recursively, and it can also be corrupted and produce bad return addresses if the branch predictor speculates down a wrong path. Nehalem actually renames the RSB, which avoids return stack overflows and ensures that most misspeculation does not corrupt the RSB. There is a dedicated RSB for each thread to avoid any cross-contamination.
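The overflow problem is easy to see in a sketch of a conventional, non-renamed RSB built as a small circular buffer. This models the failure mode described above, not Nehalem's renamed implementation, whose details are not given here. The class name and the default depth of 16 are assumptions for illustration.

```python
class ReturnStackBuffer:
    """Conventional circular return stack; the depth is a hypothetical value."""

    def __init__(self, depth=16):
        self.depth = depth
        self.slots = [None] * depth
        self.top = 0    # number of pushes minus pops; used modulo depth
        self.count = 0  # valid entries, never more than depth

    def on_call(self, return_rip):
        # On overflow this silently overwrites the oldest return address.
        self.slots[self.top % self.depth] = return_rip
        self.top += 1
        self.count = min(self.count + 1, self.depth)

    def on_return(self):
        """Predict the return target, or None if the buffer holds nothing valid."""
        if self.count == 0:
            return None
        self.top -= 1
        self.count -= 1
        return self.slots[self.top % self.depth]


# Usage: three nested calls into a two entry buffer lose the outermost return.
rsb = ReturnStackBuffer(depth=2)
for return_rip in (0x10, 0x20, 0x30):
    rsb.on_call(return_rip)
print(rsb.on_return())  # 0x30, correct
print(rsb.on_return())  # 0x20, correct
print(rsb.on_return())  # None: the entry for 0x10 was overwritten
```

The sketch does not model wrong-path corruption, which happens when speculative calls and returns push and pop entries that are never undone after a misprediction.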
The fetch unit takes the next predicted address for each thread (which is usually just the next sequential address) and then indexes into the ITLB and L1I cache. The ITLB is statically partitioned between both threads and has 128 entries for 4KB pages arranged with four way associativity. Each thread has 7 fully associative and dedicated entries for large pages (2MB/4MB) in addition to the shared small page entries. The instruction cache is 32KB and 4 way associative, with competitive sharing between threads. Each fetch from the cache grabs 16B of instructions, which go into the pre-decode and fetch buffer. Then up to 6 instructions are sent from the buffer into the 18 entry instruction queue. The instruction queue in Core 2 is used as a loop cache (Loop Stream Detector or LSD), so that the instruction fetch unit can actually shut down for small loops. The instruction queue in Nehalem merely acts as a buffer for instructions before they are decoded, because the loop cache is located later in the pipeline, in the decode stage.
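The geometry implied by these figures can be worked out with a few lines of arithmetic. In the sketch below, the 64B cache line size and the simple modulo set indexing are assumptions that the text above does not state, the helper names are hypothetical, and the static partitioning of the ITLB between threads is not modelled.

```python
# Figures taken from the text above.
ITLB_ENTRIES = 128        # 4KB-page entries
ITLB_WAYS = 4
L1I_BYTES = 32 * 1024
L1I_WAYS = 4
FETCH_BYTES = 16          # bytes delivered per fetch

# Assumption: 64B cache lines (not stated in the text).
LINE_BYTES = 64

ITLB_SETS = ITLB_ENTRIES // ITLB_WAYS            # 128 / 4 = 32 sets
L1I_SETS = L1I_BYTES // (L1I_WAYS * LINE_BYTES)  # 32KB / (4 * 64B) = 128 sets
FETCHES_PER_LINE = LINE_BYTES // FETCH_BYTES     # 64B / 16B = 4 fetch blocks


def l1i_set(addr):
    """Set index for an address, assuming plain modulo indexing by line number."""
    return (addr // LINE_BYTES) % L1I_SETS


def fetch_block_in_line(addr):
    """Which 16B fetch block within its cache line an address falls in."""
    return (addr % LINE_BYTES) // FETCH_BYTES


print(ITLB_SETS, L1I_SETS, FETCHES_PER_LINE)
print(l1i_set(0x401040), fetch_block_in_line(0x401050))
```

Under the 64B line assumption, each way of the instruction cache covers 8KB, and a full line takes four 16B fetches to consume.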