The Fetch Phase
The front end of Barcelona is fairly complex and has been substantially improved over the K8, as shown in Figure 2 below. Each cycle, Barcelona fetches 32B of instructions from the L1I cache into the predecode/pick buffer. The previous generation K8 fetched 16B each cycle, as does Intel’s Core 2. The instruction fetch was widened because many SIMD and 64-bit instructions are longer, and as these become more common, larger fetches are required to keep the rest of the core busy. Consequently, the predecode/pick buffer for Barcelona has been enlarged to at least 32B, although it could be somewhat larger: the K8’s predecode buffer was 1.5x the fetch size, so a 48B buffer might not be out of the question.
This makes sense as Barcelona is targeted first and foremost at servers, where 64-bit mode is common. Core 2, on the other hand, was designed with more focus on consumers, who purchase the majority of computer systems. The reality is that even now, 64-bit operating systems are extraordinarily rare for desktops, and especially notebooks; in those market segments, the additional benefit is more limited and may not be worth the resources.
Figure 2 – Comparison of Front-End Microarchitecture
Branch prediction, which Barcelona inherits from the K8, also received a serious overhaul. The K8 uses a branch selector to choose between a bi-modal predictor and a global predictor. The bi-modal predictor and branch selector are both stored in the ECC bits of the instruction cache, as predecode information. The global predictor combines the instruction pointer (RIP) of a conditional branch with a global history register that tracks the last 8 branches, and uses the result to index into a 16K-entry prediction table of 2-bit saturating counters. If the branch is predicted as taken, then the destination must be predicted from the 2K-entry target array. Indirect branches use a single target in the array, while CALLs use a target and also update the return address stack. The branch target address calculator (BTAC) checks the targets for relative branches, and can correct predictions from the target array, with a two-cycle penalty. Returns are predicted with the 12-entry return address stack.
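To make the global predictor described above more concrete, here is a minimal Python sketch of a history-indexed table of 2-bit saturating counters. The table size and history length come from the text; the XOR hash, the initial counter value, and all class and function names are our own assumptions for illustration, not AMD's actual design.

```python
class GlobalPredictor:
    """Toy global-history predictor; not a model of the real K8 hardware."""

    def __init__(self, history_bits=8, table_entries=16 * 1024):
        self.history_mask = (1 << history_bits) - 1
        self.table_entries = table_entries
        self.history = 0                      # recent outcomes, newest in bit 0
        self.counters = [1] * table_entries   # 2-bit counters, start weakly not-taken

    def _index(self, rip):
        # Assumption: combine RIP and history with XOR (gshare style).
        return (rip ^ self.history) % self.table_entries

    def predict(self, rip):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(rip)] >= 2

    def update(self, rip, taken):
        i = self._index(rip)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def count_mispredicts(predictor, trace):
    """Run (rip, taken) pairs through the predictor and count wrong guesses."""
    misses = 0
    for rip, taken in trace:
        if predictor.predict(rip) != taken:
            misses += 1
        predictor.update(rip, taken)
    return misses
```

Feeding this sketch a single branch that strictly alternates between taken and not taken shows why history helps: once the history register fills, each outcome maps to its own counter and the pattern is predicted without error. Passing `history_bits=12` models the longer history register that Barcelona uses.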
Barcelona does not fundamentally alter the branch prediction, but improves the accuracy. The global history register now tracks the last 12 branches, instead of the last 8. Barcelona also adds a new indirect predictor, which is specifically designed to handle branches with multiple targets (such as switch or case statements). Indirect branch prediction was first introduced with Intel’s Pentium M and later the Prescott microarchitecture. Branches with a single target still use the existing 2K-entry branch target buffer. The 512-entry indirect predictor allocates an entry when an indirect target is mispredicted; the target addresses are indexed by the global branch history register and branch RIP, thus taking into account the path that was used to reach the indirect branch and the address of the branch itself. Lastly, the return address stack is doubled to 24 entries.
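The allocate-on-mispredict behavior of the indirect predictor can be sketched in a few lines of Python. The 512 entries and 12 bits of history are taken from the text; the XOR index, the absence of tags, the use of a dictionary as a stand-in for the single-target buffer, and every name below are assumptions made purely for illustration.

```python
class IndirectPredictor:
    """Toy path-indexed indirect target predictor; a sketch, not AMD's design."""

    def __init__(self, entries=512, history_bits=12):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.history = 0
        self.targets = {}      # index -> target, used for multi-target branches
        self.last_target = {}  # rip -> last target (stand-in for the target buffer)

    def record_conditional(self, taken):
        # Conditional branch outcomes form the path leading to the indirect branch.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask

    def _index(self, rip):
        # Assumption: XOR of branch RIP and global history, folded to the table size.
        return (rip ^ self.history) % self.entries

    def predict(self, rip):
        idx = self._index(rip)
        if idx in self.targets:
            return self.targets[idx]
        return self.last_target.get(rip)  # fall back to the single known target

    def update(self, rip, actual):
        mispredicted = self.predict(rip) != actual
        if mispredicted:
            # Allocate (or overwrite) an entry only when the target was wrong.
            self.targets[self._index(rip)] = actual
        self.last_target[rip] = actual
        return mispredicted


def dispatch_mispredicts(outcomes, rip=0x4010):
    """One indirect branch whose target depends on the preceding conditional."""
    pred = IndirectPredictor()
    misses = []
    for taken in outcomes:
        pred.record_conditional(taken)
        target = 0x1000 if taken else 0x2000  # hypothetical case-label addresses
        misses.append(pred.update(rip, target))
    return misses
```

In this toy setup the target is fully determined by the path, so after a short warm-up the predictor stops mispredicting, whereas a last-target scheme would miss every time the target changes.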
According to our own measurements for several PC games, between 16% and 50% of all branch mispredicts were indirect (29% on average). The real value of indirect branch prediction is for many of the newer scripting or high-level languages, such as Ruby, Perl or Python, which use interpreters. Other common indirect branch culprits include virtual functions (used in C++) and calls through function pointers. For the same set of games, we measured that between 0.5% and 5% (1.5% on average) of all return address stack references resulted in overflow, but overflow may be more prevalent in server workloads.
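The effect of return address stack overflow can be illustrated with a small Python sketch that compares a 12-entry stack against a 24-entry one. How real hardware behaves on overflow is not described here, so dropping the oldest entry is an assumption, as are the class name, the helper function, and the made-up return addresses.

```python
from collections import deque


class ReturnAddressStack:
    """Toy fixed-depth return address stack; overflow policy is an assumption."""

    def __init__(self, entries):
        # A bounded deque silently discards the oldest entry when full.
        self.stack = deque(maxlen=entries)
        self.overflows = 0

    def call(self, return_address):
        if len(self.stack) == self.stack.maxlen:
            self.overflows += 1  # this push overwrites an older return address
        self.stack.append(return_address)

    def ret(self):
        # An empty stack has no prediction to offer.
        return self.stack.pop() if self.stack else None


def mispredicted_returns(entries, depth):
    """Nest `depth` calls, then unwind and count wrongly predicted returns."""
    ras = ReturnAddressStack(entries)
    actual = []
    for level in range(depth):
        addr = 0x1000 + 4 * level  # hypothetical return addresses
        actual.append(addr)
        ras.call(addr)
    misses = 0
    while actual:
        if ras.ret() != actual.pop():
            misses += 1
    return misses
```

With a call chain 20 levels deep, `mispredicted_returns(12, 20)` loses the 8 oldest return addresses and mispredicts those returns, while `mispredicted_returns(24, 20)` holds the whole chain and mispredicts none.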