Shared Instruction Fetch
Sharing between cores is a key element of Bulldozer’s architecture, and it starts with the front end. The front end has been entirely overhauled and is now responsible for feeding both cores within a module. Bulldozer’s front end includes branch prediction, instruction fetching, instruction decoding and macro-op dispatch. These stages are effectively multi-threaded, with single cycle switching between threads. The arbitration between the two cores is determined by a number of factors, including fairness, pipeline occupancy and stalling events. Each of these major stages is decoupled from the next by an appropriate queue or pair of queues. The front end for Bulldozer is shown below in Figure 2.
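The per-cycle arbitration can be illustrated with a small sketch. The policy below (a stalled thread forfeits its slot, otherwise pick the thread with less work queued, falling back to strict alternation for fairness) is a simplified guess at how the factors listed above might combine; AMD has not published the actual heuristic.

```python
# Hypothetical front-end arbitration between the two cores of a module.
# The policy is an illustrative guess, not AMD's disclosed algorithm.

def pick_thread(cycle, stalled, queue_depth):
    """Return 0 or 1: which core gets the front end this cycle."""
    ready = [t for t in (0, 1) if not stalled[t]]
    if len(ready) == 1:
        return ready[0]          # a stalled thread forfeits its slot
    if not ready:
        return cycle % 2         # neither can make progress; alternate
    # Both ready: favor the thread with less work already queued,
    # falling back to strict alternation for fairness.
    if queue_depth[0] != queue_depth[1]:
        return min((0, 1), key=lambda t: queue_depth[t])
    return cycle % 2
```

With single cycle switching, a stall on one core immediately hands the whole front end to the other, which is the main benefit of fine-grained sharing.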
Since Bulldozer is a high frequency design, branch prediction is critically important. Deeper pipelines tend to have more instructions in flight at any given time, so a mispredict squashes more instructions. The number of instructions squashed translates directly into wasted energy and lost performance. Historically speaking, Intel has invested far more resources and expertise in branch prediction and as a result has the most advanced and highest performance predictors. While Bulldozer is a substantial improvement over Istanbul, the end results are hard to assess without seeing the full details or a testable product.
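The relationship between pipeline depth and misprediction cost is simple arithmetic. The depths and widths below are illustrative assumptions, not disclosed Bulldozer figures, but they show how a deeper pipeline magnifies the penalty.

```python
# Rough illustration of why deep pipelines magnify misprediction cost.
# Stage counts and per-stage occupancy are illustrative assumptions.

def squashed_instructions(pipeline_stages, avg_ops_per_stage):
    """Upper bound on in-flight ops discarded by one mispredict."""
    return pipeline_stages * avg_ops_per_stage

shallow = squashed_instructions(12, 2)   # a shorter pipeline
deep    = squashed_instructions(20, 2)   # a high-frequency design
print(shallow, deep)   # the deeper pipeline squashes ~67% more work
```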
Figure 2 – Bulldozer Instruction Fetch and Comparison
The branch predictor is shared by the two cores in each module and decoupled from instruction fetching via a pair of prediction queues (one queue per core). The branch predictor can run ahead, continuing to predict new relative instruction pointers (RIPs) until the queues fill up.
The first step in branch prediction is determining the direction – whether a branch is taken or not. AMD previously used a local predictor, a global predictor and a selector that would choose which of the two predictors to use. However, they were extremely coy about the predictors used in Bulldozer, other than to indicate that they did not use multi-level predictors. It is possible that AMD included a loop detector, something Intel introduced in the Pentium M.
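The pre-Bulldozer scheme described above, a local predictor, a global predictor and a selector, can be sketched as a classic tournament predictor. The table sizes, indexing and 2-bit counters below are illustrative assumptions for a minimal model, not the real (undisclosed) parameters.

```python
# Minimal sketch of a local/global tournament predictor of the kind
# AMD used before Bulldozer. All sizes and hashing are assumptions.

class TournamentPredictor:
    def __init__(self, bits=10):
        size = 1 << bits
        self.mask = size - 1
        self.local_table = [1] * size    # 2-bit counters, weakly not-taken
        self.global_table = [1] * size
        self.selector = [1] * size       # 0-1 favor local, 2-3 favor global
        self.ghist = 0                   # global branch history register

    def predict(self, pc):
        li = pc & self.mask
        gi = (pc ^ self.ghist) & self.mask
        local = self.local_table[li] >= 2
        global_ = self.global_table[gi] >= 2
        chosen = global_ if self.selector[gi] >= 2 else local
        return chosen, (li, gi, local, global_)

    def update(self, pc, taken):
        _, (li, gi, local, global_) = self.predict(pc)
        # Train the selector toward whichever predictor was right.
        if local != global_:
            delta = 1 if global_ == taken else -1
            self.selector[gi] = min(3, max(0, self.selector[gi] + delta))
        # Saturating-counter update for both component predictors.
        for table, idx in ((self.local_table, li), (self.global_table, gi)):
            c = table[idx]
            table[idx] = min(3, c + 1) if taken else max(0, c - 1)
        self.ghist = ((self.ghist << 1) | taken) & self.mask
```

After a few iterations of a strongly biased branch, the saturating counters lock onto its direction; the selector only matters when the two component predictors disagree.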
Once a branch is predicted as taken, the next step is to determine the target. For branches with a single target address, the branch target buffer (BTB) has been substantially expanded and now uses a two level hierarchy, similar to Nehalem. The L1 BTB is a 512 entry, 4-way associative structure that resolves predictions with a single cycle penalty to the pipeline. The L2 BTB is much larger, with 5120 entries and 5-way associativity, but the extra capacity costs additional latency on an L2 BTB hit. The BTBs in Bulldozer are competitively shared by both cores, but provide greater coverage than the 2K entry BTB in Istanbul.
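A two level BTB lookup can be modeled with the capacities quoted above (512-entry 4-way L1, 5120-entry 5-way L2). The set/tag split, eviction policy and the 3-cycle L2 latency are simplified assumptions; only the sizes come from the text.

```python
# Two-level BTB sketch. Capacities match the text; set/tag split,
# eviction and the L2 latency figure are illustrative assumptions.

L1_SETS, L1_WAYS, L1_LAT = 512 // 4, 4, 1
L2_SETS, L2_WAYS, L2_LAT = 5120 // 5, 5, 3   # L2 latency is a guess

def make_btb(sets):
    return [dict() for _ in range(sets)]     # each set: tag -> target RIP

def btb_insert(btb, sets, ways, rip, target):
    s, tag = rip % sets, rip // sets
    if len(btb[s]) >= ways:                  # crude eviction, not true LRU
        btb[s].pop(next(iter(btb[s])))
    btb[s][tag] = target

def btb_lookup(btb_l1, btb_l2, rip):
    """Return (target, cycles) on a hit, or (None, None) on a miss."""
    s1, tag1 = rip % L1_SETS, rip // L1_SETS
    if tag1 in btb_l1[s1]:
        return btb_l1[s1][tag1], L1_LAT
    s2, tag2 = rip % L2_SETS, rip // L2_SETS
    if tag2 in btb_l2[s2]:
        return btb_l2[s2][tag2], L2_LAT
    return None, None
```

The point of the hierarchy is the same as for caches: the small L1 BTB keeps the common-case prediction fast, while the large L2 BTB catches branches that would otherwise have no target at all.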
The other key structures used to predict the target address of a branch were not described in detail by AMD. They did confirm that for indirect branches (those with more than one target address), Bulldozer includes a 512-entry indirect target array, the same size as Istanbul’s; it is possible the capacity was kept unchanged, although it would make sense to increase the number of entries to account for both cores. Bulldozer includes the familiar call/return stack, which is replicated per thread rather than shared. Istanbul’s 24 entry return address stack could be corrupted by a branch misprediction, which would cause subsequent returns to be mispredicted. Bulldozer has mechanisms to repair the return stack, avoiding this corruption and decreasing return mispredictions, a feature first seen in Nehalem.
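One plausible repair mechanism, checkpointing the top-of-stack pointer at each predicted branch and restoring it on a flush, can be sketched as follows. AMD has not described Bulldozer's exact scheme, so the depth (Istanbul's 24 entries) and the checkpoint approach here are assumptions.

```python
# Return-address stack with misprediction repair. The checkpoint/
# restore scheme is one plausible implementation, not AMD's disclosed
# design; the 24-entry depth matches Istanbul.

class ReturnStack:
    def __init__(self, depth=24):
        self.stack = [0] * depth
        self.tos = 0                       # top-of-stack pointer

    def call(self, return_rip):
        self.stack[self.tos % len(self.stack)] = return_rip
        self.tos += 1

    def ret(self):
        self.tos -= 1
        return self.stack[self.tos % len(self.stack)]

    def checkpoint(self):
        return self.tos                    # saved at each predicted branch

    def repair(self, saved_tos):
        self.tos = saved_tos               # undo wrong-path pushes/pops
```

Without `repair`, a wrong-path call or return permanently skews the pointer, and every subsequent return predicts the wrong address; restoring the saved pointer after a flush keeps the correct-path entries usable.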
The branch prediction and the RIP queue can effectively run ahead of the instruction fetch unit in Bulldozer. This helps the two cores smoothly share the branch prediction hardware and tolerate longer latencies in the front-end. Just as importantly, by having multiple RIPs ready at a given point in time, the fetch unit can prefetch the instruction stream for branches in the BTBs and indirect array. This prefetching hides some of the fetch latency and enables greater memory level parallelism for the instruction caches.
Once a RIP is placed into the prediction queue and passed to the next stage, the fetch unit accesses the dynamically shared ITLBs and L1 I-cache. The L1 ITLB is fully associative with 72 entries for the various page sizes. Istanbul did not support 1GB pages in the ITLBs, and it is almost certain this has been rectified for Bulldozer. The breakdown of the 72 entries in the L1 ITLB has not been disclosed, but judging by design choices made in prior generations, over half the entries will be for 4KB pages, with a lesser number for larger (2MB or 1GB) pages. The backing L2 ITLB, which only holds 4KB pages, has been expanded to 512 entries with 4-way associativity.
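A fully associative TLB supporting multiple page sizes, as the 72-entry L1 ITLB must, can be sketched as below. Only the entry count and page sizes mirror the text; the entry format, probe order and eviction are simplified assumptions (a real CAM probes all entries in parallel).

```python
# Fully associative ITLB sketch with mixed page sizes. Entry count
# and page sizes follow the text; everything else is an assumption.

PAGE_SIZES = [4 << 10, 2 << 20, 1 << 30]   # 4KB, 2MB, 1GB

class FullyAssocTLB:
    def __init__(self, entries=72):
        self.entries = entries
        self.table = {}                    # (vpn, page_size) -> pfn

    def insert(self, vaddr, paddr, page_size):
        if len(self.table) >= self.entries:
            self.table.pop(next(iter(self.table)))   # crude eviction
        self.table[(vaddr // page_size, page_size)] = paddr // page_size

    def translate(self, vaddr):
        # A real CAM matches every entry in parallel; here we just
        # try each supported page size in turn.
        for size in PAGE_SIZES:
            key = (vaddr // size, size)
            if key in self.table:
                return self.table[key] * size + vaddr % size
        return None                        # miss: probe the L2 ITLB
```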
The L1 instruction cache should be very familiar, as it has the same organization as Istanbul’s – 64KB and 2-way associative – and probably contains similar pre-decode information. What is most puzzling about the L1I is the low associativity – essentially one way for each of the two cores sharing the instruction cache. While each cache line is 64B, the fetcher retrieves 32B of instructions each cycle into the Instruction Byte Buffers (IBB), taking two cycles to complete a fetch.
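The address breakdown follows directly from those sizes: 64KB divided by 2 ways and 64B lines gives 512 sets, and a 64B line at 32B per cycle takes two fetch cycles. The exact bit split below is implied by the sizes rather than stated, so treat it as a derived sketch.

```python
# Address breakdown for a 64KB, 2-way, 64B-line instruction cache,
# derived from the sizes in the text.

LINE = 64
WAYS = 2
SIZE = 64 * 1024
SETS = SIZE // (WAYS * LINE)        # 512 sets -> 9 index bits

def split(addr):
    offset = addr % LINE            # 6 offset bits
    index = (addr // LINE) % SETS   # 9 index bits
    tag = addr // (LINE * SETS)     # remaining bits
    return tag, index, offset

FETCH = 32
def cycles_per_line():
    return LINE // FETCH            # two 32B fetches per 64B line
```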
The IBB is the last stop before decoding and acts as the decoupling queue between fetch and decode. Accordingly, there are two IBBs, one dedicated per core. Each IBB contains 16 entries, sometimes called dispatch windows. Each window holds 16B of x86 instructions, thus the total IBB capacity is 256B per core.