The instruction fetch pipeline in Jaguar and Bobcat is a total of six stages. However only the first three stages are on the critical path – the latter three stages occur in parallel with instruction decoding and are primarily used to check that the fetch address was correct and resteer the pipeline if necessary (e.g., to recover from a branch misprediction).
The L1 instruction cache (L1I) is a 32KB, 2-way set associative design with 64B cache lines. The cache is parity protected and uses a pseudo-Least Recently Used (LRU) replacement policy. The L1 Instruction Translation Look-aside Buffer (ITLB) is fully associative, with 32 entries for 4KB pages and 8 entries for 2MB pages. Larger 1GB pages are fragmented into multiple 2MB pages, since the target markets are relatively unlikely to use huge pages.
Instruction fetching occurs in 32B fetch windows, which are the basis for much of the front-end, so fetching a full cache line takes at least two cycles.
Jaguar and Bobcat predict branches using a variety of structures optimized for different branch behaviors, taking advantage not only of different branch types (e.g., direct vs. indirect), but also of branch density. While the prediction hardware is complicated, it reduces power while delivering high accuracy and performance.
When a branch is detected, the IP address of the fetch window indexes into the Branch Target Buffer (BTB), which is coupled to the L1I. The BTB is a two-level structure; the L1 is optimized for sparse branches and the L2 handles dense branches. The L1 BTB is conceptually part of the instruction cache; it tracks two branches for every 64B line (1024 entries total) and can simultaneously predict both branches with only a single cycle penalty for taken branches. The L2 BTB is allocated dynamically and tracks two additional branches per 8B region; it also contains 1024 entries. The L2 BTB is slower and makes a single prediction per cycle, with a two cycle penalty for the first dense branch prediction and only a single cycle for any subsequent prediction. The BTB design saves power by only engaging the L2 when code actually has 3 or more branches per cache line, exploiting branch density to reduce power.
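The two-level lookup described above can be sketched in software. This is a minimal illustrative model, not AMD's actual design: the entry layout, indexing, and the way L1 and L2 results are combined are all assumptions; only the structure sizes and penalties come from the article.

```python
# Hypothetical two-level BTB model: sparse branches live in the L1 BTB
# (2 per 64B line), and spills go to the dynamically allocated L2 BTB
# (2 more per 8B region). Real hardware combines both levels in one
# prediction; this sketch checks the L1 first for simplicity.

LINE_SIZE = 64  # bytes per L1I cache line


class TwoLevelBTB:
    def __init__(self):
        self.l1 = {}  # line index -> list of (branch_ip, target), max 2
        self.l2 = {}  # 8B region index -> list of (branch_ip, target), max 2

    def install(self, branch_ip, target):
        sparse = self.l1.setdefault(branch_ip // LINE_SIZE, [])
        if len(sparse) < 2:
            sparse.append((branch_ip, target))
            return "L1"
        # The L1 is full for this line: spill into the dense L2 BTB.
        dense = self.l2.setdefault(branch_ip // 8, [])
        if len(dense) < 2:
            dense.append((branch_ip, target))
            return "L2"
        return "no room"

    def predict(self, fetch_ip):
        """Return (targets, taken-branch penalty cycles) for a 32B window."""
        hits = [t for ip, t in self.l1.get(fetch_ip // LINE_SIZE, ())
                if fetch_ip <= ip < fetch_ip + 32]
        if hits:
            return hits, 1  # L1 predicts both branches in one shot
        for region in range(fetch_ip // 8, (fetch_ip + 32) // 8):
            dense = self.l2.get(region)
            if dense:
                return [dense[0][1]], 2  # first dense prediction: 2 cycles
        return [], 0
```

The power saving falls out naturally: the `l2` dictionary is only populated (and thus only needs to be consulted) once a line accumulates a third branch.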
Conditional near branches are implicitly predicted as not-taken, which saves space in the BTB. Once such a branch is taken, it is set to always taken in the BTB. Should the always-taken branch subsequently fall through, it switches to a dynamic neural network predictor using 26 bits of global history. The Bobcat and Jaguar branch predictor proved so successful that it was later adopted for AMD's big cores, particularly Piledriver.
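The per-branch policy above is effectively a three-state machine. The sketch below models it, with a trivial stand-in for the 26-bit-history neural predictor (whose internals are not described here):

```python
# Per-branch prediction policy: implicit not-taken (no BTB entry) ->
# always taken after the first taken outcome -> dynamic predictor after
# the first fall-through. The dynamic stage is a placeholder for the
# neural predictor and is not modeled.

class BranchPolicy:
    NOT_TRACKED = "implicit not-taken"   # no BTB entry allocated yet
    ALWAYS_TAKEN = "always taken"
    DYNAMIC = "dynamic (neural predictor)"

    def __init__(self):
        self.state = self.NOT_TRACKED

    def update(self, taken):
        if self.state == self.NOT_TRACKED and taken:
            self.state = self.ALWAYS_TAKEN   # first taken: allocate BTB entry
        elif self.state == self.ALWAYS_TAKEN and not taken:
            self.state = self.DYNAMIC        # first fall-through: go dynamic
```

Note the transitions are one-way: a never-taken branch costs no BTB space at all, and only branches with genuinely mixed behavior consume dynamic predictor resources.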
Another BTB optimization is that the L1 and L2 BTBs only predict target addresses for direct branches that are in the same 4KB page as the IP of the fetch window. A 32-entry out-of-page target array handles branch targets with up to 256MB of displacement for the L1 BTB. Sparse branch targets with more than 256MB of displacement, and dense branches with out-of-page targets, are resolved by the branch target address calculator with a four cycle penalty.
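The routing decision reduces to two address comparisons. A minimal sketch, assuming the "same page" test compares bits above the 4KB page offset and that displacement is the absolute address distance (the article does not specify either detail):

```python
# Classify where a direct branch target gets resolved, per the scheme
# described above. PAGE_SHIFT and the displacement definition are
# illustrative assumptions.

PAGE_SHIFT = 12           # 4KB pages
MAX_DISP = 256 * 2**20    # 256MB out-of-page target array reach


def target_location(fetch_ip, target, sparse=True):
    if (fetch_ip >> PAGE_SHIFT) == (target >> PAGE_SHIFT):
        return "in-page (L1/L2 BTB)"
    if sparse and abs(target - fetch_ip) <= MAX_DISP:
        return "out-of-page target array"
    # Sparse targets beyond 256MB, and any dense out-of-page target.
    return "branch target address calculator (4-cycle penalty)"
```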
Near calls and the associated returns are predicted by a 16 entry Return Address Stack (RAS). The RAS can recover from most forms of misspeculation without corrupting its predictions. For cases that cannot be recovered, the RAS is invalidated to avoid mispredictions.
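A RAS is just a small hardware stack of return addresses. The sketch below shows the basic push/pop/invalidate behavior; the overflow policy (overwriting the oldest entry) is a common convention and an assumption here, not a documented Jaguar detail.

```python
# Minimal 16-entry Return Address Stack model. Calls push their return
# address; returns pop the prediction; unrecoverable misspeculation
# clears the whole stack rather than risk a run of bad predictions.

class ReturnAddressStack:
    def __init__(self, size=16):
        self.size = size
        self.stack = []

    def push_call(self, return_ip):
        if len(self.stack) == self.size:
            self.stack.pop(0)  # assumed policy: overwrite the oldest entry
        self.stack.append(return_ip)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

    def invalidate(self):
        self.stack.clear()  # safer to predict nothing than predict wrong
```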
Indirect branches with multiple targets are predicted using the IP address and 26 bits of global history to index into the 512-entry indirect branch target array. Indirect branch predictions incur an extra 3 cycle penalty, but indirect branches with a single target and 256MB or less displacement are tracked through the lower latency out-of-page target array.
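Combining the IP with global history lets the same indirect branch map to different table entries depending on the path taken to reach it. The actual hash is not public; the XOR-fold below is purely an illustrative assumption.

```python
# Sketch of indexing a 512-entry indirect target array with the branch IP
# hashed against 26 bits of global branch history. The XOR-fold hash is
# an assumption; only the table size and history length come from the text.

HISTORY_BITS = 26
TABLE_ENTRIES = 512  # 9 index bits


def indirect_index(branch_ip, global_history):
    x = branch_ip ^ (global_history & ((1 << HISTORY_BITS) - 1))
    idx = 0
    while x:  # fold the wide value down to 9 bits
        idx ^= x & (TABLE_ENTRIES - 1)
        x >>= 9
    return idx
```

Because the history participates in the hash, a `switch` dispatched from two different call paths can occupy two entries, each holding the target that path actually goes to.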
If a cache line is only being used for instructions, then the branch information in the L1 BTB is compressed and stored in the ECC bits of the L2 cache when the line is evicted, and can be reloaded when the line returns to the L1I. The information is lost if the cache line is hit by a store or is evicted to main memory. L1I misses trigger a 64B fetch request to the L2, and also prefetch one or two additional cache lines.
Once the fetch address has been determined, the 32B of instructions from the L1I are sent to the Instruction Byte Buffer (IBB), which acts as a decoupling queue between the fetch and decoding stages. The IBB entries are 16B each, so a fetch will typically fill two at a time, and Jaguar has 16 entries, versus 12 for Bobcat. A small loop buffer tracks four recent 32B fetches and can bypass the instruction cache lookup mechanism to save power.
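The loop buffer's job is simple: if a fetch window was seen very recently, serve it from a tiny buffer instead of powering up the L1I lookup. A minimal sketch, assuming LRU replacement across the four entries (the real replacement and lookup details are not described):

```python
# Hypothetical loop buffer model: cache the four most recently fetched
# 32B windows so a tight loop can bypass the L1I lookup entirely.

from collections import OrderedDict


class LoopBuffer:
    def __init__(self, entries=4):
        self.entries = entries
        self.windows = OrderedDict()  # 32B-aligned fetch address -> bytes

    def fetch(self, addr, read_l1i):
        """Return (window bytes, source) for the 32B window containing addr."""
        base = addr & ~31  # align to the 32B fetch window
        if base in self.windows:
            self.windows.move_to_end(base)  # refresh LRU position
            return self.windows[base], "loop buffer (L1I bypassed)"
        data = read_l1i(base)  # normal, more power-hungry L1I path
        if len(self.windows) == self.entries:
            self.windows.popitem(last=False)  # evict least recently used
        self.windows[base] = data
        return data, "L1I"
```

Any loop whose body fits in four 32B windows (128B) can then run without touching the instruction cache at all.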