Silvermont’s instruction fetching spans the first three pipeline stages and incorporates a number of improvements to extract greater parallelism. While Silvermont has the same peak instruction throughput (instructions per cycle, or IPC) as Saltwell, out-of-order execution means there are substantial benefits to a more aggressive fetch strategy. The Saltwell core had two threads, but would stall easily on cache misses and other events; in contrast, Silvermont can continue to execute past stalled instructions and benefits from filling the entire out-of-order window. As a result, accurate branch prediction is vital to Silvermont’s performance and power efficiency.
The Saltwell branch predictor is not particularly aggressive. It uses a hybrid gshare mechanism to predict the direction of a branch, XORing the instruction pointer (IP) with a history of recent conditional branches to index into an 8K-entry branch history table that indicates direction. The branch target buffer (BTB), which predicts the target address of a taken branch, is 128 entries and 4-way associative. For calls and returns, there is an 8-entry return stack buffer for each thread.
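The gshare indexing scheme described above can be sketched in a few lines. This is an illustrative model, not Saltwell’s exact design: the 8K-entry table size comes from the article, while the 2-bit saturating counters and the 13-bit history length are assumptions for the sketch.

```python
# Minimal gshare sketch. The branch IP is XORed with a global history
# of recent conditional-branch outcomes to index a direction table.
# 2-bit saturating counters (0-3, taken if >= 2) are assumed here.

TABLE_SIZE = 8192            # 8K entries, as in Saltwell's history table
INDEX_MASK = TABLE_SIZE - 1  # 13 index bits

class Gshare:
    def __init__(self):
        self.table = [2] * TABLE_SIZE   # initialize weakly taken
        self.history = 0                # global branch history register

    def _index(self, ip):
        return (ip ^ self.history) & INDEX_MASK

    def predict(self, ip):
        return self.table[self._index(ip)] >= 2   # True = predicted taken

    def update(self, ip, taken):
        i = self._index(ip)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # shift the actual outcome into the history register
        self.history = ((self.history << 1) | int(taken)) & INDEX_MASK
```

Because the history is folded into the index, the same static branch can train different table entries depending on the path taken to reach it, which is what lets gshare capture correlated branch behavior.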
For Silvermont, the branch predictors are split into two separate components that work together to balance frequency, accuracy, and power. The first predictor is a BTB that controls instruction fetching and is tailored for low-latency predictions. The second, overriding predictor sits later, in the decode stage; it has substantially more time and information available to make predictions, which increases accuracy and reduces power requirements. The second predictor actually controls the speculative instructions issued into the back-end and can override earlier predictions, improving overall performance and efficiency.
On Silvermont, the IP indexes into the Branch Target Buffer (BTB) to determine the next fetch address. The BTB includes a 4-entry Return Stack Buffer (RSB) for handling calls and returns detected in the instruction stream; the RSB can recover from certain types of corruption caused by branch mispredictions, but it is not fully renamed. Taken branches that are correctly predicted by the BTB incur a single-cycle bubble in the instruction fetch stream, although the prefetch buffers should readily absorb such small delays.
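A return stack buffer is conceptually a small circular stack: calls push the return address, returns pop the predicted target. The sketch below assumes the 4-entry size from the article; the wrap-on-overflow policy is a common hardware choice, not a documented Silvermont detail.

```python
# Illustrative 4-entry return stack buffer (RSB). Real hardware wraps
# on overflow rather than stalling, so call chains deeper than the RSB
# silently overwrite the oldest entries, a source of return mispredicts.

class ReturnStack:
    def __init__(self, entries=4):        # 4 entries, per Silvermont's BTB
        self.entries = entries
        self.stack = [0] * entries
        self.top = 0                      # top-of-stack pointer

    def push(self, return_addr):          # on a predicted call
        self.top = (self.top + 1) % self.entries
        self.stack[self.top] = return_addr

    def pop(self):                        # on a predicted return
        addr = self.stack[self.top]
        self.top = (self.top - 1) % self.entries
        return addr
```

A mispredicted branch can push or pop entries down the wrong path; "not fully renamed" means Silvermont checkpoints only enough state to repair some, not all, of that corruption.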
The L1 instruction cache is largely unchanged from Saltwell; it is a 32KB, 8-way associative design with 64B lines that also caches pre-decode bits for previously decoded instructions. Cache lines are parity protected and the replacement policy is pseudo-least-recently-used (pseudo-LRU). The instruction cache also includes a new prefetcher that can request the next cache line from the L2 to reduce latency. Instruction streams are linear except for branches and exceptions, so this is a highly effective and low-cost approach.
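The next-line prefetcher amounts to one address computation: round the current fetch address down to its cache line and request the line after it. A minimal sketch, using the 64B line size from the article:

```python
# Next-line instruction prefetch: on a fetch to line N, request line
# N+1 from the L2 so a straight-line instruction stream rarely waits
# on a demand miss.

LINE = 64   # 64B cache lines

def next_line_prefetch(fetch_addr):
    line_base = fetch_addr & ~(LINE - 1)   # align down to the line
    return line_base + LINE                # address of the next line
```

The hardware only needs an adder and a request slot at the L2, which is why the article calls this a low-cost approach.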
Once an address is predicted, the instruction fetcher probes the L1 instruction cache. Since the capacity is no greater than the minimum page size times the associativity (4KB × 8), the cache access occurs in parallel with the L1 instruction Translation Lookaside Buffer (TLB) lookup. The L1 ITLB for Silvermont is 50% larger than Saltwell’s, with 48 fully associative entries for translating 4KB pages. Larger 2MB pages are fractured into 4KB pages in the ITLB, and there is no support for 1GB pages. Stores to code pages (e.g., self-modifying code) will invalidate the ITLB page, similar to the P4.
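The parallel cache and TLB lookup works because the set-index bits come entirely from the untranslated page offset. The arithmetic, using the cache parameters from the article, can be checked directly:

```python
# Why the I-cache lookup proceeds in parallel with the ITLB: the bits
# that select a set (plus the byte-in-line offset) must fit within the
# 4KB page offset, which holds when capacity / ways <= page size.

CAPACITY = 32 * 1024      # 32KB L1 instruction cache
WAYS = 8                  # 8-way set associative
LINE = 64                 # 64B lines
PAGE = 4 * 1024           # 4KB minimum page size

sets = CAPACITY // (WAYS * LINE)                        # 64 sets
set_plus_offset_bits = (sets * LINE).bit_length() - 1   # bits below translation
page_offset_bits = PAGE.bit_length() - 1                # 12 untranslated bits

# Virtual and physical set indices are identical when this holds:
assert CAPACITY // WAYS <= PAGE
```

With 64 sets of 64B lines, exactly 12 address bits select the set and byte, matching the 12-bit page offset; the TLB output is only needed afterward, for the tag compare.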
The instruction fetcher retrieves 16B of instructions from the instruction cache into one of six prefetch buffers, which decouple the fetch and decode stages. Prefetch buffers also exist in Saltwell, but they are statically partitioned between the two threads, so Silvermont can effectively fetch twice as far down the instruction stream. One advantage of this approach is that fetching can run ahead of decoding, insulating the rest of the pipeline from fetch-related stalls.
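The decoupling behavior is that of a small bounded queue between a producer (fetch) and a consumer (decode). A sketch under the article’s parameters, with six entries of 16B fetch chunks; the exact hand-off policy here is assumed for illustration:

```python
from collections import deque

# Fetch/decode decoupling queue: six prefetch buffers of 16B each.
# Fetch runs ahead while the queue has room; decode drains it, so
# short fetch bubbles (e.g., the taken-branch bubble) are absorbed.

class PrefetchBuffers:
    def __init__(self, entries=6):
        self.entries = entries
        self.q = deque()

    def can_fetch(self):                  # fetch stalls only when full
        return len(self.q) < self.entries

    def fetch(self, chunk16b):
        assert self.can_fetch()
        self.q.append(chunk16b)

    def decode(self):                     # None models a decode bubble
        return self.q.popleft() if self.q else None
```

Because Saltwell split these buffers statically between two threads, each thread saw only three entries of runahead; Silvermont’s single thread gets all six.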