The changes to Poulson’s microarchitecture are comprehensive and touch every part of the pipeline, but instruction fetch is perhaps the least affected. Fine-grained multi-threading is the biggest change for the fetch portion of the front-end. Previously, fetching was essentially single threaded; in Poulson, it must be shared dynamically between two threads. In all likelihood, the two threads alternate cycles based on priority counters, with the goal of keeping the later stages of the pipeline full.
The first pipeline stage in Poulson generates the target instruction pointer (IP) that determines the next bundle for execution, either by branch prediction or by incrementing the IP. Branching in Itanium is fairly different from most other ISAs, especially x86. To start with, each bundle is 16B long, naturally aligned and contains 3 instructions; in x86, instructions are variable length and can be byte aligned. There are three types of branches: IP relative, long and indirect. IP relative branches have a 21-bit displacement counted in bundles, which gives a reach of 16MB in either direction. Long branches have a 60-bit displacement that spans the entire address space, and they use two instruction slots in a bundle. Indirect branches use the 8 dedicated 64-bit branch registers, which hold target addresses. Itanium can predicate any instruction using 64 predicate registers, whereas most other ISAs only have limited conditional execution (e.g. CMOV in x86). So the compiler can remove some branches from the instruction stream entirely, at the cost of creating data dependencies. Itanium (like Power) also has a loop count register to perfectly predict loops with known iteration counts, which provides some of the same benefits as the loop predictor found in modern x86.
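The displacement figures above follow directly from the 16B bundle size, and a few lines of arithmetic make that concrete. The sketch below is illustrative only: the function names `branch_reach` and `branch_target` are hypothetical, and it assumes a signed displacement counted in bundles and added to the bundle-aligned IP, with wraparound at 64 bits.

```python
BUNDLE_BYTES = 16          # each Itanium bundle is 16B and naturally aligned
ADDR_MASK = (1 << 64) - 1  # 64-bit address space


def branch_reach(disp_bits, bundle_bytes=BUNDLE_BYTES):
    """Byte range (lowest, highest offset) of a signed displacement
    that is counted in bundles rather than bytes."""
    lowest = -(1 << (disp_bits - 1)) * bundle_bytes
    highest = ((1 << (disp_bits - 1)) - 1) * bundle_bytes
    return lowest, highest


def branch_target(ip, disp, bundle_bytes=BUNDLE_BYTES):
    """Target of an IP relative branch: align the IP to its bundle,
    then add the displacement scaled by the bundle size."""
    bundle_ip = ip & ~(bundle_bytes - 1)
    return (bundle_ip + disp * bundle_bytes) & ADDR_MASK


MB = 1 << 20
short_lo, short_hi = branch_reach(21)  # IP relative branch
long_lo, long_hi = branch_reach(60)    # long branch

print(short_lo // MB, "MB to just under", (short_hi + BUNDLE_BYTES) // MB, "MB")
print("long branch span in bytes: 2 **", (long_hi - long_lo + BUNDLE_BYTES).bit_length() - 1)
print(hex(branch_target(0x4008, 3)))
```

Because the low 4 bits of a bundle address are always zero, a 21-bit field covers 25 bits of byte offset, and a 60-bit field covers all 64.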
Figure 2 – Poulson Instruction Fetch and Comparison
Itanium has always had extensive branch prediction resources, because branch prediction matters for commercial workloads and predictions must be delivered in a single cycle. McKinley had an L1 branch cache (L1B) containing local branch history data for any line in the L1 instruction cache. The L1B and L1I were accessed together in a single cycle by every fetch, and the result was combined with global branch predictors to yield a final prediction for the next instruction fetch. The L1B was backed by an L2 branch cache with 24K entries for instructions that had been evicted from the L1I. Poulson also includes a dedicated cache for branch information, referred to as the FLB (first level branch cache), and likely includes an expanded L2B as well. The actual branch prediction algorithms are unknown, but will hopefully be disclosed in the future.
After IP generation, Poulson accesses the ITLB and instruction cache to fetch up to 32B, or two bundles (containing up to 6 instructions). Overall, the instruction caches seem to be largely similar to Tukwila. One of the more unusual aspects of Itanium is that, starting with Montecito, both the L1 and L2 caches are split for data and instructions. Most other microarchitectures only have separate L1 caches, with a unified L2. Like most architectures, and in direct contrast to x86, Itanium does not keep the instruction caches coherent for self-modifying code. Poulson’s L1I is a 16KB, 4-way design with 64B cache lines. It is backed by a larger 512KB, 8-way L2 with 128B cache lines and 9-cycle latency, which is 2 cycles longer than Tukwila’s L2I. The 32-entry, fully associative instruction TLB is accessed in parallel with the L1I and only holds 4KB pages. The larger L2 ITLB can hold translations for any page size (4KB-4GB), and up to half of the L2 ITLB entries can be software managed. While it is possible that there is a small buffer between fetch and decode, none was mentioned by Intel.
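The cache parameters above imply a particular number of sets and a particular split of each fetch address into tag, index and offset. The following is a minimal sketch of that arithmetic, not a description of Poulson's actual indexing: the `CacheGeometry` class and its method names are hypothetical, and it assumes conventional power-of-two set indexing on the low address bits.

```python
from dataclasses import dataclass

BUNDLE_BYTES = 16  # one Itanium bundle
FETCH_BYTES = 32   # Poulson fetches up to two bundles per access


@dataclass(frozen=True)
class CacheGeometry:
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def sets(self):
        # capacity = sets * ways * line size
        return self.size_bytes // (self.ways * self.line_bytes)

    @property
    def offset_bits(self):
        return self.line_bytes.bit_length() - 1

    @property
    def index_bits(self):
        return self.sets.bit_length() - 1

    def split(self, addr):
        """Split an address into (tag, set index, byte offset in line).
        Assumes plain modulo indexing, which Intel has not confirmed."""
        offset = addr & (self.line_bytes - 1)
        index = (addr >> self.offset_bits) & (self.sets - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index, offset


L1I = CacheGeometry(size_bytes=16 * 1024, ways=4, line_bytes=64)
L2I = CacheGeometry(size_bytes=512 * 1024, ways=8, line_bytes=128)

for name, cache in (("L1I", L1I), ("L2I", L2I)):
    print(name, "sets:", cache.sets,
          "bundles per line:", cache.line_bytes // BUNDLE_BYTES,
          "fetches per line:", cache.line_bytes // FETCH_BYTES)
```

Under these assumptions the L1I has 64 sets and each 64B line holds four bundles, so a single line can satisfy two full 32B fetches. The L2I works out to 512 sets, with eight bundles per 128B line.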