New Itanium Microarchitecture at ISSCC 2011

Microprocessor vendors and researchers use the International Solid State Circuits Conference as a venue to share novel techniques and advertise upcoming products. The advanced program for 2011 was just released, and hints at some rather interesting presentations slated for next February – although the abstracts will not be available for another week. There are roughly 10 papers on mainstream microprocessors that will be discussed over the course of two days. Many of these microprocessors, such as IBM’s z196, AMD’s Bobcat or Intel’s Sandy Bridge, have been discussed before in different venues. However, Intel managed to slip an entirely novel disclosure regarding the high-end, server-oriented Itanium family into the ISSCC advanced program.

Previously, Intel had indicated that the Itanium family would skip from the 65nm 4-core Tukwila directly to the 32nm 8-core Poulson. This was largely because delays in two successive generations (90nm and 65nm) resulted in a product portfolio that was two process nodes behind mainstream servers and one node behind other high-end server competitors. At the same time, Intel claimed that Poulson was a new microarchitecture. The reception was generally skeptical, based on history. The various Itanium microarchitectures have all been identical or highly similar to the original 180nm single-core McKinley design – which was started in the mid to late 1990s and arrived in 2002. In contrast, the x86 cores had been radically changed no fewer than four times (P4, Prescott, Core 2, Nehalem). This difference in pace and resources is largely because Itanium is simply a much lower volume product and has less impact on the bottom line. It is probably comparable (within a factor of two) to the high-end Xeons in terms of revenues, but has higher development costs because it cannot re-use the core design of Intel’s high volume parts. So to some extent, Itanium has been constrained by the business model; one of the aims of QPI is to share as much development as possible between high-end Xeons and Itanium, presumably to free up resources for more microarchitectural improvements.

However, the 32nm Poulson genuinely appears to be a reasonably novel microarchitecture. Paper 4.8 at ISSCC is entitled “A 32nm 3.1 Billion Transistor 12-Wide-Issue Itanium Processor for Mission-Critical Servers.” Finer details are not yet available, but the title alone implies a significant departure from the current microarchitecture. This is certainly good news for Itanium customers. Intel and HP were contractually obliged to continue with Itanium through 2011, but the future was uncertain. Developing a new microarchitecture suggests that Intel and HP see a reasonably bright future over the next 5-10 years and are willing to invest to make Itanium attractive to customers and more competitive with the alternatives (both from other RISC vendors and x86). Of course, the results of that investment have yet to be seen – but the ISSCC paper will provide a first glimpse. In the meantime, it is possible to speculate about what Poulson might look like.

The current Itanium core is a dual-threaded, 6-issue (or 2-bundle) design and already has extremely high IPC (instructions per cycle) for single-threaded workloads. The two threads switch on long-latency events, such as cache misses, which is moderately effective at hiding some memory latency. Itanium relies on compilers to aggressively schedule instructions for parallel execution into bundles. One of the challenges is that the parallelism in software is quite varied. Most server workloads like transactional databases, Java applications and ERP tend to have very low IPC (typically less than 0.5) and cannot utilize a 6-wide core, due to branches and memory latency. On the other hand, decision support or analytic databases are much more regular and achieve very high overall utilization (IPC > 1.5).

Poulson seems to go even further in the direction of high IPC, given that it is a 12-wide design. In comparison, other server processors from Intel, IBM and AMD are between 4 and 6-wide (although their instructions are quite different and encode more work per instruction than Itanium’s). This is even more puzzling, considering that Itanium needs to effectively address low IPC workloads (e.g. TPC-C), where issuing 12 instructions from the same thread is impossible. There are two reasonable design choices that might explain the new 12-wide microarchitecture.
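
To make the puzzle concrete, here is a small back-of-the-envelope sketch of how much of the issue width a single thread would actually use. It takes the rough IPC figures quoted above (0.5 for transactional workloads, 1.5 for analytic ones) as point values, which is an assumption, and the function name `slot_utilization` is purely illustrative.

```python
def slot_utilization(ipc: float, issue_width: int) -> float:
    """Fraction of issue slots doing useful work in an average cycle."""
    if issue_width <= 0:
        raise ValueError("issue_width must be positive")
    return ipc / issue_width


# Rough IPC figures from the text, treated as point values (an assumption).
WORKLOAD_IPC = {"transactional": 0.5, "analytic": 1.5}

for name, ipc in WORKLOAD_IPC.items():
    for width in (6, 12):  # current 6-issue core vs. 12-wide Poulson
        used = slot_utilization(ipc, width)
        print(f"{name:13s} IPC={ipc} on {width}-wide: {used:.1%} of slots used")
```

Under these assumptions a transactional workload fills roughly 8% of the slots on a 6-wide core and roughly 4% on a 12-wide core, which is why doubling the width only makes sense if something other than a single thread can fill the extra slots.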

The most likely theory is that Poulson has simultaneous multi-threading, rather than the more primitive switch-on-event model used in previous generations. Poulson is a 4-bundle core, where each bundle is 3 instructions. A reasonable way to keep such a wide microarchitecture busy is 4-way simultaneous multi-threading (i.e. 32 threads per chip). Each thread would have a separate queue of bundles, and the microprocessor would select one or more bundles from each thread. For high IPC workloads that are compiled appropriately, 4 bundles could come from a single thread – capturing the benefits of a wide architecture. The microarchitecture would probably be clustered, with a small latency penalty (1-2 cycles) for bypassing between different parts of the core – to simplify resources such as the forwarding network and register files. To improve performance for low IPC applications, each of the 4 threads could issue a single bundle; or 2 threads could issue 2 bundles each. Compiling applications to target a single bundle could even decrease the number of NOPs in the instruction stream. If this were the case, the question is whether software or hardware controls the configuration of the core as 1, 2 or 4 simultaneous threads. Depending on how the clustering was done, it might even be possible to re-use the existing Itanium cores as one ‘half’ of the new Poulson core.
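
The bundle selection speculated about above can be sketched in a few lines. This is a toy model, not a description of Poulson: the round-robin policy, the names `select_bundles` and `bundles_per_thread`, and the idea that every queued bundle is ready to issue are all assumptions made for illustration. Only the 4-bundle width, 3 instructions per bundle and 8 cores come from the text.

```python
from collections import deque

BUNDLES_PER_CYCLE = 4         # 4-bundle core (from the text)
INSTRUCTIONS_PER_BUNDLE = 3   # 3 instructions per bundle (from the text)
CORES_PER_CHIP = 8            # 8-core Poulson (from the text)
THREADS_PER_CORE = 4          # speculated 4-way SMT
THREADS_PER_CHIP = CORES_PER_CHIP * THREADS_PER_CORE


def select_bundles(queues, slots=BUNDLES_PER_CYCLE):
    """Pick up to `slots` bundles this cycle, round-robin across threads.

    `queues` holds one deque of pending bundles per hardware thread.
    Hypothetical policy: every thread with work gets one bundle before
    any thread gets a second, so idle threads donate their slots.
    """
    picked = []
    while len(picked) < slots:
        progress = False
        for thread_id, queue in enumerate(queues):
            if queue and len(picked) < slots:
                picked.append((thread_id, queue.popleft()))
                progress = True
        if not progress:  # every queue is empty
            break
    return picked


def bundles_per_thread(picked, n_threads):
    """Count how many bundles each thread issued in one cycle."""
    counts = [0] * n_threads
    for thread_id, _ in picked:
        counts[thread_id] += 1
    return counts


# One busy thread takes all 4 bundles; four busy threads get one each.
single = [deque(range(8)), deque(), deque(), deque()]
four = [deque(range(8)) for _ in range(4)]
print(bundles_per_thread(select_bundles(single), 4))
print(bundles_per_thread(select_bundles(four), 4))
```

With this policy the 4/0/0/0, 2/2/0/0 and 1/1/1/1 configurations fall out of which threads happen to have work, which corresponds to the hardware-controlled option. A software-controlled scheme would instead fix the per-thread bundle limit ahead of time.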

A second option is that the new Poulson core is an out-of-order design. This is complementary to simultaneous multi-threading rather than mutually exclusive, so both options are possible. However, since Itanium was specifically conceived to avoid out-of-order execution, this seems a little less likely – it would definitely be a philosophical retreat, although possibly a wise one. One advantage of this approach is that the NOPs in the instruction stream could be dynamically removed. Perhaps more importantly, an out-of-order Itanium would be able to dynamically schedule older code (e.g. for the existing 6-issue machines) onto a wider 12-issue design. However, this would be a fundamental redesign that re-uses very little of the previous design and introduces many new complexities.
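
As a rough illustration of the NOP point, the sketch below models bundles as 3-tuples of mnemonic strings and drops the NOPs before the remaining instructions would enter a scheduling window. The representation, the mnemonics and the function names are hypothetical; real bundles are binary encodings, and nothing here reflects how Poulson actually works.

```python
NOP = "nop"


def strip_nops(bundles):
    """Flatten bundles into an instruction list with NOP slots removed."""
    return [op for bundle in bundles for op in bundle if op != NOP]


def nop_fraction(bundles):
    """Share of instruction slots that hold NOPs."""
    total = sum(len(bundle) for bundle in bundles)
    if total == 0:
        return 0.0
    nops = sum(op == NOP for bundle in bundles for op in bundle)
    return nops / total


# Hypothetical code compiled for the existing 2-bundle, 6-issue machine.
old_code = [("ld", NOP, NOP), ("add", NOP, "br")]
print(strip_nops(old_code))    # instructions that would reach the scheduler
print(nop_fraction(old_code))  # slots wasted on padding
```

In this example half of the six slots are padding, so a dynamic scheduler would only need to track three real instructions.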

Intel’s presentation at ISSCC will undoubtedly reveal these details of Poulson and many more. Other areas of interest include the mix of execution units, pipeline depth, frequency, cache hierarchy and power management. Cumulatively, all this information will give a good idea of how much effort Intel invested – and what their goals for the Itanium architecture are over the next 5-10 years. At the end of the day, Poulson will be judged based on the results and not just the novelty of the microarchitecture; but changes there will go a long way towards improving performance.