General Purpose VLIW is Dead
Itanium was originally conceived in the early 1990s by the architects and engineers who had worked on HP’s PA-RISC. Many of them were convinced that dynamic instruction scheduling and out-of-order execution would ultimately prove to be too complex and power hungry. They believed that single threaded performance would not scale in the future. It is certainly true that many of the circuits in out-of-order designs can be power hungry – the re-order buffer, schedulers and renaming logic are fairly complicated and do not scale well to very large sizes. Instead of relying on extensive scheduling and renaming logic, the architects from HP and Intel took a different approach – embracing a VLIW (Very Long Instruction Word) philosophy. Itanium pushed the instruction scheduling burden onto the compiler and included a number of ISA features to assist software scheduling. The hardware was intended to be extremely simple with totally static scheduling. In theory, removing all the complicated scheduling and out-of-order logic would reduce power and scale better to smaller process nodes.
However, these gloomy predictions about out-of-order execution were not entirely accurate. The scheduling windows of modern CPU cores like Bulldozer or Sandy Bridge are 3-4X larger than those of early aggressive x86 designs like the Pentium Pro (40 entry ROB), and larger still than that of the K5 (16 entry). The execution width of out-of-order designs has grown more slowly. Early microarchitectures were 2- and 3-issue wide, and modern ones have grown to 4-issue, but each uop in a modern core is much more powerful than before. Considering these factors, the effective execution width has probably grown by a factor of 2 – and more if a workload can be vectorized. In terms of single threaded performance, dynamically scheduled, out-of-order designs have significantly improved over the last decade, contrary to the expectations of the early Itanium architects.
The first two Itanium designs (Merced and McKinley) stayed close to the original philosophy. The compiler grouped instructions that could be executed in parallel into 3-wide bundles and the hardware statically executed up to 2 bundles per cycle. With impressive execution resources and an extremely aggressive cache hierarchy, Itanium achieved incredibly high single threaded performance – especially for floating point code. Itanium was intended to be an open and commodity replacement for strongholds of proprietary architectures like workstations, HPC and servers. Over time though, the market for Itanium shifted, and x86 took over workstations, HPC and large swathes of the server market. Today, Itanium (like the competing Power and SPARC processors) is largely used for highly scalable and reliable servers. Later microarchitectures, such as Montecito and Tukwila, added multi-threading and other features, but largely retained the same design as the 180nm McKinley.
Poulson is a radical departure from the initial Itanium philosophy, and takes into account years of experience as well as technology and market changes. Poulson abandons the idea of simple hardware controlled by the compiler and is the first dynamically scheduled Itanium design, with modest out-of-order execution. The microarchitecture was rebalanced to favor server workloads, rather than HPC and workstations. Poulson has a more sophisticated multi-threading and multi-core architecture, recognizing both the need to tolerate memory latency and the technical changes in the industry that have occurred since the first Itaniums debuted on 180nm in 2000. For all the changes though, some things remain the same. Poulson focuses on wide execution and instructions-per-cycle (IPC) rather than frequency, and has excellent reliability features. The die size is a substantial 544mm², which accommodates massive on-die caches and scalability features for large servers.
This report describes in detail the architecture and 11-stage pipeline of Poulson, a multi-threaded, 6 issue super-scalar, in-order Itanium microprocessor. Poulson is an 8-core design, manufactured on Intel’s 32nm high performance bulk process. As our ISSCC Preview discussed, Poulson is the successor to the 65nm quad-core Tukwila design – skipping the 45nm node. The accumulated schedule delays caused by Montecito’s Foxton power controller and the system interfaces on Tukwila were significant enough that Intel opted to simply skip a node to deliver competitive performance.
Poulson has already taped out, which is a requirement for ISSCC papers, but products are slated for release in 2012 (most likely in the first half), reflecting the extremely long test and validation process for mission critical systems. As a result, Intel did not completely describe the architecture in their ISSCC paper. Fuller disclosures will come at Hot Chips this summer and perhaps at future conferences. Our report will carefully indicate what is known, and intelligently speculate on some of the unknown architectural features.
One area that Poulson improved is multithreading. Poulson is said to use “fine-grain multithreading”, but few additional details were given at ISSCC. Unfortunately, the terminology around multithreading is inconsistent throughout the industry; Figure 1 shows the three common styles of multithreading. The 90nm Montecito introduced 2-way, switch-on-event multi-threading (SoEMT, also referred to as vertical or coarse-grain multithreading). Each core has two sets of register files, return address stacks (for branch prediction), and advanced load address tables (ALATs – used for speculative memory accesses). The two threads are not simultaneous, so the core exclusively executes one thread until a switch is triggered by L3 misses, ALAT invalidation (used for spinlocks), time out counters (for ensuring fairness) or software/power management hints. Switching between the two threads requires a pipeline flush plus 7 cycles of delay. With a total of 15 cycles of overhead for each thread switch, the main benefits of SoEMT are hiding memory latency and simplicity of design.
Figure 1 – Multithreading Strategies
Traditionally, fine-grain multithreading (FGMT) refers to a scheme where only one thread issues instructions in each cycle. However, the threads are interleaved on a cycle-by-cycle basis, and there is no penalty for switching. This is commonly used with in-order microprocessors and achieves higher performance than SoEMT by improving throughput and tolerating many different types of latency. Simultaneous multithreading (SMT) is more complicated and can issue instructions from multiple threads in a single cycle. SMT can be used on in-order designs (e.g. POWER6, Cell PPE and Xenon), but it is most common for out-of-order microprocessors. As Figure 1 shows, SMT is the most efficient and has the highest performance, but fine-grain multithreading is still a huge improvement over SoEMT. Poulson could conceivably use either SMT or FGMT, although the terminology seems to suggest the latter.
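The differences between the three styles can be illustrated with a toy issue-slot model. The Python sketch below is only an illustration under stated assumptions, not a description of any real Itanium pipeline: the 6-slot issue width follows from two 3-instruction bundles per cycle and the 15 cycle switch cost is the Montecito figure quoted above, but the `LONG_STALL` threshold, the `Thread` class, the `simulate` function and the example workload are all hypothetical. Each thread is modeled as a chain of dependent instruction groups, each followed by a stall, and each group is assumed to fit within the issue width.

```python
from dataclasses import dataclass

WIDTH = 6         # issue slots per cycle (two 3-instruction bundles)
SWITCH_COST = 15  # SoEMT switch overhead in cycles, as quoted for Montecito
LONG_STALL = 50   # hypothetical: stalls at least this long trigger a switch


@dataclass
class Thread:
    groups: list       # [(instructions, stall_cycles_after_group), ...]
    pos: int = 0       # index of the group currently being issued
    issued: int = 0    # instructions already issued from that group
    ready_at: int = 0  # first cycle in which the thread may issue again

    def done(self):
        return self.pos >= len(self.groups)

    def ready(self, cycle):
        return not self.done() and self.ready_at <= cycle

    def issue(self, cycle, slots):
        """Issue up to `slots` instructions. Returns (count, stall), where
        stall is the latency started by completing the group (else 0)."""
        size, stall = self.groups[self.pos]
        count = min(slots, size - self.issued)
        self.issued += count
        if self.issued < size:
            return count, 0
        self.pos, self.issued = self.pos + 1, 0
        self.ready_at = cycle + 1 + stall
        return count, stall


def simulate(policy, workloads):
    """Return the number of cycles needed to issue every group."""
    threads = [Thread(list(w)) for w in workloads]
    n, cycle, active = len(threads), 0, 0
    while not all(t.done() for t in threads):
        if policy == "soemt":
            t, other = threads[active], (active + 1) % n
            if t.done():                      # finished: hand over the core
                active, cycle = other, cycle + SWITCH_COST
                continue
            if t.ready(cycle):
                _, stall = t.issue(cycle, WIDTH)
                if (stall >= LONG_STALL and other != active
                        and not threads[other].done()):
                    active, cycle = other, cycle + SWITCH_COST
        else:
            slots = WIDTH
            for i in range(n):                # rotate priority every cycle
                t = threads[(cycle + i) % n]
                if t.ready(cycle) and slots > 0:
                    count, _ = t.issue(cycle, slots)
                    slots -= count
                    if policy == "fgmt":      # only one thread per cycle
                        break
        cycle += 1
    return cycle


# Hypothetical workload: four 4-instruction groups with one long stall.
work = [(4, 0), (4, 60), (4, 0), (4, 0)]
for policy in ("soemt", "fgmt", "smt"):
    print(policy, simulate(policy, [work, work]))
```

For this particular workload, one thread running alone needs 64 cycles, so two threads run back to back would need 128. The model finishes both threads in 81 cycles with SoEMT, 67 with FGMT and 65 with SMT. The ordering matches the qualitative picture in Figure 1, but the exact gaps depend entirely on the assumed workload and stall threshold.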