Poulson’s performance has not been discussed, but there are enough clues to put together some intelligent estimates. Given the scope of the changes, performance per core could improve by 25-40%, through a combination of higher frequency and IPC. On top of that, the core count has doubled, so the net gain could be as high as 2.8X. For workloads that are memory and I/O bandwidth limited, the gains will be substantially smaller, but still significant.
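The upper bound quoted above is a straightforward product of the two factors. As a minimal sketch (the function name is illustrative, and the model assumes throughput scales perfectly with core count, which is the best case rather than a prediction):

```python
def net_gain(per_core_gain: float, core_ratio: float) -> float:
    """Ideal throughput gain: per-core speedup times the growth in core count."""
    return (1.0 + per_core_gain) * core_ratio


CORE_RATIO = 2.0  # core count doubled, per the text

low = net_gain(0.25, CORE_RATIO)   # 25% per-core improvement
high = net_gain(0.40, CORE_RATIO)  # 40% per-core improvement

print(f"estimated net gain: {low:.1f}X to {high:.1f}X")
```

Under these assumptions the range works out to 2.5X to 2.8X; bandwidth-bound workloads would land below it.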
Poulson’s microarchitecture (Figure 8) should increase instructions per cycle by 10-15%. Dynamic scheduling will boost IPC, although to a lesser extent than full-blown out-of-order execution, and removing the NOPs also helps. The 12-wide back-end can swiftly clear all the stalled instructions when a cache miss is resolved, which helps average IPC even though the core is only 6-wide due to fetch and decode constraints. Poulson’s better multithreading and replicated DTLBs will raise utilization of the execution pipelines and data caches significantly and help hide low-latency events (e.g. L1 or L2 cache misses). The only loss of IPC in the core should come from scaling back to 2 memory pipelines – but for most software, this is a small factor.
Outside the core, the impact of the L3 cache redesign is complicated to assess. Moving to a large, highly associative shared cache should raise the hit rate by reducing associativity conflicts and eliminating duplicated data. On the other hand, the L3 cache per core dropped from 6MB to 4MB, so capacity conflicts are a bigger issue. The higher L3 latency over the ring could significantly hurt performance. The total memory and I/O bandwidth grew by 33%, but the net per-core bandwidth dropped by roughly 33% (1.33X the bandwidth shared among 2X the cores), since Intel doubled the core count.
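The per-core figures follow directly from the ratios in the paragraph above. A small sketch makes the arithmetic explicit; the helper name is illustrative, and the inputs are only the numbers given in the text (1.33X total bandwidth, 2X cores, 6MB and 4MB of L3 per core):

```python
def per_core_change(total_ratio: float, core_ratio: float) -> float:
    """Fractional change in a per-core resource when the total and the core count both scale."""
    return total_ratio / core_ratio - 1.0


# Total memory and I/O bandwidth grew 33% while the core count doubled.
bandwidth_change = per_core_change(total_ratio=1.33, core_ratio=2.0)

# L3 capacity per core is given directly: 6MB before, 4MB after.
l3_change = 4.0 / 6.0 - 1.0

print(f"per-core bandwidth change: {bandwidth_change:+.1%}")
print(f"per-core L3 capacity change: {l3_change:+.1%}")
```

Both per-core resources shrink by about a third, which is why bandwidth-bound and cache-sensitive workloads should see smaller gains than the headline estimate.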
Figure 8 – Poulson and Tukwila Microarchitectures
A top-down analysis yields numbers in the same range. The change in process technology alone should raise performance by 4X. However, Poulson is ~0.9X the power and ~0.8X the area of Tukwila – in particular using 118mm² less for the cores (158mm² vs. 276mm², or ~0.6X). So the performance gain should be somewhere around 60-70% of the ideal 4X, or roughly 2.4-2.8X, which is in rough concordance with our bottom-up estimates.
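The top-down estimate can be checked the same way. This is only a sketch of the arithmetic in the paragraph above: the 60-70% range is the article's judgment rather than something derived from the area ratio, and the variable names are illustrative.

```python
IDEAL_PROCESS_GAIN = 4.0  # ideal gain from the process change, per the text

TUKWILA_CORE_AREA_MM2 = 276.0
POULSON_CORE_AREA_MM2 = 158.0

area_saved = TUKWILA_CORE_AREA_MM2 - POULSON_CORE_AREA_MM2
core_area_ratio = POULSON_CORE_AREA_MM2 / TUKWILA_CORE_AREA_MM2

# Apply the article's 60-70% discount to the ideal scaling.
top_down_low = IDEAL_PROCESS_GAIN * 0.60
top_down_high = IDEAL_PROCESS_GAIN * 0.70

print(f"core area saved: {area_saved:.0f} mm2 (ratio {core_area_ratio:.2f}X)")
print(f"top-down estimate: {top_down_low:.1f}X to {top_down_high:.1f}X")
```

The resulting 2.4X to 2.8X overlaps the 2.5X to 2.8X range obtained from per-core gain times core count.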
Poulson is the first truly new Itanium microarchitecture in over a decade. It is also the first design that has been based on feedback and testing of real systems. The first two generations – Merced and McKinley – were based on simulations and a nascent software ecosystem. Simulators are critical for guiding development, but they are also no substitute for performance measurements of real software running on real hardware collected over the years.
The irony is that Poulson departs from the principles behind Itanium and follows a much more nuanced approach to computer architecture. The Itanium architecture and early implementations were a reaction to the increasing hardware complexity in the early 1990s. They were based around the theory that the hardware should be very simple and almost totally managed by software. Poulson is a marriage of the explicit compiler-driven parallelism of the Itanium architecture to dynamic scheduling. In some cases, the compiler can achieve remarkable results – and simple hardware is the most efficient. But general-purpose workloads, especially for servers, are unpredictable – with branches, cache misses and TLB misses all disrupting the careful compiler scheduling. Poulson’s dynamic scheduling deals with the unpredictable nature of real software, while also taking advantage of any explicit parallelism that the compiler can extract. While most of Intel’s efforts focused on the core pipeline, moving to a shared last level cache is a significant improvement from a multi-core system architecture standpoint.
Overall, Poulson is both an interesting and promising design. The performance will be a tremendous improvement for the Itanium line, and HP’s servers in particular will benefit. It is unlikely that Poulson will exceed the performance of IBM’s contemporary Power microprocessors, as the latter have a substantially larger power budget and vastly more expensive packaging. However, the performance gap will narrow considerably. Itanium’s performance lead over SPARC-based rivals from Oracle and Fujitsu will only grow, as Intel leapfrogs to 32nm. Longer term though, all three architectures will face renewed competition from high-end x86 chips, such as the 32nm Westmere-EX.