Performance, Conclusions and References
While Intel has not disclosed the target frequency or TDP for Nehalem, it is reasonable to expect a slight loss in frequency and a larger power envelope. The thermal envelope for Nehalem will increase, since the power budget allocated to the northbridge in a traditional system will become available to the CPU package. However, the package level thermal limit will probably remain under 150W, since that is the limit for affordable cooling solutions. Judging by Tukwila, incorporating a triple channel memory controller and 2 QPI links will probably cost around 12-20W of power. Within those limits, Nehalem will not be able to increase frequency and may actually decrease frequency slightly.
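The budget arithmetic above can be made concrete with a small sketch. All of the figures here are assumptions taken from the estimates in this article (the ~150W package limit and the 12-20W uncore cost inferred from Tukwila), not Intel disclosures:

```python
# Illustrative power-budget arithmetic for a Nehalem-class package.
# All numbers are assumptions from the article's estimates, not Intel data.

PACKAGE_LIMIT_W = 150        # assumed package-level thermal limit
UNCORE_COST_W = (12, 20)     # assumed range for IMC + 2 QPI links (per Tukwila)

# Power remaining for the cores once the integrated memory controller
# and QPI links take their share of the package budget.
core_budget_w = [PACKAGE_LIMIT_W - cost for cost in UNCORE_COST_W]
print(core_budget_w)  # [138, 130]
```

Under these assumptions, the cores are left with roughly 130-138W, which is why the integration is unlikely to leave headroom for a frequency increase.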
Nehalem’s performance will come from refining the already impressive Core 2 microarchitecture and coupling it with a high performance memory and I/O system with minimal frequency impact. In workloads that are not particularly bandwidth dependent, such as general integer applications, Nehalem will provide a moderate boost over the previous generation. The performance gains will largely come from the integrated memory controller, microarchitectural innovation and circuit techniques (the latter was not discussed by Intel at IDF and will probably be saved for later disclosure). For floating point and HPC workloads that are typically bandwidth bound, Nehalem will be nothing short of a miracle – with performance gains of 2X or better. Commercial server workloads such as OLTP databases, decision support and virtualization will certainly benefit from more bandwidth and lower latency as well, but not to the same extent as HPC or floating point applications. Of course, when thinking about performance it is essential to also keep time in mind – Beckton will be about a year behind Gainestown and Bloomfield.
When the first implementation of Nehalem comes to market, it will compete primarily against AMD’s 65nm Barcelona; both are shown below in Figure 5. AMD will also have the 45nm Shanghai and Griffin, a 65nm mobile device, in their product portfolio, and Intel will have older 45nm Core 2 based products as well. Core 2 already has an IPC edge over most alternatives from AMD, and Nehalem will widen this gap. That being said, this is the first time that AMD has been able to muster a complete platform under one roof. Certainly in the desktop and mobile markets, that will help AMD create a more attractive offering relative to Intel without getting into a performance race – one that they would likely lose. Or perhaps more to the point, AMD will try to focus on graphics performance rather than CPU performance. While this may work for the desktop and mobile markets, that is decidedly not the case for servers and workstations, where the platform matters, but having the fastest CPU probably matters more.
Figure 5 – Nehalem and Barcelona Comparison
Nehalem is a once in a decade event for Intel – the opportunity to redefine their system architecture so that it will continue to scale for the next ten years (or more). This is an extremely significant change for Intel and the industry as a whole, as systems will need to be redesigned across the board. Intel’s integrated memory controller and on-die routing are relatively late compared to the first efforts by the Alpha team at DEC and AMD’s K8. This is likely partially by choice, but also simply due to sheer inertia. AMD had no presence in the server market before they made the shift, which meant that relatively few vendors were bothered by it. In contrast, Intel was held back by the millions of existing FSB based products they were shipping every year and by industry players who would rather defer the transition a little longer. While this has posed some competitive problems for Intel since 2003, there are some advantages – by the time Intel started down the path of greater integration, their design teams had a chance to learn from the mistakes of others and improve upon them. This bodes well for Intel’s new system architecture, as it appears to be well thought out – but the proof will come later this year with the first Nehalem based products.
Nehalem is also the first opportunity for a major overhaul of the Core 2 microarchitecture that first shipped in June 2006. Intel’s Hillsboro design team took great care to create a flexible and scalable processor that can be adapted and customized for all of the market segments it will serve – a far different approach than Intel’s Haifa design team took with Merom. The Merom team was under an incredible amount of pressure from Intel’s management, and the extra flexibility would not have been nearly as valuable as the timely delivery of the product. Moreover, the extra flexibility and customization are needed for Nehalem because of the system level integration of the microprocessor itself.
It seems as though almost every part of Nehalem’s pipeline has been tweaked, extended or somehow refined, except for the functional units. The bulk of the changes were made to the memory pipeline, to complement the changes in system architecture. However, the single biggest change and performance gain made in the core is simultaneous multithreading, which could improve server workloads anywhere in the range of 10-40%. The other techniques that Intel’s architects have implemented are much harder to pin down precisely, but will tend to apply more evenly across all workloads. Ultimately, it will be very interesting to see how each technique impacts performance, but that information will not be forthcoming until later this year (or perhaps next year).
While most of Nehalem has been discussed by Intel at this IDF in Shanghai, there is still a great deal of information to look forward to – and not just the performance numbers. Intel has not discussed any of the circuit level techniques used in Nehalem, probably partly for competitive reasons, but also to get a little more mindshare at ISSCC or Fall IDF. When AMD went to quad-core, they radically changed their circuit design, going from a single clock distribution network to five, with three different voltage planes and dynamic FIFOs between the clock domains. It is guaranteed that Intel will take similar measures to control power and increase performance. The most interesting question is what sort of dynamic clocking techniques Intel will use to improve single threaded performance. It’s no secret that Penryn can already dynamically increase frequency and voltage on one core when the other is idle, and it is clear that this would be even more advantageous in a quad-core design, which would have much more thermal headroom when running single threaded code. All these questions and more will give us something to look forward to later in the year when Gainestown arrives and Intel goes into greater detail on the circuit techniques used to deliver Nehalem.
Singhal, Ronak. Inside Intel Next Generation Nehalem Microarchitecture. Intel Developer Forum, April 1, 2008.
Kanter, David. The Common System Interface: Intel’s Future Interconnect. http://www.realworldtech.com/page.cfm?ArticleID=RWT082807020032. August 28, 2007.
Bannon, Peter. Alpha 21364 (EV7). http://www.eecs.umich.edu/vlsi_seminar/f01/slides/bannon.pdf.
Pase, D. and Eckl, M. Performance of the AMD Opteron LS21 for IBM BladeCenter. ftp://ftp.software.ibm.com/eserver/benchmarks/wp_ls21_081506.pdf. August 2006.
DeMone, Paul. Alpha EV8 (Part 2): Simultaneous Multi-Threat. http://www.realworldtech.com/page.cfm?ArticleID=RWT122600000000. December 26, 2000.
Stackhouse, B. et al. A 65nm 2-Billion-Transistor Quad-Core Itanium Processor. ISSCC Digest of Technical Papers, pp. 92-93, February 2008.
Special thanks for this article go to George Alfs, Nick Knupffer, Ronak Singhal, Bob Valentine, the rest of the Nehalem team and everyone else who helped me along with this article.