The biggest change to the mainstream server landscape in the last 5 years was the introduction of Nehalem-EP in early 2009, which brought a shared L3 cache, power gating, QPI, integrated memory controllers and simultaneous multi-threading to the table. This single-handedly erased AMD’s technical advantages. When Westmere-EP was released in 2010, it was certainly an improvement, but an incremental one because it was compatible with the existing platform. The performance increased by around 20-40%, although any workloads that could use the new AES instructions saw larger gains.
The next real change was AMD’s Orochi, based on the novel Bulldozer microarchitecture. Instead of trying to compete with Intel by ratcheting up the performance of each core, AMD attempted to change the problem. In theory, Bulldozer was intended to give up a small amount of single-threaded performance in exchange for substantially more cores and superior parallel performance. For certain workloads, this approach is viable. However, the execution did not really match AMD’s goals. Orochi was generally an improvement (and introduced the idea of programmable TDP caps), but it has not come close to seriously challenging Westmere-EP on performance or efficiency and did not substantially alter the mainstream server market.
For server workloads, Sandy Bridge-EP and the Romley platform changes are nearly as dramatic a shift as Nehalem was in 2009. The performance gains in our tests ranged from 55% to north of 100%, although the comparison uses a 2.93GHz Westmere-EP rather than the highest frequency 3.46GHz part. Even accounting for this gap, the benchmarks are quite remarkable considering that the number of cores only increased by 33%. The major factors in the performance gain are the Sandy Bridge core, the L3 cache and ring interconnect, and turbo mode.
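To put the core-count caveat in perspective, the overall speedup can be divided by the core-count increase (six cores on Westmere-EP versus eight on Sandy Bridge-EP) to estimate the per-core throughput gain. A minimal sketch of that arithmetic, with an illustrative function name of our own choosing:

```python
# Implied per-core throughput gain: overall speedup divided by the
# core-count increase (6 cores on Westmere-EP -> 8 on Sandy Bridge-EP).
def per_core_gain(overall_speedup, old_cores=6, new_cores=8):
    """Speedup remaining after normalizing for the extra cores."""
    return overall_speedup / (new_cores / old_cores)

print(f"{per_core_gain(1.55):.2f}x")  # low end of our results: ~1.16x per core
print(f"{per_core_gain(2.00):.2f}x")  # high end: ~1.50x per core
```

Even at the low end of our results, each Sandy Bridge core delivers roughly 16% more throughput than its predecessor, before accounting for the frequency deficit of the 2.93GHz comparison part.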
Intel’s marketing materials capture a broader slice of the market and a true comparison of top-bin processors. Workloads such as virtualization and life sciences show a milder 30% improvement, but quite a few areas see gains of 50-70%. Generally, scientific workloads realize the greatest benefits, towards the upper end of this range, while classic commercial server benchmarks are closer to 50%.
More significant than the performance are the power efficiency and flexibility of Sandy Bridge-EP. The dynamic range for platform power is far greater than ever before, which demonstrates the pervasive nature of power management. In our tests, the highest average power consumption for the Sandy Bridge-EP system was 564W (for wrf in SPECfp), compared to 351W for Westmere-EP (on hmmer in SPECint). The idle power for these two systems was 109W and 162W respectively. For Westmere-EP, the difference between peak and idle system power is almost exactly the total TDP of the two processors, 190W. In contrast, the newer Romley platform has a gap between peak and idle power that is 1.69× the TDP of its processors. The ratio of the peak-to-idle power gap to the processor TDP is a good approximation of the dynamic range and clearly shows the strides Intel has made in platform power management.
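The dynamic-range approximation above is simple arithmetic, sketched below under the assumption of 2×95W TDPs for the Westmere-EP parts (matching the article’s 190W total) and 2×135W for the Sandy Bridge-EP parts; the function name is our own:

```python
# Dynamic range approximation: (peak - idle) platform power divided by
# total processor TDP. Assumed TDPs: 2 x 95W (Westmere-EP) and
# 2 x 135W (Sandy Bridge-EP).
def dynamic_range(peak_w, idle_w, tdp_per_socket_w, sockets=2):
    """Ratio of the peak-to-idle power gap to total processor TDP."""
    return (peak_w - idle_w) / (tdp_per_socket_w * sockets)

westmere = dynamic_range(peak_w=351, idle_w=162, tdp_per_socket_w=95)
romley = dynamic_range(peak_w=564, idle_w=109, tdp_per_socket_w=135)
print(f"Westmere-EP: {westmere:.2f}x TDP")  # ~0.99x, i.e. roughly the 190W TDP
print(f"Romley:      {romley:.2f}x TDP")    # ~1.69x TDP
```

A ratio near 1.0 means the rest of the platform barely modulates its power with load; Romley’s 1.69× indicates that memory, I/O and the voltage regulators now swing substantially as well.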
The contributions to the platform power come from many subtle aspects of the system. The on-die I/O is a big factor, since unused PCI-E lanes can be power gated off. The new voltage regulator specification includes phase shedding, which increases efficiency for light workloads. The memory controller more aggressively closes pages in DRAM and can use CKE power-down to conserve power. The most obvious, though, is the variable frequency L3 cache and ring bus, which can scale with activity.
These advances in computational performance and power efficiency are significant, but fail to capture the impact of the integrated I/O. On paper, Sandy Bridge-EP’s I/O bandwidth is 3× higher than Westmere-EP’s (which is limited to ~25GB/s per socket by QPI), although in some systems the difference may be closer to 4× or 5×. It is far too expensive to construct a storage array or network that could come close to saturating the 2×80GB/s of PCI-E bandwidth in our test system. The initial applications will probably be in HPC, with GPUs and InfiniBand; but even there, that is an incredible amount of I/O connectivity. Fortunately, Intel is planning to roll out 10GbE in tandem with Romley, which should drive networking prices down to reasonable levels and ramp adoption.
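The ~80GB/s per-socket figure follows from the PCI-E 3.0 link parameters: 40 lanes per socket at 8 GT/s with 128b/130b encoding, counting both directions of the full-duplex links. A back-of-the-envelope sketch (the function name is illustrative):

```python
# Back-of-the-envelope PCI-E 3.0 bandwidth for Sandy Bridge-EP's 40 lanes
# per socket: 8 GT/s per lane with 128b/130b encoding, full duplex.
def pcie3_bandwidth_gbs(lanes):
    """Aggregate bandwidth in GB/s, counting both directions."""
    per_lane_gbs = 8e9 * (128 / 130) / 8 / 1e9  # ~0.985 GB/s each direction
    return lanes * per_lane_gbs * 2             # both directions of full duplex

per_socket = pcie3_bandwidth_gbs(40)
print(f"Per socket:  {per_socket:.0f} GB/s")     # ~79 GB/s, i.e. ~80GB/s
print(f"Two sockets: {2 * per_socket:.0f} GB/s")
```

Note this counts raw link bandwidth after encoding overhead; packet headers and flow control reduce the achievable payload rate somewhat further.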
Several of these features make Sandy Bridge-EP and Romley well suited to the large scale data centers that underpin cloud computing at companies like Microsoft, Google, Facebook and Amazon. The platform power management can specify a rack or server power limit, which substantially boosts data center efficiency (in all fairness though, AMD had this feature earlier with Bulldozer). The power management policies are more agile and can allocate power based on the workload, favoring the cores, caches, memory or I/O for different situations. Additionally, high performance I/O is critical because of the massive internal networking that is required for cloud infrastructure.
From a competitive standpoint, this cements Intel’s position as the clear leader in the bulk of the server and data center market. Previously, AMD was behind, but within reach. That is no longer the case, and AMD’s server roadmap has been updated to accept this unfortunate reality. Rather than releasing a new server platform with integrated PCI-E 3.0, AMD will continue with the existing platform through 2013, while enhancing the CPU cores. This is a tacit acknowledgement that AMD cannot directly compete with Sandy Bridge-EP for most of the market, given a disadvantage in process technology and a less mature architecture. Instead, AMD’s hope is to change the game with high density, low-power systems built around the newly acquired SeaMicro, while spending more engineering resources to overhaul the mainstream server line for 2014 and beyond.
Overall, Sandy Bridge-EP is a huge step forward for the server market compared to the previously available platforms. The comprehensive improvements are a strong motivation to upgrade; there should be benefits for any sort of workload, whether it is limited by computation, I/O bandwidth, memory or power consumption. In fact, the only real drawback is that the fastest versions are modestly more expensive than previous generations. The top-bin parts are around $2,000, whereas the fastest Westmere-EP was $1,666, although this is easily justified by the increase in performance and energy efficiency. Looking forward, future versions of the Romley platform will bring these benefits (particularly the integrated I/O) to 4-socket servers, bifurcating the QPI links to connect all sockets directly. The 22nm Ivy Bridge-EP is socket-compatible and will provide a modest increase in performance in 2013. As Sandy Bridge-EP demonstrates, Intel can design phenomenal products, and hopefully the innovations will continue. Even if AMD is unlikely to be competitive before 2014, there is still the threat from alternative architectures, such as GPUs for scientific computing and low-power, scale-out systems from AMD, Calxeda and Applied Micro.