In 2005, when Intel’s roadmap took a right-hand turn, the company was in full-scale retreat in the server market due to its own product weaknesses and the strength of AMD’s Opteron offerings, particularly the highly integrated dual cores. Intel’s share of the dual processor server market had dipped substantially, and its share of larger servers (four sockets and up) had positively cratered, dropping to approximately 50%.
While the Core 2 Duo was the first visible result of the right-hand turn, the real culmination is Nehalem, which is aimed specifically at the server market – where AMD had made the most inroads. Nehalem is also one of the first designs to truly take advantage of Intel’s 45nm process (Atom got there slightly quicker and had some 45nm-specific optimizations).
While it’s beyond the scope of this review to discuss all of the changes in Nehalem, it is productive to mention the most significant departures from the previous generation. For those interested in a full description of Nehalem, I recommend two of my previous articles. The first article is a detailed analysis of CSI – Intel’s cache coherent interconnect that is replacing the front-side bus – based on filed patents. The second article focuses on the Nehalem architecture itself in great detail. At some point in the future, there may be a third article discussing the circuit level techniques used in Nehalem that were disclosed at IDF, ISSCC and IEDM.
The most substantial architectural differences between Nehalem and the prior generation Penryn are:
- Integrated quad-core instead of two chips in an MCP
- Simultaneous multi-threading
- Redesigned memory hierarchy with private 256KB L2 cache, shared 8MB L3 cache
- Triple channel DDR3 integrated memory controller
- On-die point-to-point Quick Path Interconnect and new cache coherency protocol (MESIF) instead of front-side bus and MESI protocol
- Power gates to completely shut off all power to cores or uncore when idle (instead of just clock gating, which doesn’t reduce leakage)
- Turbo mode to boost operating frequency based on thermal headroom
- Core improvements such as SSE 4.2, improved branch predictors, TLBs, etc.
Most of these improvements have previously been elaborated on, except for turbo mode. Conceptually, the idea behind turbo mode is that when the processor is operating below its peak power, it will increase the clock speed of the active cores by one or more bins to increase performance. The processor commonly operates below peak power because one or more cores are powered down, or because the active workload draws relatively little power (e.g. no floating point, or few memory accesses). Active cores can increase their clock frequency in relatively coarse speed bins of 133MHz; how many bins are available depends on the SKU, the available power, thermal headroom and other environmental factors.
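To make the bin arithmetic concrete, here is a minimal sketch of turbo mode as a lookup. It is not Intel's actual control algorithm: the function name, the base multiplier of 20 and the table of bins per active core count are all hypothetical values chosen for illustration. Only the 133MHz bin size comes from the description above.

```python
# Toy model of turbo bins. The bin size comes from the text; everything
# else (names, multiplier, bin table) is a hypothetical illustration.
BIN_MHZ = 400 / 3  # one speed bin, roughly 133.33 MHz


def turbo_frequency_mhz(base_multiplier, active_cores, max_bins, has_headroom):
    """Return the clock in MHz for the active cores.

    max_bins maps the number of active cores to the number of extra bins
    the SKU allows. has_headroom stands in for the power, thermal and
    environmental checks, which are collapsed into one flag here.
    """
    bins = max_bins.get(active_cores, 0) if has_headroom else 0
    return (base_multiplier + bins) * BIN_MHZ


# Hypothetical SKU: 2.66 GHz base, two bins with one active core, one otherwise.
EXAMPLE_BINS = {1: 2, 2: 1, 3: 1, 4: 1}

one_core = turbo_frequency_mhz(20, 1, EXAMPLE_BINS, has_headroom=True)
all_cores = turbo_frequency_mhz(20, 4, EXAMPLE_BINS, has_headroom=True)
no_headroom = turbo_frequency_mhz(20, 1, EXAMPLE_BINS, has_headroom=False)
```

In this model a single active core gains two bins (about 267MHz) over the base clock, while a fully loaded chip gains one bin, and any loss of headroom drops the part back to its base frequency.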
Accompanying the new CPU and system architecture is, of course, a new platform: the Tylersburg, or Intel 5520, chipset. Now that the memory controller has migrated onto the CPU die itself, the Tylersburg I/O Hub (IOH) is primarily responsible for providing PCI-Express interface connectivity to the CPUs.
Tylersburg features two full-width QPI ports that operate at up to 6.4GT/s and cannot be bifurcated. If two Tylersburgs are used in a system, each IOH will use its QPI ports to connect to a single CPU and to the other IOH. For high performance, Tylersburg includes a 128-entry write cache (which supports only the M, E and I states) to prefetch ownership for inbound writes and to coalesce multiple writes to a single cache line. Full line writes are immediately evicted from the cache to free up space for incoming transactions. Prefetch hints are supported as well, for writing network packets into the coherent memory space.
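The coalescing behavior can be illustrated with a toy model. This is a sketch under stated assumptions rather than a description of the real hardware: the class and method names are hypothetical, the 64-byte line size and the oldest-first eviction on overflow are assumptions, and coherence states and ownership prefetch are not modeled at all. Only the 128 entries, the coalescing of partial writes and the immediate eviction of full lines come from the text.

```python
from collections import OrderedDict

LINE_BYTES = 64   # assumed cache line size
CAPACITY = 128    # entries in the write cache, from the text


class WriteCoalescingCache:
    """Toy write cache that merges partial writes to the same line."""

    def __init__(self, capacity=CAPACITY):
        self.capacity = capacity
        self.lines = OrderedDict()  # line number -> set of byte offsets written
        self.flushed = []           # line numbers sent to memory, in order

    def write(self, addr, size):
        """Record an inbound write that must not cross a line boundary."""
        line, start = divmod(addr, LINE_BYTES)
        if size <= 0 or start + size > LINE_BYTES:
            raise ValueError("write must fit within one cache line")
        written = self.lines.setdefault(line, set())
        written.update(range(start, start + size))
        if len(written) == LINE_BYTES:
            # A fully written line is evicted at once to free its entry.
            del self.lines[line]
            self.flushed.append(line)
        elif len(self.lines) > self.capacity:
            # Assumed policy: on overflow, evict the oldest partial line.
            victim, _ = self.lines.popitem(last=False)
            self.flushed.append(victim)


cache = WriteCoalescingCache()
cache.write(0, 32)    # first half of line 0 stays in the cache
cache.write(32, 32)   # second half completes the line, which is evicted
```

Two 32-byte writes to the same line leave the cache as a single full line write, which is the point of coalescing: memory sees one transaction instead of two partial ones.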
Tylersburg has two PCI-E Gen 2 x16 links and a x4 link operating at up to 5GT/s, 6 Gen 1 lanes at 2.5GT/s and an ESI port (which is PCI-E x4 with some proprietary extensions) that connects to the ICH9/10. Tylersburg also supports the new VT-d2 I/O virtualization extensions with interrupt remapping and several other improvements. Intel’s IOAT2 has also been upgraded to support several 10GbE links.
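The transfer rates quoted for Tylersburg translate into peak bandwidth as follows. The sketch assumes the standard 8b/10b encoding used by PCI-E Gen 1 and Gen 2 (8 payload bits for every 10 bits on the wire) and a full-width QPI link carrying 16 data bits per transfer in each direction. The function names are illustrative, and the results are theoretical peaks before protocol overhead, not measured throughput.

```python
def pcie_mb_per_s(gt_per_s, lanes):
    """Peak one-direction bandwidth in MB/s for a PCI-E Gen 1/2 link.

    Each lane moves one bit per transfer, and 8b/10b encoding leaves
    8 payload bits out of every 10 transferred.
    """
    payload_bits_per_s = gt_per_s * 1e9 * lanes * 8 / 10
    return payload_bits_per_s / 8 / 1e6


def qpi_gb_per_s(gt_per_s, data_bytes_per_transfer=2):
    """Peak one-direction bandwidth in GB/s for a full-width QPI link."""
    return gt_per_s * data_bytes_per_transfer


gen2_x16 = pcie_mb_per_s(5.0, 16)             # one Gen 2 x16 link
gen2_total = pcie_mb_per_s(5.0, 16 + 16 + 4)  # two x16 links plus the x4
gen1_lane = pcie_mb_per_s(2.5, 1)             # a single Gen 1 lane
qpi_link = qpi_gb_per_s(6.4)                  # one QPI port, one direction
```

Under these assumptions a Gen 2 x16 link peaks at 8GB/s in each direction and the 36 Gen 2 lanes together at 18GB/s, against 12.8GB/s in each direction for each QPI port.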