Memory Controllers, SSE4.2 and SMT
The integrated memory controller for Nehalem features up to 3 channels of DDR3 memory operating at 1.33GT/s at launch, for a total of 32GB/s peak bandwidth. The memory controller supports both registered and un-registered DDR3, but no FB-DIMMs for the mainstream implementations. FB-DIMM support will likely come with Beckton or Nehalem EX. Each channel of memory can operate independently and the controller services requests out-of-order to minimize latency. To take advantage of this 4x increase in memory bandwidth, each core supports up to 10 data cache misses and 16 total outstanding misses. In comparison, the Core 2 could have 8 data cache misses and 14 total misses in-flight.
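The bandwidth figure and the motivation for more outstanding misses can be checked with some quick arithmetic. The sketch below is only illustrative: it assumes a 64-bit (8-byte) DDR3 channel and a 64-byte cache line, neither of which is stated above, and it borrows a local latency of roughly 60ns as a working estimate. The function names are hypothetical. The second function applies Little's law (bandwidth = bytes in flight / latency) to estimate how much bandwidth a single core could pull with a given number of misses in flight.

```python
def peak_bandwidth_gbs(channels, transfers_per_sec, bytes_per_transfer=8):
    """Peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    bytes_per_transfer=8 assumes a 64-bit DDR3 channel.
    """
    return channels * transfers_per_sec * bytes_per_transfer / 1e9


def littles_law_bandwidth_gbs(outstanding_misses, latency_ns, line_bytes=64):
    """Bandwidth one core can sustain with a fixed number of misses in flight.

    Bytes per nanosecond is numerically equal to GB/s.
    line_bytes=64 assumes a 64-byte cache line.
    """
    return outstanding_misses * line_bytes / latency_ns


# Three channels at 1.33GT/s, as quoted for Nehalem at launch.
nehalem_peak = peak_bandwidth_gbs(3, 1.333e9)

# Ten data cache misses in flight at an assumed ~60ns local latency.
nehalem_core = littles_law_bandwidth_gbs(10, 60.0)

# Eight data cache misses (the Core 2 figure) at the same assumed latency,
# purely to isolate the effect of the deeper miss queue.
core2_style = littles_law_bandwidth_gbs(8, 60.0)

print(f"peak: {nehalem_peak:.1f} GB/s")
print(f"per core, 10 misses: {nehalem_core:.1f} GB/s")
print(f"per core, 8 misses: {core2_style:.1f} GB/s")
```

Under these assumptions the peak works out to about 32GB/s, and a single core with ten misses in flight could draw on the order of 10GB/s, so several cores together can plausibly approach the peak.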
The integrated memory controller substantially improves memory latency (especially relative to FB-DIMM based solutions). The local memory latency for Nehalem is about 60% of the latency for a Harpertown system using the 1.6GT/s front-side bus, which implies the absolute latency is on the order of 60ns (Harpertown is just slightly under 100ns). For two socket implementations of Nehalem, the remote latency is higher, since the memory request and response must go through a QPI link. Remote latency is roughly 95% of a Harpertown system – so even in the worst case, latency will improve. An interesting question is where four socket servers based on Nehalem will fall. It seems likely that these systems will use FB-DIMMs, which implies a latency penalty, but the gap between remote and local latency should still be about the same – remote roughly 30ns slower than local.
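The latency figures above follow from simple ratios. The sketch below rounds Harpertown up to 100ns (the text says slightly under), so the absolute numbers are approximate. The function average_latency_ns is a hypothetical helper showing how the blended latency moves as a smaller share of accesses hit local memory.

```python
# Baseline rounded up from "just slightly under 100ns"; an approximation.
HARPERTOWN_NS = 100.0

local_ns = 0.60 * HARPERTOWN_NS   # local latency, about 60% of Harpertown
remote_ns = 0.95 * HARPERTOWN_NS  # remote latency, about 95% of Harpertown
remote_penalty_ns = remote_ns - local_ns
remote_to_local_ratio = remote_ns / local_ns


def average_latency_ns(local_fraction, local=local_ns, remote=remote_ns):
    """Blended latency when local_fraction of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local + (1.0 - local_fraction) * remote


print(f"local {local_ns:.0f}ns, remote {remote_ns:.0f}ns")
print(f"penalty {remote_penalty_ns:.0f}ns, ratio {remote_to_local_ratio:.2f}")
for fraction in (1.0, 0.5, 0.0):
    print(f"{fraction:.0%} local: {average_latency_ns(fraction):.1f}ns")
```

With the rounded baseline the remote penalty comes out a little above 30ns, and even a placement with no local accesses at all stays below the Harpertown figure.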
As with other systems that use an integrated memory controller and on-die interconnects (EV7, K8, Barcelona, etc.), the memory latency is non-uniform (NUMA). For optimal performance, the operating system must be aware of the differences in latencies and schedule processes that share data on the same socket. While Linux and proprietary operating systems have been NUMA-aware for a long time, Windows Vista is the first client operating system from Microsoft to make any NUMA optimizations.
The difference in local versus remote memory latency is roughly 1.5X for Nehalem. Measurements for the K8 show that the NUMA factor (i.e. remote latency divided by local) is roughly the same for two socket systems. However, for four socket systems, Intel will have an advantage since all memory will be either local (no hops over QPI) or remote (one hop over QPI), while current four socket K8 and Barcelona systems have some memory that is two hops away over HyperTransport. In general, the larger the NUMA factor, the more important it is for software to take memory locality into account. For reference, one of the first processors with an integrated memory controller and on-die interconnects (the EV7) had NUMA factors from 1.86-5.21 (1-8 hops away) in a 64P system.
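The hop-count argument for four socket systems can be illustrated with a small graph search. The two topologies below are assumptions made for the sake of the example: a fully connected layout, which matches the "at most one hop" description for Nehalem, and a square (ring) layout standing in for a four socket K8 or Barcelona system. Real systems may wire their links differently.

```python
from collections import deque


def max_hops(adjacency):
    """Worst-case hop count between any two sockets, via breadth-first search."""
    worst = 0
    for start in adjacency:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        worst = max(worst, max(dist.values()))
    return worst


# Assumed topology: every socket has a direct link to every other socket.
fully_connected = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}

# Assumed topology: sockets arranged in a square with no diagonal links.
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

print("fully connected:", max_hops(fully_connected), "hop(s) worst case")
print("square:", max_hops(square), "hop(s) worst case")
```

In the square layout the socket on the opposite corner is two hops away, which is the case that raises the worst-case NUMA factor.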
Core Wide Changes
Nehalem also includes an instruction set extension whose effects span the entire core pipeline, since the new instructions impact microcode. This extension is SSE4.2, which adds several instructions for string manipulation, a CRC instruction and a popcount. The string instructions are all microcoded and will only show a small performance gain. The CRC instruction is used for calculating checksums, which is useful for storage and networking, and provides fairly substantial benefits – in the range of 6-18X for the code snippets that Intel demonstrated. Of course, the overall speedup will be much smaller, since Intel’s examples just deal with the tightest inner loop.
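To give a sense of the work the CRC and popcount instructions replace, here is a minimal software sketch. The SSE4.2 CRC instruction uses the CRC-32C (Castagnoli) polynomial. The bit-at-a-time loop below is a plain reference version, not Intel's demonstration code, and the initial and final inversion follow the usual CRC-32C software convention. The function names are hypothetical.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bit-at-a-time CRC-32C; eight shift/xor steps per input byte."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


def popcount(value: int) -> int:
    """Number of set bits in a non-negative integer."""
    if value < 0:
        raise ValueError("popcount is only defined here for non-negative values")
    return bin(value).count("1")


checksum = crc32c(b"123456789")  # standard CRC-32C check string
print(f"crc32c = 0x{checksum:08X}")
print("popcount(0b10110010) =", popcount(0b10110010))
```

The inner loop performs eight dependent shift and xor steps for every byte, which is the kind of tight loop a dedicated instruction can collapse, and it is consistent with large speedups on the loop itself but smaller gains for a whole application.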
The last major change to the system architecture for Nehalem is the return of Simultaneous Multi-Threading (SMT), which was first discussed in the context of the EV8, but first appeared for the 130nm P4. Although SMT is, strictly speaking, a core change rather than a system level change, its implications span all major aspects of Nehalem, so it is best to mention it upfront. Additionally, SMT has implications for system architecture. Given two otherwise identical microprocessors, one with SMT and one without, the SMT-enabled CPU will sustain more outstanding misses to memory and use more bandwidth. As a result, Nehalem is very likely designed with the requirements of SMT in mind. For instance, variants of Nehalem which do not use SMT (notebook and desktop processors most likely) may not really need the full three channels of memory.
One interesting issue is why the Core 2 didn’t use SMT. Certainly it was possible, as Nehalem shows. SMT increases performance in a very power efficient way, which is a huge win, and the software infrastructure was already there. There are two possible answers. First, Core 2 might not have had enough memory and inter-processor bandwidth to really take advantage of SMT for some workloads. In general, SMT substantially increases the amount of memory level parallelism (MLP) in a system, but that could be problematic when the system is already bottlenecked on memory bandwidth.
A much more plausible explanation is that while designing an SMT processor is relatively easy, the validation is extremely difficult. Supposedly Willamette, the 180nm P4, actually had all the necessary circuitry for SMT present, but it stayed disabled until the tail end of the Northwood 130nm generation because SMT was so difficult to validate. More importantly, almost all of the experience with designing, validating and debugging SMT processors resides with Intel’s Hillsboro design team, rather than the group in Haifa. Thus a decision to avoid SMT for the Core 2 makes a lot of sense from a risk management perspective.