Westmere Arrives
Last year was a turning point for Intel in the two-socket server market, with the release of the 45nm Nehalem family of quad-core microprocessors. The Nehalem microarchitecture was a substantial redesign of the previous generation, with higher performance cores and more refined and elegant system level integration. The system level features in Nehalem, such as the three-channel integrated DDR3 memory controller and the QuickPath Interconnect, were long overdue and critical for success in the server market. AMD demonstrated the value of these features back in 2003 with the K8. With an extra 6 years of development, Intel’s system architecture is more refined and higher performance. Nehalem also had the luxury of focusing on a new microarchitecture to substantially improve performance, e.g. by adding simultaneous multi-threading. The combination of an improved system architecture and microarchitecture proved to be a decisive advantage for Intel and restored its primacy in the two-socket server market (the 4+ socket Nehalem-EX will come out later this month). The next generation, Westmere, eschews major microarchitecture changes and migrates to a new manufacturing process. As a result, the performance gains are primarily due to additional cores.

Figure 1 – Westmere Die Photo, Courtesy Intel
Westmere is a 32nm family of microprocessors that is derived from Nehalem. For servers, Westmere-EP integrates up to 6 cores with minor microarchitectural enhancements and 12MB of L3 cache into 248mm², as shown above. Westmere is socket compatible with Nehalem, and inherits the associated system infrastructure including the 55xx chipsets. No doubt this smooth upgrade path is a tremendous relief to OEMs, board designers, resellers and almost the entire ecosystem. The planned upgrade from the quad-core Nehalem to the 6-core Westmere is the main reason why Nehalem featured 3 channels of DDR3 memory. Two channels probably would have sufficed for Nehalem itself, but would have left Westmere unbalanced, since Intel planned to maintain the same amount of last level cache per core (2MB).
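The balance argument can be made concrete with a little arithmetic. The sketch below uses memory channels per core as a crude proxy for per-core bandwidth; it ignores memory speed, cache behavior and workload mix, and the two-channel configurations are hypothetical designs that were never built.

```python
from fractions import Fraction


def channels_per_core(channels: int, cores: int) -> Fraction:
    """Memory channels available per core, as an exact fraction."""
    return Fraction(channels, cores)


# (channels, cores); the two-channel rows are hypothetical alternatives.
configs = {
    "Nehalem, 3 channels": (3, 4),
    "Westmere, 3 channels": (3, 6),
    "Nehalem, 2 channels (hypothetical)": (2, 4),
    "Westmere, 2 channels (hypothetical)": (2, 6),
}

for name, (channels, cores) in configs.items():
    print(f"{name}: {channels_per_core(channels, cores)} channels per core")
```

By this rough measure, a 6-core Westmere with three channels ends up where a two-channel quad-core Nehalem would have been (one channel for every two cores), while a two-channel Westmere would have dropped to one channel for every three cores.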
Security, Page Tables and Virtualization
The architectural improvements in Westmere focus on security, including 6 new instructions (AES-NI) for accelerating AES cryptography plus a carryless multiply for finite field arithmetic (PCLMULQDQ). AES-NI avoids the lookup tables that are common to many AES implementations and are the source of a vulnerability that exploits the data dependent nature of a table lookup. A user-space side channel attack running on the same physical processor can analyze cache access timing to rapidly derive the secret keys. The latency of AES-NI is data independent, thus avoiding this vulnerability. While there is more to cryptography than just AES, this is likely to be a first step and hopefully more comprehensive support will follow in future products. Westmere also includes Trusted Execution Technology (TXT), which verifies firmware and hypervisors before they can begin executing.
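For readers unfamiliar with carryless multiplication, the following is a minimal software sketch of the operation, not a description of how PCLMULQDQ is implemented in hardware. It is an ordinary shift-and-add multiply with the additions replaced by XOR, which is exactly polynomial multiplication over GF(2). The function names are illustrative, and the reduction step uses the 8-bit AES field polynomial (0x11B) only because it is a small, well known example; the instruction itself works on 64-bit operands and produces a 128-bit product.

```python
def clmul(a: int, b: int) -> int:
    """Carryless product of two non-negative integers (XOR instead of add)."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # "add" the shifted multiplicand without carries
        a <<= 1
        b >>= 1
    return result


def gf256_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Multiply two bytes in GF(2^8), reducing modulo the AES polynomial."""
    product = clmul(a, b)
    # Cancel every bit above bit 7, highest first, by XORing in the polynomial.
    for bit in range(product.bit_length() - 1, 7, -1):
        if (product >> bit) & 1:
            product ^= poly << (bit - 8)
    return product


print(bin(clmul(0b101, 0b11)))     # carryless: 0b101 ^ 0b1010
print(hex(gf256_mul(0x57, 0x83)))  # worked example from the AES specification
```

The second call reproduces the multiplication example given in the AES specification, {57} times {83} equals {c1}. Note that neither loop depends on a table indexed by secret data, which is the property that matters for the timing attack described above, although a Python sketch makes no constant-time guarantees of its own.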
Westmere supports 1GB pages, a feature first seen in AMD’s Barcelona, which is useful for databases and high performance computing applications.
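The benefit comes from how few translations a large page needs. As a rough illustration, the sketch below counts the page table entries required to map a hypothetical 64GB database buffer pool at each page size; the buffer size is an assumption chosen for the example, not a figure from Intel.

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages (and thus translations) needed to map a region."""
    return -(-region_bytes // page_bytes)  # ceiling division


region = 64 * GIB  # hypothetical buffer pool size
for label, size in (("4KB", 4 * KIB), ("2MB", 2 * MIB), ("1GB", GIB)):
    print(f"{label} pages: {pages_needed(region, size):,} translations")
```

With 4KB pages the region needs about 16.8 million translations, with 2MB pages 32,768, and with 1GB pages only 64, which is a small enough number that the hot translations for the region can plausibly stay resident in the TLB.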
Another long overdue improvement to the page tables is the Processor Context ID (PCID). The PCID is a field in each TLB entry that associates a given page with a process. Previously, Intel’s TLB could only contain entries from a single process and whenever the CR3 register was written (e.g. on a context switch), the TLB was flushed. The PCID lets pages from different processes safely inhabit the TLB together, so that CR3 writes no longer flush the TLB. Whenever a process tries to access a page in memory, the PCID is checked to determine whether the page is actually mapped into the process’s address space; if the PCID does not match, the access is treated as a TLB miss. This is very much analogous to Intel’s VPID, which enables the TLB to contain pages from different virtual machines and avoid TLB flushes during VM transitions.
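A toy model makes the effect of tagging easier to see. The sketch below is purely illustrative: the ToyTLB class, its unlimited capacity and the access pattern are all assumptions, and real TLBs have finite size, replacement policies and global pages that are ignored here. It compares an untagged TLB, flushed on every simulated CR3 write, with one whose entries carry a PCID.

```python
class ToyTLB:
    """Toy TLB model; capacity, eviction and real hardware details are ignored."""

    def __init__(self, tagged: bool):
        self.tagged = tagged   # True: entries are tagged with a PCID
        self.entries = set()   # set of (pcid, virtual page number)
        self.current_pcid = 0
        self.misses = 0

    def switch_to(self, pcid: int) -> None:
        """Model the CR3 write performed on a context switch."""
        if not self.tagged:
            self.entries.clear()  # untagged TLB: every entry is flushed
        self.current_pcid = pcid

    def access(self, vpn: int) -> None:
        """Look up a page; a tag or page mismatch counts as a miss."""
        key = (self.current_pcid, vpn)
        if key not in self.entries:
            self.misses += 1
            self.entries.add(key)  # stands in for the page walk and refill


def run(tagged: bool) -> int:
    """Alternate between two processes that each touch the same four pages."""
    tlb = ToyTLB(tagged)
    for _ in range(3):
        for pcid in (1, 2):
            tlb.switch_to(pcid)
            for vpn in range(4):
                tlb.access(vpn)
    return tlb.misses


print("untagged misses:", run(tagged=False))
print("tagged misses:  ", run(tagged=True))
```

In this model the untagged TLB misses on every page after every switch (24 misses), while the tagged TLB only misses the first time each process touches each page (8 misses). Because the lookup key includes the PCID, two processes using the same virtual page number never see each other's translations.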
Virtualization is another area of improvement for Westmere, which can virtualize real mode, so that 16-bit guest OSes can be used. The round-trip VM latency dropped by 12% compared to Nehalem. Westmere also has a new technique for detecting when the PAUSE instruction is used in a spin loop in a guest process. When a spin loop is detected, the hypervisor may deschedule the guest to avoid wasting cycles that would otherwise be spent in the spin loop. The PAUSE instruction is used for other beneficial purposes (e.g. temporarily pausing while an I/O is initiated) and not just spin loops, so the detection mechanism must avoid false positives, which could hurt performance.
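Intel has not been described here as using any particular detection logic, so the following is only a sketch of one plausible heuristic, under stated assumptions: a run of PAUSE instructions counts as a spin loop when successive PAUSEs are closely spaced and the run as a whole lasts longer than some window. The function name, the timestamps and both thresholds are hypothetical values chosen for illustration.

```python
def spin_loop_detected(pause_times, gap, window):
    """Return True if PAUSEs spaced at most `gap` apart span more than `window`.

    pause_times: ascending timestamps (e.g. cycle counts) of PAUSE executions.
    gap:         largest spacing still considered part of the same loop.
    window:      how long a loop may spin before it is flagged.
    """
    if not pause_times:
        return False
    loop_start = prev = pause_times[0]
    for t in pause_times[1:]:
        if t - prev > gap:
            loop_start = t  # too far apart: treat this PAUSE as a new loop
        elif t - loop_start > window:
            return True     # tight run of PAUSEs has lasted too long
        prev = t
    return False


GAP, WINDOW = 128, 4096  # hypothetical thresholds, in cycles

tight_spin = list(range(0, 5000, 40))  # PAUSE every 40 cycles
isolated = [0, 1000, 2000, 9000]       # occasional PAUSEs, e.g. around I/O
short_burst = [0, 40, 80, 120]         # brief wait that resolves quickly

print(spin_loop_detected(tight_spin, GAP, WINDOW))
print(spin_loop_detected(isolated, GAP, WINDOW))
print(spin_loop_detected(short_burst, GAP, WINDOW))
```

Only the sustained tight loop is flagged; isolated PAUSEs and short bursts are not. This shows why the thresholds matter: a window that is too small would deschedule guests that are about to acquire a lock, while one that is too large wastes cycles on a lock holder that is not running.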