TLBs, Page Tables and Synchronization
Along with a new cache hierarchy, Nehalem also features some important changes to the TLB hierarchy, which caches the virtual to physical address mappings. Core 2 had a very interesting TLB arrangement. The L1 DTLB (which Intel sometimes referred to as the micro-TLB) was extremely small and only used for loads; it featured 16 entries for 4KB pages and 16 entries for larger pages, and each was 4 way associative. The L2 DTLB was larger and serviced all memory accesses (loads that miss in the L1 DTLB and all stores). It offered 256 entries for 4KB pages and 32 entries for larger (2M/4M) pages, and again, both were 4 way associative. Since the cache hierarchy contains much more data for Nehalem, the TLBs also needed to be enlarged.
Figure 4 – Memory Subsystem Comparison
Nehalem replaces the pre-existing TLBs in Core 2 with a true two level TLB hierarchy that is dynamically allocated between the active thread contexts for SMT. The first level DTLB in Nehalem services all memory accesses and contains 64 entries for 4KB pages and 32 entries for larger 2M/4M pages and keeps the 4 way associativity. Nehalem also features a new second level unified (i.e. both instructions and data) TLB that contains 512 entries for small pages only, and is again 4 way associative.
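To make the two level lookup concrete, here is a toy model in Python. It is only a sketch: the class and function names are hypothetical, and the replacement policy (LRU), the indexing (virtual page number modulo the number of sets) and the fill policy (a page walk fills both levels) are assumptions for illustration, not disclosed details of Nehalem. Large pages and the sharing between SMT threads are ignored.

```python
from collections import OrderedDict


class SetAssociativeTLB:
    """Toy set-associative TLB. LRU replacement is an assumption."""

    def __init__(self, entries, ways):
        self.ways = ways
        self.num_sets = entries // ways
        # One ordered dict per set: oldest entry first, newest last.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def lookup(self, vpn):
        """Return the cached frame number for vpn, or None on a miss."""
        tlb_set = self.sets[vpn % self.num_sets]
        if vpn in tlb_set:
            tlb_set.move_to_end(vpn)  # mark as most recently used
            return tlb_set[vpn]
        return None

    def fill(self, vpn, pfn):
        """Insert a translation, evicting the LRU entry if the set is full."""
        tlb_set = self.sets[vpn % self.num_sets]
        if vpn in tlb_set:
            tlb_set.move_to_end(vpn)
        elif len(tlb_set) >= self.ways:
            tlb_set.popitem(last=False)  # drop the least recently used entry
        tlb_set[vpn] = pfn


def translate(vpn, l1, l2, page_table):
    """Look up vpn in L1, then L2, then fall back to a (modelled) page walk."""
    pfn = l1.lookup(vpn)
    if pfn is not None:
        return pfn, "L1 hit"
    pfn = l2.lookup(vpn)
    if pfn is not None:
        l1.fill(vpn, pfn)
        return pfn, "L2 hit"
    pfn = page_table[vpn]  # stand-in for the hardware page walk
    l2.fill(vpn, pfn)
    l1.fill(vpn, pfn)
    return pfn, "page walk"


# Sizes for 4KB pages as given in the text: 64-entry L1 DTLB, 512-entry L2 TLB.
l1_dtlb = SetAssociativeTLB(entries=64, ways=4)
l2_tlb = SetAssociativeTLB(entries=512, ways=4)
```

In this model, a page that has been pushed out of the small first level by conflicting accesses can still hit in the second level, which is the point of having the larger backing TLB.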
One of the stark differences between Nehalem and the Core 2 TLBs is the degree to which they cover the caches. In Core 2, there was up to 6MB of L2 cache and the TLBs of the two cores together could translate 2176KB of memory using the smaller 4KB pages (most applications do not use large pages), effectively covering roughly half or a third of the full L2 cache (depending on whether we are discussing the 4MB Merom or the 6MB Penryn). In contrast, each Nehalem core has 576 entries for the small pages, or 2304 for the whole chip. This many TLB entries can translate 9216KB, more than enough to cover the whole 8MB L3 cache using small pages alone.
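The coverage figures above follow from simple arithmetic: entries per core, times the number of cores, times the 4KB page size. A short Python sketch reproduces them. The function name is hypothetical, and the core counts (two for Core 2, four for Nehalem) and the 4MB Merom L2 size are assumptions chosen to match the totals quoted here.

```python
PAGE_KB = 4  # small page size in KB


def tlb_reach_kb(entries_per_core, cores, page_kb=PAGE_KB):
    """Memory (in KB) that all small-page TLB entries can map at once."""
    return entries_per_core * cores * page_kb


# Core 2: 16-entry L1 DTLB + 256-entry L2 DTLB per core, two cores.
core2_reach = tlb_reach_kb(16 + 256, cores=2)
# Nehalem: 64-entry L1 DTLB + 512-entry L2 TLB per core, four cores.
nehalem_reach = tlb_reach_kb(64 + 512, cores=4)

# Fraction of the last-level cache that the TLBs can cover.
merom_cover = core2_reach / (4 * 1024)      # 4MB L2 (assumed Merom size)
penryn_cover = core2_reach / (6 * 1024)     # 6MB L2
nehalem_cover = nehalem_reach / (8 * 1024)  # 8MB L3

print(f"Core 2 reach:  {core2_reach} KB")
print(f"Nehalem reach: {nehalem_reach} KB")
print(f"Coverage: Merom {merom_cover:.2f}, Penryn {penryn_cover:.2f}, "
      f"Nehalem {nehalem_cover:.2f}")
```

A coverage value above 1.0 means the TLBs can map more memory than the cache holds, which is the case only for Nehalem in this comparison.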
Nehalem’s TLB entries have also changed subtly by introducing a “Virtual Processor ID” or VPID. Every TLB entry caches a virtual to physical address translation for a page in memory, and that translation is specific to a given process and virtual machine. Intel’s older CPUs would flush the TLBs whenever the processor switched between the virtualized guest and the host instance, to ensure that processes only accessed memory they were allowed to touch. The VPID tracks which VM a given translation entry in the TLB is associated with, so that when a VM exit and re-entry occurs, the TLBs do not have to be flushed for safety. Instead, if a process tries to access a translation that it is not associated with, it will simply miss in the TLB, rather than making an illegal access to the page tables. The VPID helps virtualization performance by lowering the overhead of VM transitions; Intel estimates that the latency of a round trip VM transition in Nehalem is 40% compared to Merom (i.e. the 65nm Core 2) and about a third lower than the 45nm Penryn.
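The difference between flushing and tagging can be sketched with two toy TLB models in Python. This is purely illustrative: the class names, the use of a dictionary as the TLB, and the choice of VPID 0 for the host are assumptions for the example, not a description of the hardware structures.

```python
class UntaggedTLB:
    """Pre-VPID behaviour: every VM transition flushes all translations."""

    def __init__(self):
        self.entries = {}  # virtual page number -> physical frame number

    def fill(self, vpn, pfn):
        self.entries[vpn] = pfn

    def lookup(self, vpn):
        return self.entries.get(vpn)  # None models a TLB miss

    def vm_transition(self):
        self.entries.clear()  # flush for safety


class VPIDTaggedTLB:
    """VPID behaviour: entries carry a tag, so no flush is needed."""

    def __init__(self):
        self.entries = {}      # (vpid, virtual page number) -> frame number
        self.current_vpid = 0  # assumption for this sketch: 0 is the host

    def fill(self, vpn, pfn):
        self.entries[(self.current_vpid, vpn)] = pfn

    def lookup(self, vpn):
        # A translation tagged with another VPID simply misses.
        return self.entries.get((self.current_vpid, vpn))

    def vm_transition(self, vpid):
        self.current_vpid = vpid  # switch context, keep all entries


tlb = VPIDTaggedTLB()
tlb.vm_transition(vpid=1)      # enter guest 1
tlb.fill(vpn=0x10, pfn=0xA0)   # guest caches a translation
tlb.vm_transition(vpid=0)      # VM exit to the host
host_view = tlb.lookup(0x10)   # miss: entry belongs to VPID 1
tlb.vm_transition(vpid=1)      # VM re-entry
guest_view = tlb.lookup(0x10)  # hit: translation survived the round trip
```

With the untagged model, the same round trip would leave the guest with an empty TLB and a string of misses to refill it, which is where the saved latency comes from.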
Another virtualization tweak in Nehalem is the Extended Page Tables, which actually eliminate many VM transitions (rather than just lowering the latency as the VPID does). The normal page tables map guest virtual addresses to guest physical addresses; however, for a virtualized system, there is also a translation from guest physical to host physical addresses. The EPT manages those mappings from guest physical to host physical. When a page fault happens on the guest physical to host physical mapping, Nehalem will simply walk the EPTs, whereas earlier Intel designs (and AMD designs before Barcelona) would need the hypervisor to service the page fault. This eliminates a lot of unnecessary VM exits.
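The two stages of translation can be illustrated with a minimal Python sketch. The table contents and function names are made up for the example, and each table is flattened into a single dictionary. A real walk is more involved, since the guest page tables themselves live in guest physical memory and each level must also be translated through the EPT.

```python
PAGE_SIZE = 4096


def translate(addr, table):
    """Map an address through one flat page-number table, keeping the offset."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in table:
        raise KeyError(f"no mapping for page {page:#x}")
    return table[page] * PAGE_SIZE + offset


def guest_virtual_to_host_physical(gva, guest_page_table, ept):
    """Stage 1: guest virtual -> guest physical (guest's own page tables).
    Stage 2: guest physical -> host physical (the EPT)."""
    gpa = translate(gva, guest_page_table)
    return translate(gpa, ept)


# Hypothetical mappings, for illustration only.
guest_page_table = {0x10: 0x2}  # guest virtual page -> guest physical page
ept = {0x2: 0x7F}               # guest physical page -> host physical page

hpa = guest_virtual_to_host_physical(0x10123, guest_page_table, ept)
print(hex(hpa))
```

The point of the EPT is that the second stage is performed by the hardware page walker, so the hypervisor does not have to step in for each such translation.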
Nehalem also lowers the latency for synchronization primitives such as LOCK, XCHG and CMPXCHG, which are necessary for multi-threaded programming. Intel claims that the latency for LOCK CMPXCHG instructions (which serialize the whole pipeline) is 20% of what it was for the P4 (which was absolutely horrible) and about 60% of the Core 2. While the latency is lower, the behavior is still similar to prior generations; LOCK instructions are not pipelined, although younger operations can execute ahead of a LOCK instruction.