L3 Cache and Ring Interconnect
Since Nehalem, Intel has pursued a more modular, system-on-a-chip-like philosophy, where products may share the same core but have an entirely different system infrastructure. In the past, this differentiation was achieved with a combination of changes to the CPU silicon and the discrete chipset. Such a relationship has always existed between the mainstream x86 cores and the high-end Xeon designs (e.g. Penryn and Dunnington, or Nehalem-EP and Nehalem-EX), with the latter featuring far more cache, coherency bandwidth and reliability features. With the chipset firmly integrated into the CPU for all products, the practice spans Intel’s entire mainstream x86 product line. For each product, Intel will optimize the system architecture for the right balance of performance, power, reliability and cost.
Nearly every aspect of the Sandy Bridge microarchitecture has been redesigned to improve per-core performance and power efficiency. In tandem with these changes to the processor core, the overall system design for products based on Sandy Bridge was rearchitected. The major blocks in Sandy Bridge correspond to three power and frequency domains: the cores and last level cache, the graphics, and the system agent. The first two domains have variable voltage and frequency, while the system agent runs at a fixed frequency. The graphics integration in Sandy Bridge is complicated enough to merit an entire article of its own. Leaving graphics aside for now, the chip-level integration features relevant to the Sandy Bridge microarchitecture are the last level cache (the L3) and the system agent, including power management and the display engine.
Sandy Bridge is tied together with a high bandwidth coherent interconnect that spans the three major domains. Nehalem and Westmere used crossbar interconnects, which are extremely efficient and high bandwidth for a small number of agents – but must be redesigned whenever the number of agents changes. In contrast, Nehalem-EX and Westmere-EX both rely on a ring topology, where the wiring and design effort scale better with the number of agents.
Sandy Bridge also employs a ring interconnect between the cores, graphics, L3 cache and system agent (including the display/media engine). The coherent ring is composed of four different rings: request, snoop, acknowledge and a 32B wide data ring. Together these four rings implement a distributed communication protocol that enforces coherency and ordering. The protocol is an enhanced version of QPI with some additional features. The rings are fully pipelined and run at the core clock and voltage, and bandwidth scales with each additional agent. The scaling is not necessarily perfect, though, because of the topology. As messages travel around the ring, they can block access by other agents – reducing the available bandwidth as the average hop count increases. Rings have very natural ordering properties, which makes design and validation simpler than a more connected topology or on-die packet routing, while preserving many of the advantages (e.g. short wiring runs). It is also easier to reconfigure the number of agents on a ring for different product variants. Each agent on the Sandy Bridge ring includes an interface block that is responsible for the distributed access arbitration. The protocol indicates to each interface one cycle in advance whether the next scheduling slot on the ring is available – and the interfaces simply use the next available scheduling slot for any communications.
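The distributed arbitration scheme can be sketched as a toy simulation. This is a deliberately simplified model, not Intel's actual protocol: slots circulate one stop per cycle, and an interface injects a message only when the slot about to pass it is empty (the "next available scheduling slot"). The six-stop ring and the message format are assumptions for illustration.

```python
# Toy model of distributed ring-slot arbitration (illustrative only).
# Slots advance one stop per cycle; each interface injects into the
# next empty slot that passes it, with no central arbiter.

from collections import deque

NUM_STOPS = 6  # 4 core/cache stops + graphics + system agent (assumed layout)

def simulate(cycles, inject_requests):
    """inject_requests: dict stop -> list of (message, destination_stop)."""
    ring = deque([None] * NUM_STOPS)   # one slot per ring stop
    pending = {s: deque(msgs) for s, msgs in inject_requests.items()}
    delivered = []
    for _ in range(cycles):
        ring.rotate(1)                      # all slots advance one hop
        for stop in range(NUM_STOPS):
            slot = ring[stop]
            if slot is not None and slot[1] == stop:
                delivered.append(slot[0])   # message arrived; free the slot
                ring[stop] = None
            elif slot is None and pending.get(stop):
                ring[stop] = pending[stop].popleft()  # grab the free slot
    return delivered
```

Because each interface only needs to know whether the passing slot is free, no global scheduling state is required – which is precisely what makes the arbitration cheap to scale with agent count.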
The Sandy Bridge ring has six agents – the four cores and cache slices share interfaces, and there is a separate interface for the graphics and another for the system agent. Like Nehalem-EX, the rings are routed in upper metal layers above the L3 cache to reduce area impact.
The L3 cache for Sandy Bridge is shared by the cores, the integrated GPU and the display and media controller. Each of these agents accesses the L3 cache via the ring. It is no longer a single unified entity as in Nehalem and Westmere, but is instead distributed and partitioned for higher bandwidth and associativity (similar to Nehalem-EX and Westmere-EX). There is one slice of the L3 cache for each core, and each slice can provide half a cache line (32B) to the data ring per cycle. All physical addresses are distributed across the cache slices with a single hash function. Partitioning data between the cache slices simplifies coherency, increases the available bandwidth and reduces hot spots and contention for cache addresses.
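The key property of the single hash function is that every agent computes the same slice for a given physical address, so each cache line lives in exactly one slice. Intel does not document the actual hash; the XOR-fold below is a purely illustrative stand-in, assuming 64-byte lines and four slices.

```python
# Illustrative slice-selection hash (NOT Intel's actual function).
# All bits above the 64-byte line offset are XOR-folded down to a
# 2-bit slice index, so a full cache line always maps to one slice.

NUM_SLICES = 4
LINE_BITS = 6  # 64-byte cache lines

def slice_for_address(phys_addr):
    line = phys_addr >> LINE_BITS       # line address; offset bits ignored
    h = 0
    while line:
        h ^= line & (NUM_SLICES - 1)    # fold 2-bit chunks together
        line >>= 2
    return h
```

Any hash with good dispersion works for the stated goals: consecutive lines spread across slices (more bandwidth, fewer hot spots), while each address has a single home slice (simpler coherency).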
The ring interface block for each slice of the cache contains an independent cache controller. The cache controller is responsible for servicing requests for the mapped physical addresses and enforcing coherency and ordering. Like Nehalem, the L3 cache is inclusive and the cache controller maintains a set of ‘core valid bits’ that act as a snoop filter (there is also a valid bit for the GPU now). The cache controller also communicates with the system agent to service L3 cache misses, inter-chip snoop traffic and uncacheable accesses.
The L3 cache for Sandy Bridge is scalable and implementation-specific. The write-back L3 cache for high-end client models is 8MB, and each 2MB slice is 16-way associative. The load-to-use latency for the L3 cache was reduced from 35-40 cycles in Nehalem down to 26-31 cycles in Sandy Bridge (latency varies slightly with the number of hops on the ring).
The latency decreased due to several factors. First, each slice of the L3 is much smaller than Nehalem’s unified 8MB L3 cache, so the latency to access the tags and data arrays is lower. Second, the ring and L3 now reside in the same clock and voltage domain as the cores (and the core clock is certainly faster than the uncore clock in Nehalem). There is a latency penalty for signals crossing into a new voltage and clock domain; this penalty is determined by the ratio between the two frequencies and can be several cycles. Placing the cache and ring in the same domain as the cores avoids this problem. The latency of the ring will increase as more agents are attached; each hop on the ring takes 1 cycle, so the latency depends on the relative position of the requesting core and the responding cache slice.
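One plausible way to decompose the quoted 26-31 cycle range is a fixed slice-access component plus one cycle per ring hop. The split below is an assumption consistent with the numbers in the text, not a figure from Intel: a six-stop ring gives hop distances of 0-5, which maps neatly onto the 5-cycle spread.

```python
# Hypothetical decomposition of L3 load-to-use latency: a fixed base
# plus 1 cycle per ring hop (the base/hop split is illustrative).

BASE_LATENCY = 26  # cycles when the core and slice share a ring stop (assumed)

def l3_latency(core_stop, slice_stop, stops=6):
    hops = (slice_stop - core_stop) % stops  # unidirectional ring distance
    return BASE_LATENCY + hops
```

This also makes concrete why latency is position-dependent: a core next to the target slice sees the minimum, while a request to the farthest slice pays the full hop count.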
The first Sandy Bridge implementations have four cores and four cache slices. At 3.4GHz, this translates into a theoretical 435.2GB/s of bandwidth from the last level cache, if all cores are accessing the nearest bank. The bandwidth scales up or down with the frequency and number of cores. Smaller slices (e.g. for low-end parts) are created by eliminating or disabling ways of the cache (at 4-way or 512KB granularity).
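The 435.2GB/s figure follows directly from the parameters already given: four slices, each delivering half a cache line (32B) per cycle, at 3.4GHz.

```python
# Peak theoretical L3 bandwidth, assuming every core hits its nearest slice.
slices = 4
bytes_per_cycle = 32          # each slice supplies half a 64B line per cycle
freq_ghz = 3.4
peak_gbs = slices * bytes_per_cycle * freq_ghz   # 4 * 32 * 3.4 = 435.2 GB/s
```

Since both slice count and frequency are factors in the product, the bandwidth scales linearly with either, as the text notes.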
The L3 cache is allocated between three different domains (CPU, graphics and non-coherent) with way-level granularity, and ways can be shared between multiple domains. The cache controllers monitor various events and are responsible for maintaining quality of service. To reduce the overhead of sharing, each of the three domains has separate coherency and consistency semantics. The CPU domain uses the familiar x86 model. The graphics domain is semi-coherent and controlled by the drivers; data is flushed into the coherent CPU domain by synchronization instructions. The non-coherent domain is also used by the graphics drivers for certain accesses. In this way, the performance and power overhead for coherency is only imposed when it is needed.
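Way-level allocation with sharing can be pictured as one bitmask of the 16 ways per domain, where a way may appear in more than one mask. The specific masks below are invented for illustration; the real allocation is managed by hardware and the graphics drivers.

```python
# Illustrative way-granular partition of a 16-way L3 slice among the
# three domains. Masks are hypothetical; ways 4-7 are shared between
# the CPU and graphics domains as an example of cross-domain sharing.

WAYS = 16
domain_mask = {
    "cpu":          0b1111111111110000,  # ways 4..15
    "graphics":     0b0000000011111100,  # ways 2..7 (overlaps CPU on 4..7)
    "non_coherent": 0b0000000000000011,  # ways 0..1
}

def allowed_ways(domain):
    """Ways in which this domain may allocate lines."""
    mask = domain_mask[domain]
    return [w for w in range(WAYS) if mask >> w & 1]
```

The coherency semantics ride on top of this partition: a line in a CPU-domain way obeys the x86 memory model, while graphics-domain data only becomes visible to the CPU domain when the driver flushes it at a synchronization point.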