Server Processor Anatomy
Figure 1 shows an annotated die photo of Sandy Bridge EP, Intel’s mainstream server microprocessor and the de facto industry standard. The chief components in server processors are the actual CPU cores, cache, I/O, and system infrastructure.
The CPU cores are responsible for the actual computation and typically include several dedicated per-core caches: an instruction cache, a data cache, and a unified L2 cache. L1 caches are small and optimized for latency, ranging from 16KB to 128KB, while L2 caches range from 256KB up to 2MB. Some families, such as Sun's Niagara, have only L1 caches; some designs have separate L2 instruction and data caches; and yet others share the L2 cache between pairs of cores (e.g., AMD's Bulldozer). Some designs also include dedicated acceleration hardware, which, depending on the arrangement, might be considered part of the CPU. Server microprocessors typically include 4-16 cores.
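The latency-versus-capacity trade-off behind this hierarchy is easy to see with a standard average memory access time (AMAT) calculation. The sketch below uses purely illustrative latencies and hit rates, not figures for any specific processor:

```python
# Back-of-envelope average memory access time (AMAT) for a two-level
# per-core cache hierarchy. All numbers here are illustrative
# assumptions, not measurements of any particular chip.

def amat(l1_hit_time, l1_miss_rate, l2_hit_time, l2_miss_rate, mem_time):
    """Average cycles per access for an L1 -> L2 -> memory hierarchy."""
    # On an L1 miss, pay the L2 lookup; on an L2 miss, also go to memory.
    l2_penalty = l2_hit_time + l2_miss_rate * mem_time
    return l1_hit_time + l1_miss_rate * l2_penalty

# Example: 4-cycle L1, 12-cycle L2, 200-cycle memory access.
cycles = amat(l1_hit_time=4, l1_miss_rate=0.05,
              l2_hit_time=12, l2_miss_rate=0.40, mem_time=200)
print(f"{cycles:.1f} cycles per access")  # 4 + 0.05*(12 + 0.4*200) = 8.6
```

Even a modest L1 hit rate keeps the average close to the L1's latency, which is why the small, latency-optimized L1 sits closest to the core.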
The last level cache (LLC) in servers is usually shared between all cores. While the per-core caches are meant to hold frequently used data, the LLC serves a different purpose. The LLC is primarily intended to reduce costly off-chip communication, such as memory accesses and coherency traffic. Accordingly, the LLC is optimized for density and bandwidth. Most designs are based on high density SRAM cells and are in the neighborhood of 16-32MB, although IBM is using eDRAM to achieve capacities as large as 80MB. Some LLC designs are inclusive of the smaller caches (e.g., Intel and IBM’s zSeries) for reliability and simplicity, whereas other teams prefer exclusive caches (e.g., Sun and AMD) for greater effective capacity.
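The capacity consequence of the inclusive-versus-exclusive choice can be made concrete with simple arithmetic. The configuration below (8 cores, 256KB L2 per core, 20MB LLC) is hypothetical, chosen only to illustrate the difference:

```python
# Effective on-chip cache capacity under inclusive vs. exclusive LLC
# policies. Sizes are illustrative (8 cores, 256KB L2 each, 20MB LLC).

def effective_capacity_mb(num_cores, l2_kb, llc_mb, inclusive):
    total_l2_mb = num_cores * l2_kb / 1024
    if inclusive:
        # Inclusive: every L2 line is duplicated in the LLC, so the
        # LLC capacity bounds the unique data cached on chip.
        return llc_mb
    # Exclusive: L2s and LLC hold disjoint lines, so capacities add.
    return llc_mb + total_l2_mb

print(effective_capacity_mb(8, 256, 20, inclusive=True))   # 20.0 MB
print(effective_capacity_mb(8, 256, 20, inclusive=False))  # 22.0 MB
```

The exclusive design wins a couple of megabytes of effective capacity, while the inclusive design simplifies coherency: a snoop that misses the LLC is guaranteed not to hit any core's private caches.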
In many respects, the I/O is the most distinguishing aspect of server microprocessors, and it comes in three flavors: memory, coherent, and non-coherent. Server memory interfaces are usually 4 channels wide (~40-50GB/s for DDR3), but some include as many as 8 memory channels. Mainstream designs favor DDR interfaces, while high-end designs often use high-speed serial interfaces to specialized DIMMs for better pin efficiency, capacity, and bandwidth. Most server microprocessors integrate coherent interconnects for creating multi-socket shared memory systems. Intel QuickPath and AMD HyperTransport are the most common examples, and target 32-40GB/s per link. Server chips have 2-6 coherent links depending on the market; highly scalable servers require greater connectivity and tend toward the top of that range. Non-coherent I/O is used to connect the processor to external devices such as GPUs, networking, and storage. Modern designs have almost entirely settled on PCI-Express; the latest x16 gen 3 interface provides roughly 32GB/s, although some systems use serial interfaces to PCI-Express bridge chips to decouple processor upgrades from I/O upgrades.
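The bandwidth figures above follow directly from the interface parameters. As a sanity check, the sketch below derives them from first principles (a 64-bit DDR3 channel at an assumed 1600MT/s, and PCIe gen 3's 8GT/s per lane with 128b/130b encoding, counting both directions of the full-duplex link):

```python
# Sanity-check the quoted bandwidth figures from interface parameters.

def ddr3_gb_s(channels, mt_per_s):
    """Peak DDR3 bandwidth: 8-byte (64-bit) channel per transfer."""
    return channels * mt_per_s * 8 / 1000  # decimal GB/s

def pcie3_gb_s(lanes):
    """Peak PCIe gen 3 bandwidth, both directions combined."""
    per_lane = 8 * 128 / 130 / 8  # 8 GT/s, 128b/130b -> ~0.985 GB/s/dir
    return lanes * per_lane * 2

print(ddr3_gb_s(4, 1600))        # 51.2  -- in line with ~40-50 GB/s
print(round(pcie3_gb_s(16), 1))  # 31.5  -- "roughly 32GB/s" for x16
```

Four DDR3-1600 channels land at 51.2GB/s peak (sustained bandwidth is lower, hence the ~40-50GB/s figure), and an x16 gen 3 link totals about 31.5GB/s across both directions.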
The system infrastructure is perhaps the most imprecisely defined portion of a server microprocessor, encompassing the rest of the chip. It usually consists of shared functionality. One major element is a coherent on-die fabric that connects the cores to the LLC and other components, although in some designs (e.g., Intel's ring) that area is part of the LLC itself rather than a discrete block. The memory and cache coherency controllers are closely coupled to the on-die fabric for latency reasons. The more common fabric topologies are crossbars and rings, although some systems use a combination of the two. Power management and debugging are small but crucial elements as well. Some servers also incorporate directory caches to reduce coherency traffic, and specialized hardware for encryption or networking.
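The crossbar-versus-ring trade-off is essentially hops versus wiring: a crossbar reaches any stop in one hop but its area grows quadratically with stop count, while a ring's average hop count grows with the number of stops. A small sketch of that scaling, purely illustrative:

```python
# Average hop count between two distinct stops on a bidirectional ring,
# versus a crossbar (always one hop). Illustrative topology math only.

def ring_avg_hops(n_stops):
    """Mean shortest-path distance over all distinct stop pairs."""
    total = sum(min(d, n_stops - d) for d in range(1, n_stops))
    return total / (n_stops - 1)

for n in (4, 8, 16):
    print(n, round(ring_avg_hops(n), 2))  # grows roughly as n/4
```

This is why rings suit moderate core counts (cheap wiring, acceptable latency) while larger designs move to combinations of topologies.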
The balance between these four components varies based on a number of factors. Generally, high-end processors are designed for scalable systems with 16-64 sockets. These designs must dedicate more area to I/O and system infrastructure to scale efficiently. For example, directory caches are unnecessary for smaller 2-4 socket servers, but a huge benefit for larger systems. Similarly, a large LLC is much more important for scalable systems, because an LLC miss must snoop all the other sockets. High core count processors tend to require more area for system infrastructure, as the on-chip fabric becomes more critical for performance.
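The interaction between LLC size and socket count can be quantified: in a snoopy protocol, each LLC miss probes every other socket, so total snoop traffic scales with the miss rate times (sockets − 1). The access rates and miss rates below are illustrative assumptions:

```python
# Why a large LLC matters more as socket count grows: each LLC miss
# snoops all other sockets, so snoop traffic scales with
# misses * (sockets - 1). All rates here are illustrative.

def snoops_per_sec(accesses_per_sec, llc_miss_rate, sockets):
    return accesses_per_sec * llc_miss_rate * (sockets - 1)

for sockets in (2, 4, 16, 64):
    small_llc = snoops_per_sec(1e9, 0.10, sockets)  # small LLC, 10% miss
    large_llc = snoops_per_sec(1e9, 0.04, sockets)  # large LLC,  4% miss
    print(f"{sockets:2d} sockets: {small_llc:.1e} vs {large_llc:.1e} snoops/s")
```

At 2 sockets the absolute difference is modest, but at 64 sockets the same miss-rate reduction eliminates billions of snoops per second, which is exactly the pressure that justifies large LLCs and directory caches in scalable designs.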
Having explored the major conceptual elements of server microprocessors, the next step is quantitatively examining the balance struck in recent server designs.