Microservers must Specialize to Survive


Server Processor Analysis

We have collected and analyzed data on over 20 server microprocessors released from 2006 to 2012. These chips span a wide range of instruction sets (x86, Itanium, SPARC, zArch, and several flavors of PowerPC), vendors (Intel, AMD, Sun/Oracle, Fujitsu, and IBM), and process nodes (90nm down to 32nm). The area of each major component was measured as a percentage of total die area and is shown in two cohorts in Figures 2 and 3.

Figure 2. Server Die Area with External Memory Controller

Conceptually, the CPU region can be thought of as the proportion of the die allocated to computation. The core area includes the per-core caches, with a few exceptions. Since AMD’s L2 caches are very large (and use the highest-density SRAM cells) and are exclusive of the LLC, they are counted as part of the overall ‘cache’ region. The variable capacity of the Bulldozer L2 cache also suggests a natural separation from the core. The massive 1.5MB L2 caches in the IBM z196 are accounted for similarly, in part because they actually reside in a different clock domain. The z196 co-processors and the Bulldozer FPU are shared between adjacent cores and are included as part of the core area.

In most cases, the cache region is simply the LLC and its associated controllers. In the cases mentioned above, it also includes large L2 caches where the evidence strongly suggests a clear separation between the core and L2. This region is meant to reflect the area spent on accelerating memory accesses. Note that IBM’s z6 and z196 have large external eDRAM caches and external coherency controllers in addition to the LLC region.

Because the line between I/O and system infrastructure is drawn inconsistently across different designs, the two components are grouped together into a single ‘system’ region. For example, some teams count only the physical interfaces as I/O, whereas others include the memory and PCI-E controllers. Generally, this region is responsible for tying together the cores, caches, and external interfaces. The system region for some Intel designs understates the true system area, as most Intel server processors (excluding Tulsa, Dunnington, Montecito, and Tukwila) use a large ring fabric that runs over the LLC and is not counted as system area. Most other designs use a crossbar or fabric with a modest amount of dedicated area; for example, the POWER7 and POWER7+ on-chip interconnect is approximately 5-6% of the die. In some cases, the system portion also includes grout, or unused areas of the chip.
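To make this bookkeeping concrete, the short Python sketch below bins hypothetical per-component area measurements into the three regions and normalizes them to percentages of total die area. The component names and mm² figures are invented for illustration and do not correspond to any actual die or to the measurements behind the figures.

    # Illustrative only: component names and areas are hypothetical.
    from collections import defaultdict

    # Map each measured die component to one of the three regions.
    REGION_OF = {
        "cores_with_l1": "core",
        "shared_fpu": "core",            # shared units counted with the cores
        "llc_and_controllers": "cache",
        "large_l2": "cache",             # only when clearly separate from the core
        "memory_controller": "system",
        "pcie_and_io_phys": "system",
        "interconnect": "system",
        "unused_grout": "system",
    }

    def region_breakdown(component_areas_mm2):
        """Sum per-component areas into core/cache/system and return percentages."""
        totals = defaultdict(float)
        for component, area in component_areas_mm2.items():
            totals[REGION_OF[component]] += area
        die_area = sum(totals.values())
        return {region: 100.0 * area / die_area for region, area in totals.items()}

    # Hypothetical example die (areas in mm²)
    example = {
        "cores_with_l1": 140.0,
        "shared_fpu": 10.0,
        "llc_and_controllers": 120.0,
        "large_l2": 15.0,
        "memory_controller": 40.0,
        "pcie_and_io_phys": 45.0,
        "interconnect": 25.0,
        "unused_grout": 5.0,
    }
    print(region_breakdown(example))  # {'core': 37.5, 'cache': 33.75, 'system': 28.75}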

Figure 3. Server Die Area with Integrated Memory Controller

The data in Figures 2 and 3 is quite interesting. First, it shows a huge split between different eras of server microprocessors. Earlier server designs with external memory controllers and coherency logic predominantly spent area on the cores and cache, as shown in Figure 2 (note that this includes modern zArch designs, even those with on-die memory controllers, due to the consistency model). Comparing this to the modern servers in Figure 3 shows the impact of integrated memory controllers and coherency logic. While older chips devoted less than 20% of their area to the system region, newer designs average 36%. Newer design teams have shifted area primarily from the cache (but also a small amount from the CPU cores) to integrate system-level functionality.

Focusing on Figure 3, the data shows that the allocation of area between the regions of the chip is very stable. Averaging across all 18 designs, the core area is 34%, the cache 30%, and the system 36%. Moreover, the variation across designs is fairly tight. The one real outlier is the POWER6, which had a surprisingly low core count for its time (2 cores in 2007, when 4 cores was becoming the norm) and thus a rather large amount of L2 cache (and no on-die L3). The area breakdown does not change significantly with time or process node, which suggests that there really is a sweet spot for servers.
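As a rough illustration of how the cohort averages and the spread across designs could be computed, here is a minimal Python sketch; the per-chip percentages below are placeholders, not the actual measurements behind Figure 3.

    # Illustrative only: the per-chip percentages are placeholders.
    from statistics import mean, pstdev

    # (core %, cache %, system %) for each design in a cohort
    breakdowns = [
        (34, 31, 35),
        (36, 28, 36),
        (32, 33, 35),
        (33, 29, 38),
        (35, 30, 35),
    ]

    for i, region in enumerate(("core", "cache", "system")):
        values = [b[i] for b in breakdowns]
        print(f"{region:6s} mean={mean(values):.1f}%  stdev={pstdev(values):.1f}%")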

Even more intriguing, Figure 3 includes a number of alternative architectures. Blue Gene/Q and the Niagara family both use cores with low single-threaded performance to better target HPC and commercial workloads, respectively. The area breakdown for these five alternative processors is barely different from that of the truly general-purpose server microprocessors: the average areas for the cores, cache, and system are 30%, 28%, and 42%, respectively. Shockingly, these ‘alternative’ architectures are not radically different at all. In essence, the data implies that the overall nature of server processors does not vary significantly; the architects merely replaced a few large and complex cores with many smaller and simpler ones.


