Transistor Count and Density are Determined by Design Balance
An even bigger influence on transistor count and density is the actual composition of the chip. Every modern design is built from some combination of logic for computation, memory (typically SRAM) for storage, and I/O for communication. However, these three constituents differ radically in terms of density, as illustrated in Table 1. Poulson and Tukwila are platform compatible and share the same overall goals of delivering high performance and the highest levels of reliability for mission-critical servers.
The processors comprise four major regions: CPU cores, L3 cache, the system interface, and I/O. Based on the reported information, Poulson also includes 18mm² of die area for whitespace or other functions. The CPU core region includes the cores and performance-optimized L1 and L2 caches and is dominated by high-speed logic that targets operation over 1.7GHz for Tukwila and 2.5GHz for Poulson. The large L3 caches (24MB for Tukwila and 32MB for Poulson) are designed for maximum capacity and use the densest 6-transistor (6T) SRAM cells possible, with dedicated power rails to ensure stability. The system region includes an assortment of functions: a crossbar for communicating I/O and memory traffic across the die, QPI and memory controllers, home agents for the directory-based coherency protocol and directory caches, and power management units. The system region is generally not as dense because the logic is fixed-frequency and many of the biggest components (e.g., the crossbar) are dominated by large high-bandwidth busses crossing the die, rather than transistors. Lastly, the I/O region contains the physical interfaces for external communication, which are implemented using high-speed serial interconnects including four full-width QPI links, two half-width QPI links, and two FB-DIMM2 or Scalable Memory Interfaces (SMI) that fan out to four channels of memory. The interconnects use differential signaling and total around 600 pins.
Quantitatively, these two processors illustrate crucial trends that hold true across all major chip designs. First of all, the variation in transistor density between different regions of the chip is enormous: it exceeds 20X and dwarfs the factor of 2X that is associated with a single generation of improvement according to Moore’s Law. Naturally, the cache region, which primarily comprises ultra-dense SRAM, is the densest and makes up most of the transistors in each design. The cache is about 3-5X as dense as the computational logic in the cores, again larger than the gain from a single generation of scaling. The I/O is the least dense portion of the two designs, because it contains many delicate analog circuits such as PLLs and DLLs, digital filters, and the large, high-voltage I/O transistors that are used to transmit and receive off-chip data. Additionally, many I/O regions must occupy enough of the edges of the chip to connect all the pins, so their area is determined by the number of pins, not the density of the circuits.
The data above clearly demonstrate that the transistor density of modern chips depends strongly on the purpose and composition of the chip. To take an extreme example, imagine a 32nm design that is based on Poulson, but with no L3 cache: it would have a transistor density of around 2.57M/mm², well under half the density of the actual Poulson design. In the other direction, a hypothetical version of Poulson with just compute and cache and no I/O or system regions would have a transistor density of 9M/mm².
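The arithmetic behind these hypotheticals is simply total transistors divided by total area, taken over whichever regions are included. A minimal Python sketch of that calculation follows. The region areas and transistor counts are made-up round numbers chosen for easy arithmetic, not the Table 1 values for Poulson or Tukwila, and the names Region and chip_density are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One region of a die; all figures used below are illustrative placeholders."""
    name: str
    area_mm2: float        # die area of the region in mm^2
    transistors_m: float   # transistor count in millions

    @property
    def density(self) -> float:
        """Transistor density of this region in millions per mm^2."""
        return self.transistors_m / self.area_mm2


def chip_density(regions, exclude=()):
    """Overall density (M/mm^2) of the regions that are not excluded."""
    kept = [r for r in regions if r.name not in exclude]
    total_area = sum(r.area_mm2 for r in kept)
    total_transistors = sum(r.transistors_m for r in kept)
    return total_transistors / total_area


# Placeholder numbers (NOT from Table 1), picked only to show the effect.
regions = [
    Region("cores", area_mm2=200.0, transistors_m=600.0),    # 3.0 M/mm^2
    Region("cache", area_mm2=100.0, transistors_m=1200.0),   # 12.0 M/mm^2
    Region("system", area_mm2=100.0, transistors_m=150.0),   # 1.5 M/mm^2
    Region("io", area_mm2=100.0, transistors_m=50.0),        # 0.5 M/mm^2
]

full = chip_density(regions)
no_cache = chip_density(regions, exclude={"cache"})
compute_and_cache = chip_density(regions, exclude={"system", "io"})
spread = max(r.density for r in regions) / min(r.density for r in regions)

print(f"full chip:          {full:.1f} M/mm^2")               # 4.0
print(f"without L3 cache:   {no_cache:.1f} M/mm^2")           # 2.0
print(f"compute and cache:  {compute_and_cache:.1f} M/mm^2")  # 6.0
print(f"densest vs. least dense region: {spread:.0f}X")       # 24X
```

Even with these invented figures, removing the cache halves the overall density while dropping the system and I/O regions raises it by half, which is the same qualitative swing described for the hypothetical Poulson variants.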
Table 2 contains details on several chips manufactured on TSMC’s 12nm and 7nm process nodes that highlight the impact of design composition on density. As a first illustration, AMD’s Radeon VII and RX 5700 are relatively similar GPU designs on the same node and have nearly identical density. On the other hand, AMD’s Renoir and Nvidia’s A100 are about 1.5X the density of these GPUs, perhaps reflecting a focus on density, or potentially more mature design tools. Another useful comparison is Nvidia’s V100 GPU and the NVSwitch, which is an 18-port NVLink switch. They are on the same node, but the latter is primarily I/O and on-die routing for NVLink; as a consequence, the V100 is 1.37X denser than the NVSwitch.
Lastly, the two smartphone SoCs are 1.35X-2.29X denser than the rest of the 7nm processors. This impressive density is possibly due to the different optimization targets: smartphone SoCs are tailored for low cost and high density, while AMD’s processors tend to target high performance. Additionally, Apple and HiSilicon are larger and more profitable companies that can afford larger design teams and greater optimization efforts. However, it is also possible that the transistor counts and density for the mobile SoCs are the result of a different form of transistor accounting. The last column in Table 2 indicates how the vendor is counting transistors, which we will discuss in greater detail on the next page.