Caught on the Wire
The most difficult problem facing MPU architects in the future is the ever-widening mismatch between transistors and interconnect: falling process feature size increases transistor operating speed but does little for the performance of wires, given constant or slowly growing die size and increasing CPU complexity. A typical process shrink reduces feature size in all three dimensions by ~30% (i.e. dimensions are multiplied by 0.7). As a first-order approximation, transistor scaling causes circuit delay time to also fall by 30%. The propagation delay of most wires on a chip is dominated by the so-called RC delay, the product of the wire’s total distributed resistance and capacitance. The capacitance per unit length of a wire changes very little, while its resistance per unit length approximately doubles for a 0.7 factor shrink (the wire’s cross-sectional area is multiplied by 0.7² ≈ 0.49).
The effect of process shrinks on wire scaling depends on the type of wire. A local wire is a relatively short wire, internal to a functional block that is shrunk with its circuit and layout topology intact. A wire that tends to run “3 transistors over and 7 transistors up” will drop in length by the same factor as the shrink. Because RC delay is quadratic with respect to length, the shortening of the wire cancels out the increase in resistance per unit length, so the RC delay is approximately constant. Although constant wire delay in the context of a 30% reduction in transistor delay is a relative increase, local wires are in general so short that even after a shrink their delay is usually much less than transistor delay. It is global wires, those that run an appreciable and relatively fixed fraction of die size, that really cause problems. Even if the complexity growth in MPUs were such that die size stayed constant in a shrink, the RC delay of global wires would approximately double while transistor delay falls by about 30%. If nothing were done to address the problem, MPU designers would face the prospect of dealing with global wires whose propagation delay increased relative to transistor speed by a factor of 2.9 (roughly 2/0.7) for each process shrink!
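The arithmetic in the two paragraphs above can be checked with a minimal sketch. It assumes the same first-order model the text uses (resistance per unit length scaling with the inverse of cross-sectional area, constant capacitance per unit length, delay quadratic in length). The function and variable names are illustrative only.

```python
# First-order wire scaling for one ideal process shrink.
# This is the simplified model from the text, not a circuit simulation.
SHRINK = 0.7  # linear scaling factor per shrink


def rc_delay_factor(length_factor, shrink=SHRINK):
    """Factor by which a wire's RC delay changes after one shrink.

    Resistance per unit length grows as 1/shrink**2 because the wire's
    cross-section shrinks in two dimensions; capacitance per unit length
    is treated as constant. Distributed RC delay is proportional to
    (r per length) * (c per length) * length**2.
    """
    r_per_length = 1.0 / shrink**2
    c_per_length = 1.0
    return r_per_length * c_per_length * length_factor**2


transistor = SHRINK                    # transistor delay falls by ~30%
local_wire = rc_delay_factor(SHRINK)   # wire length shrinks with its block
global_wire = rc_delay_factor(1.0)     # wire length tracks a constant die size

print(f"local wire RC delay factor:  {local_wire:.2f}")
print(f"global wire RC delay factor: {global_wire:.2f}")
print(f"global wire relative to transistor: {global_wire / transistor:.2f}")
```

Under these assumptions the local wire factor comes out at about 1.0, the global wire factor at about 2.0, and the global wire slows by about 2.9 relative to the transistor, matching the figures quoted above.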
Fortunately, there are several methods process engineers and chip designers use to partially counter the wire/transistor scaling imbalance. Some are one-time fixes that change chip composition to reduce interconnect resistance R (copper instead of aluminum) or capacitance C (lower-k materials separating wires). However, the benefit of each is rather modest given the long-term trends.
A common design-based approach to combat poor wire performance scaling is the insertion of repeaters. If you take a long RC delay dominated wire of length L and break it into two segments of L/2 by inserting a repeater (inverter or buffer) to amplify the signal, the RC delay of each L/2 segment is 1/4 that of the original wire, so the two segments together contribute only half the original RC delay. The total RC delay reduction obtained is greater than the extra transistor delay introduced by adding the inverter or buffer for even moderately long wires. Breaking wires into shorter segments using repeater insertion also has the benefit of countering inductive effects found in long wires. An example of this practice is found in the Madison Itanium 2. Although this chip is relatively large, at 374mm², all signal wires are broken into segments under 2mm in length using repeater insertion. Unfortunately, repeaters consume power, and the number of repeaters needed to counter the interconnect scaling imbalance rises geometrically with shrinking feature size. Even in the 180nm McKinley and POWER4 processors, buffering global signals consumes 3% of the power and requires about 100,000 repeaters, respectively.
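The repeater trade-off described above can be sketched in a few lines. This is a deliberately simplified model: it assumes equal-length segments, a fixed delay per repeater, and ignores driver resistance and load capacitance. The delay values are hypothetical numbers in arbitrary time units, chosen only to show the shape of the trade-off.

```python
def segmented_delay(wire_rc_delay, segments, repeater_delay):
    """Total delay of a wire split into equal segments by repeaters.

    Each segment's RC delay is wire_rc_delay / segments**2 because RC delay
    is quadratic in length, so the wire portion totals
    wire_rc_delay / segments. Splitting into N segments needs N - 1
    repeaters, each adding a fixed transistor delay.
    """
    if segments < 1:
        raise ValueError("segments must be at least 1")
    wire_part = segments * (wire_rc_delay / segments**2)
    return wire_part + (segments - 1) * repeater_delay


# Hypothetical values, arbitrary time units.
LONG_WIRE_RC = 100.0
SHORT_WIRE_RC = 8.0
REPEATER_DELAY = 5.0

for n in (1, 2, 4, 8):
    total = segmented_delay(LONG_WIRE_RC, n, REPEATER_DELAY)
    print(f"long wire, {n} segment(s): {total}")

# For a short wire the repeater costs more than it saves.
print("short wire, unbroken:", segmented_delay(SHORT_WIRE_RC, 1, REPEATER_DELAY))
print("short wire, 2 segments:", segmented_delay(SHORT_WIRE_RC, 2, REPEATER_DELAY))
```

With these made-up numbers the long wire improves from 100 to 55 with one repeater and to 40 with three, then gets worse again at seven, while the short wire is slower with a repeater than without. That is the sense in which repeaters pay off only for moderately long wires.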
One powerful technique that is used to improve interconnect scaling is sizing by layers. This is done by arranging interconnect so that the thinnest and narrowest layers of metallization are placed at the lowest level (closest to the chip substrate and transistors). The metal layers are gradually scaled up in thickness, width, and separation as the layers are stacked up one at a time during fabrication. The lowest layers have the highest RC delay factor and are only used for local interconnect. The topmost layers are the thickest and coarsest; hence they have the lowest resistance and RC delay factors. These upper layers are used for power and clock distribution, as well as global signal routing. The real potential of this scheme is realized when adding layers of interconnect in a process shrink. The new, more finely sized and spaced wires can conceptually be added at the lowest level with a 0.7 scaling factor, while the topmost level(s) keep approximately the same size and spacing as in the previous generation process. Therefore, the RC delay of the topmost layer(s) of interconnect remains approximately the same after the shrink. This process is illustrated in Figure 2.
Figure 2 – Interconnect Scaling in an Ideal Shrink with Extra Metal Layers
The interconnect layers shown use alternating orthogonal routing (M1 wires run perpendicular to the image while M2 wires run across the image, and so on). This technique preserves global interconnect performance despite process shrinks. Ideally, it slows the growth of the gap between global signal propagation delay and transistor switching speed from an approximately cubic relationship with the linear scaling factor to an approximately linear one. That is, wires are still getting slower with respect to transistors, but not nearly as quickly (1.4x slower instead of 2.9x slower per shrink).
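To see how much the cubic-versus-linear distinction matters over several process generations, the per-shrink factors can be compounded. This is a minimal sketch assuming the ideal 0.7 shrink and the first-order scaling model used in the text. The function name and the choice of four generations are illustrative.

```python
SHRINK = 0.7  # linear scaling factor per shrink


def relative_wire_slowdown(generations, exponent):
    """Global wire delay relative to transistor delay after N ideal shrinks.

    exponent=3: every layer is shrunk, so wire RC delay grows as 1/s**2
                while transistor delay falls as s (cubic overall).
    exponent=1: top layers keep their dimensions, so wire RC delay is
                constant and only the faster transistors widen the gap.
    """
    return (1.0 / SHRINK) ** (exponent * generations)


for gen in range(1, 5):
    unscaled = relative_wire_slowdown(gen, exponent=3)
    layered = relative_wire_slowdown(gen, exponent=1)
    print(f"after {gen} shrink(s): all layers shrunk {unscaled:.1f}x, "
          f"top layers preserved {layered:.1f}x")
```

After four ideal shrinks the gap is roughly 72x if every layer is shrunk, against roughly 4.2x if the top layers keep their dimensions.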
Unfortunately, even this amelioration cannot be kept up indefinitely. Adding extra layers of metal to a process increases manufacturing costs and reduces yields. In addition, the number of high quality wires per unit area stays constant in this scheme while the number of transistors per unit area doubles with each shrink. This provides MPU architects with an unpalatable choice between severely limiting the growth of global signal flow within their processors or increasingly using the slower intermediate interconnect layers. The impact on wire performance from using intermediate or local interconnect layers is shown in Figure 3. The red, green, and blue colored lines indicate the performance of local, intermediate, and global signal layers in a ~90nm process with a conventional oxide dielectric. The thick black line indicates the point at which signal propagation in a wire changes from being RC delay limited to being limited by the speed of electromagnetic wave propagation along a transmission line (sometimes incorrectly called the “speed of light” limit; the propagation speed is really about half the speed of light in air or a vacuum due to the relative permittivity of the oxide dielectric separating the conductors in this example).
Figure 3 – Wire Performance versus Size
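The "about half the speed of light" figure for the transmission line limit follows from the standard relation v = c / sqrt(εr). The sketch below assumes a relative permittivity of 3.9, a commonly quoted value for silicon dioxide. The text does not state the value used for the figure, and the wire lengths are hypothetical examples.

```python
import math

C_VACUUM = 299_792_458.0  # speed of light in vacuum, m/s
EPS_R_OXIDE = 3.9         # ASSUMED relative permittivity of SiO2; not from the text


def propagation_speed(eps_r):
    """Wave propagation speed in a dielectric of relative permittivity eps_r."""
    return C_VACUUM / math.sqrt(eps_r)


def time_of_flight_ps(length_mm, eps_r=EPS_R_OXIDE):
    """Minimum signal transit time in picoseconds for a wire of given length."""
    seconds = (length_mm * 1e-3) / propagation_speed(eps_r)
    return seconds * 1e12


fraction = propagation_speed(EPS_R_OXIDE) / C_VACUUM
print(f"propagation speed is {fraction:.2f} of c")

# Hypothetical wire lengths in mm.
for length_mm in (2.0, 10.0, 20.0):
    print(f"{length_mm} mm wire: at least {time_of_flight_ps(length_mm):.1f} ps")
```

No amount of repeater insertion or wire widening can push a signal below this transit time, which is why it appears as a floor in the figure.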
In reality, the interconnect performance issue is more complex than presented here. Real processes don’t shrink all features, either minimum sizes or minimum separations, on all layers by the same ideal 0.7 factor. More importantly, there are numerous tricks that physical chip designers can use to optimize interconnect performance, especially for bused signals. For example, simply increasing the physical separation between a critical signal wire and its closest neighbors can significantly improve performance by reducing parasitic capacitance per unit length. However, this technique is of little practical use on a large scale because it tends to push up MPU size which in turn increases average wire length. The general trend is as clear as it is unavoidable. The performance of interconnects will fall increasingly behind that of transistors as process feature size shrinks and this factor will more strongly influence the design of each successive generation of MPUs.