This article discusses three problems facing architects of modern high-performance microprocessors (MPUs): diminishing returns from instruction-level parallelism (ILP), increasing power dissipation, and relatively constant interconnect performance. Each issue is explained and analyzed in detail, along with the combined impact of all three on future MPU designs. Finally, an analytical model for MPU performance that incorporates all three factors is derived, compared against current and past MPUs, and then used to make predictions for future microarchitectures.
Bigger, Better, Wider
The first 20 years of the modern 32-bit era of microprocessor design for computer applications could be described as a quest for bigger, better, wider (BBW) CPUs. This trend has been driven by the desire for better performance and fueled by exponential growth in transistor budgets. The first MPUs of this era, like the MC68020 and the i386, were partially pipelined scalar designs that took multiple clock cycles to execute even the simplest instructions and averaged 3 to 8 cycles per instruction (CPI). These chips packed several hundred thousand transistors onto a die of around 100 mm².
The first-generation 32-bit MPUs gave way to fully pipelined scalar designs like the i486, the MC68040, and the first-generation RISC chips. These processors could execute at burst rates of one instruction per clock cycle and averaged around 1.3 to 2.0 CPI. The CISC chips, equipped with 8 KB of integrated cache, used around 1 million transistors on a die of about 150 mm², while the RISC MPUs relied on external caches, which kept their transistor counts and die sizes to under 100,000 transistors and around 60 mm², respectively.
With several million transistors within their reach, MPU architects designed the first superscalar chips in the early 1990s. The Pentium and Alpha EV4 could execute up to two instructions per cycle, while the PowerPC 601 and SuperSPARC could execute up to three instructions per cycle. Die sizes ranged from 112 mm² for the 601 to 294 mm² for the Pentium. The transistor count was 1.7 million for the Alpha and about 3 million for the rest. Although these designs all incorporated on-chip cache, ranging from 16 KB to 36 KB, CPU control logic and datapaths still accounted for the majority of the chip area.
As transistor budgets passed the 10 million mark, architects introduced second-generation superscalar designs that could execute up to 3 (x86) or 4 (RISC) instructions per cycle. Although the first integrated L2 cache was introduced in the Alpha EV5, the CPU itself still occupied most of the die area in this generation. In many cases, the extra transistors were spent on implementing out-of-order execution (OOOE) and speculative execution. Indeed, the PA-8000 occupied a 347 mm² die without so much as a single bit of integrated cache. But the previously clear blue skies that seemed to stretch endlessly ahead of ambitious MPU architects were already starting to darken. The BBW philosophy of processor design faced three different but interrelated barriers that opposed continued progress down this road with geometrically increasing strength.