Debunking the Rehabilitation of x86
Given the growing economic importance and reach of x86 processors in computing, it has become fashionable among some in the industry to try to rehabilitate the image of the x86 instruction set architecture and minimize its architectural shortcomings. I am inclined to think of these revisionists as victims of a kind of technological Stockholm Syndrome. With advances in computer architecture and processor implementation increasingly held hostage by x86, through an insidious combination of software-based network effects and semiconductor manufacturing economies of scale, the afflicted, who often hail from the software development community, inexplicably defend and even express affection for x86.
A powerful, obvious, and simplistic argument for the x86 architecture is that "it delivers". For several years now, x86 MPUs have held a more or less continual leadership position in general purpose computing performance as measured by the SPECint 2000 benchmark suite. And for about the same period, the fastest x86 MPUs have been well within a factor of two of the performance leader in technical computing performance as measured by SPECfp 2000. In an era of CPUs implemented with millions of transistors, the nature of the ISA is no longer as strong a determinant of MPU performance and cost as it was in the late 1980s. Architecture has been eclipsed by the sophistication and feature size of the semiconductor process used to manufacture the device, and sometimes even by the degree of design effort and sophistication that goes into the implementation. The effect of process feature size disparity on the growth of relative x86 performance over the last decade is shown in Figure 1.
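To put a rough number on the process argument, here is a small sketch of how a process lead translates into clock frequency. The 0.7x shrink factor and the ~1.4x frequency gain per generation are my assumptions from classic constant-field (Dennard) scaling rules of thumb, not figures taken from the article:

```python
# Assumption (not from the article): under classic constant-field scaling,
# each process generation shrinks linear feature size by ~0.7x and raises
# achievable clock frequency by roughly the inverse, ~1.4x.
SHRINK_PER_GENERATION = 0.7

def frequency_advantage(generations_ahead: int) -> float:
    """Approximate clock-speed multiplier conferred by a process lead."""
    return (1.0 / SHRINK_PER_GENERATION) ** generations_ahead

# A two-generation lead alone is worth roughly a 2x clock advantage,
# independent of any merit (or demerit) of the ISA being implemented.
print(f"one generation ahead:  ~{frequency_advantage(1):.2f}x")
print(f"two generations ahead: ~{frequency_advantage(2):.2f}x")
```

Under these assumptions, the two-generation lead described below for x86 over the fastest RISC families would by itself account for roughly a doubling of clock frequency.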
Figure 1 Relative x86/RISC Performance vs Process Technology
Figure 1 clearly illustrates how x86 and RISC processors went from approximate parity in semiconductor process technology ten years ago to a situation today where x86 will soon be two complete process generations ahead of the two traditionally fastest RISC processor families, and one process generation ahead of the heir apparent RISC standard bearer in the server market.
There is a second significant effect that magnifies the CPU performance of top end x86 desktop processors relative to RISC processors. It stems from speed bin granularity and timing margins. High end RISC processors have primarily targeted server class hardware since the Pentium Pro appeared eight years ago. In server hardware, the primary requirement of the processor is rock stable reliability rather than achieving the last 10% of performance. Design timing guard bands are set very conservatively, and individual processor/system configurations are characterized in time consuming and expensive verification exercises. Because of this expense and effort, each generation of server processor in a given process technology is typically commercialized in only two or three speed grades. In contrast, clock speed is an important price differentiating factor for desktop processors and the systems they go into. The desktop version of the Intel Northwood Pentium 4 processor is currently available in seven different speed grades. The effect of speed binning granularity on top processor performance is shown in Figure 2.
Figure 2 Effect of Speed Binning Granularity
The same yield and frequency distribution is divided into 2 and 7 speed grades in the top and bottom curves respectively in Figure 2. In both cases the speed grade frequencies have been selected to yield an approximately equal number of parts in each bin. What is obvious from Figure 2 is that offering a larger number of speed grades creates a significantly higher top bin frequency (F7 in the bottom curve versus F2 in the top). If Intel had productized the Northwood Pentium 4 device exclusively for server applications, it is very unlikely that the top speed grade would have greatly exceeded 2.5 GHz.
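The binning effect in Figure 2 is easy to reproduce with a toy model. The sketch below assumes a hypothetical normal distribution of per-die maximum frequencies (the mean and spread are illustrative, not measured data); each bin's rated frequency is the slowest part in that bin, so every shipped part meets its rating:

```python
import random

random.seed(42)

# Hypothetical model, not measured data: each die's maximum stable
# frequency drawn from a normal distribution around 2.4 GHz.
fmax = sorted(random.gauss(2.4, 0.2) for _ in range(10_000))

def bin_frequencies(fmax_sorted, n_bins):
    """Split a sorted fmax population into n_bins equal-count bins.
    A bin's rated frequency is the lowest fmax in that bin, so every
    part in the bin runs reliably at its rated speed."""
    n = len(fmax_sorted)
    return [fmax_sorted[i * n // n_bins] for i in range(n_bins)]

two_grades = bin_frequencies(fmax, 2)
seven_grades = bin_frequencies(fmax, 7)

# With 7 grades the top bin is set by the fastest ~1/7th of parts,
# versus the fastest half of parts with only 2 grades.
print(f"2 grades:  top bin at {two_grades[-1]:.2f} GHz")
print(f"7 grades:  top bin at {seven_grades[-1]:.2f} GHz")
```

The top grade with seven bins always lands further out on the tail of the distribution than the top grade with two, which is exactly the F7-versus-F2 gap the figure depicts.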
Another important factor is speed tuning. A server processor design in a given process may never change after it is released to production. The expense and development lead time of verifying and recharacterizing even a trivial design change mean that such an exercise would be undertaken only in the direst circumstances, such as to fix a major bug. In contrast, x86 desktop processors undergo an unending series of re-spins (modifications) for minor design improvements, yield enhancements, and speed tuning throughout their lifetimes.
All these different effects work in combination to inflate the performance of desktop x86 processors in a given process technology relative to RISC processors designed for server applications. For example, the fastest speed grade of the Alpha EV68 server processor is 1.25 GHz. If the EV68 were manufactured and sold as a high volume desktop processor, with a plethora of speed grades and an aggressive process of continuous improvement over its lifetime, it would likely top out in excess of 1.6 GHz. Such a device would greatly exceed the fastest 0.18 um x86 MPU in native performance using a CPU with about half the complexity and size. Unnecessary complexity and the loss of significant performance potential are the penalties paid for sticking with a mediocre and long obsolete ISA, even if these shortcomings are rendered theoretical by current business and economic realities.
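As a back-of-envelope check using only the numbers above, the hypothesized desktop-style EV68 implies a specific combined uplift from finer speed binning plus continuous speed tuning:

```python
# The two figures below are the article's own: the fastest shipping
# EV68 server grade, and the hypothetical desktop-binned estimate.
server_top_ghz = 1.25
desktop_estimate_ghz = 1.6

uplift = desktop_estimate_ghz / server_top_ghz
print(f"implied combined frequency uplift: {uplift:.0%}")
```

A combined uplift of roughly 28% is plausibly consistent with the F2-to-F7 gap sketched in Figure 2 plus a few re-spins' worth of speed tuning.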