Quest for ILP
The first sign that the party was over was diminishing returns from wider and wider superscalar designs. As CPUs went from being capable of executing 1 to 2 to 4 to even 6 instructions per cycle, the percentage of cycles during which they actually hit their full potential dropped rapidly as a function of both increasing width and increasing clock rate. Execution efficiency (actual instruction execution rate divided by peak execution rate) dropped with increasing superscalar issue width because the amount of instruction-level parallelism (ILP) in most programs is limited. Although expensive hardware-based techniques like OOOE and speculative execution help somewhat, they are far from being the panaceas they are often thought to be. These techniques also require sophisticated compilers to perform code generation and scheduling specific to individual processors to achieve full benefit. The ILP barrier is a major reason that high-end x86 MPUs went from fully pipelined scalar designs to 2-way superscalar in three years and then to 3-way superscalar in another three years, but have been stuck at 3-way superscalar issue for the last nine years. The Pentium 4 design is arguably a deliberate move to narrower effective issue width in a quest for higher performance through faster clock rate.
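The relationship between issue width and execution efficiency can be illustrated with a toy model. The sketch below is not a model of any real processor: it assumes unit instruction latency, perfect fetch and branch prediction, and a made-up instruction stream of three interleaved dependency chains, so that at most three instructions are ever independent. The function names and the example stream are assumptions introduced purely for illustration.

```python
def cycles_to_execute(deps, width):
    """Count cycles needed by an idealized machine of the given issue width.

    deps[i] lists the earlier instructions that instruction i depends on.
    Each cycle, up to `width` instructions whose dependencies issued in an
    earlier cycle are issued, oldest first (unit latency is assumed).
    """
    issued_in = {}                      # instruction index -> issue cycle
    pending = list(range(len(deps)))
    cycle = 0
    while pending:
        cycle += 1
        issued = []
        for i in pending:
            if len(issued) == width:
                break
            # Ready only if every producer issued in a previous cycle.
            if all(d in issued_in for d in deps[i]):
                issued.append(i)
        for i in issued:
            issued_in[i] = cycle
        pending = [i for i in pending if i not in issued_in]
    return cycle


def execution_efficiency(deps, width):
    """Actual instructions per cycle divided by peak (the issue width)."""
    actual_ipc = len(deps) / cycles_to_execute(deps, width)
    return actual_ipc / width


# Hypothetical stream: 12 instructions, each depending on the one three
# slots earlier, i.e. three interleaved chains of four instructions.
stream = [[i - 3] if i >= 3 else [] for i in range(12)]

if __name__ == "__main__":
    for width in (1, 2, 3, 4, 6):
        eff = execution_efficiency(stream, width)
        print(f"width {width}: efficiency {eff:.2f}")
```

In this contrived stream, efficiency stays at 100% up to 3-way issue and then falls to 75% at 4-way and 50% at 6-way, because the extra issue slots have nothing independent to execute. Real programs have far messier dependency structure, but the qualitative effect is the one described above.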
It should be pointed out that, everything else being equal, a wider-issue processor will still generally provide higher performance, even if the increase is unevenly distributed over the space of all applications that users run on a given architecture. If diminishing returns from ILP were the only problem facing MPU architects in going to wider-issue superscalar designs, it is likely that the BBW bandwagon would still be going strong. This is especially true with the advent of thread-level parallelism (TLP) based techniques, such as SMT, that soak up instruction issue and execution slots that would otherwise go unused in a wide processor executing a program with little available ILP to extract. After all, if the effect of Moore’s Law provides geometrically increasing numbers of transistors to throw at problems, surely concerns about wastefulness and efficiency are misplaced? Unfortunately, transistors are real physical artifacts that consume power and communicate with each other over wires with non-zero delay and finite bandwidth. This leads us to the two more serious, and in the end fatal, barriers facing the BBW approach.