Intel’s EPIC Striptease Continues

Merced Pipeline Stages: Count Them On Your Fingers

The most significant technical information revealed was the number of pipeline stages within the Merced processor, namely ten, and some of the internal operations performed in each stage. A modern computer processes instructions by dividing the work required to execute each instruction into a number of sequential operations or steps. Each step takes one clock cycle to execute and these steps progress like an assembly line. The path taken to execute a particular class of instructions is called a pipeline and the steps are called pipeline stages. The number of pipeline stages is an important design consideration in a modern microprocessor.
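The assembly line picture can be sketched in a few lines of Python. This is an idealized illustration, not a model of Merced: it assumes every stage takes exactly one cycle and that nothing ever stalls, and the function names are invented for this example.

```python
def pipeline_schedule(num_instructions, num_stages):
    """For each clock cycle, list which instruction occupies each stage.

    Instructions are numbered from 0; None marks an empty stage.
    Idealized: one cycle per stage, no stalls, one instruction enters per cycle.
    """
    total_cycles = num_stages + num_instructions - 1
    schedule = []
    for cycle in range(total_cycles):
        row = []
        for stage in range(num_stages):
            instr = cycle - stage  # instruction that entered `stage` cycles ago
            row.append(instr if 0 <= instr < num_instructions else None)
        schedule.append(row)
    return schedule


def cycles_unpipelined(num_instructions, num_stages):
    """Each instruction finishes all of its steps before the next one starts."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions, num_stages):
    """Fill the pipe once, then complete one instruction per cycle."""
    return num_stages + num_instructions - 1


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_schedule(4, 3)):
        print(cycle, row)
    print(cycles_unpipelined(100, 10), cycles_pipelined(100, 10))
```

Under these assumptions, a long run of instructions completes at close to one per clock regardless of how many stages the pipeline has; the stage count only sets how long the pipe takes to fill.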

In general, the more pipeline stages a processor has, the faster it can be clocked, because less work has to be performed between consecutive clock edges. But there are also many drawbacks to increasing the number of pipeline stages. These include increased circuit complexity and power dissipation, greater timing overhead from the registers placed between pipeline stages, and less efficient instruction execution due to dependencies between instructions. These inefficiencies grow more severe as the number of pipeline stages increases, so there is always a point of diminishing returns where it is not worthwhile or practical to add more stages.
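A toy model makes the point of diminishing returns visible. Everything numeric here is an assumption chosen for illustration (20 ns of total logic, 0.5 ns of register overhead per stage, a 5% chance that an instruction hits a hazard, and a stall of one cycle per remaining stage); none of it describes a real processor.

```python
def cycle_time_ns(total_logic_ns, stages, latch_overhead_ns):
    """Clock period: an equal share of the logic plus fixed register overhead."""
    return total_logic_ns / stages + latch_overhead_ns


def time_per_instruction_ns(total_logic_ns, stages, latch_overhead_ns, hazard_rate):
    """Average time per instruction once stalls are counted.

    Assumption: each hazard costs (stages - 1) stall cycles, roughly what a
    full pipeline flush would cost. Real penalties vary by hazard type.
    """
    cycles_per_instruction = 1.0 + hazard_rate * (stages - 1)
    return cycle_time_ns(total_logic_ns, stages, latch_overhead_ns) * cycles_per_instruction


def best_stage_count(total_logic_ns=20.0, latch_overhead_ns=0.5,
                     hazard_rate=0.05, max_stages=60):
    """Stage count that minimizes average time per instruction in this model."""
    return min(
        range(1, max_stages + 1),
        key=lambda n: time_per_instruction_ns(
            total_logic_ns, n, latch_overhead_ns, hazard_rate),
    )


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 40, 60):
        print(n, round(time_per_instruction_ns(20.0, n, 0.5, 0.05), 3))
    print("best:", best_stage_count())
```

The clock period keeps shrinking as stages are added, but the fixed register overhead and the growing stall penalty eventually outweigh the gain, so the average time per instruction bottoms out and then rises again.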

The Merced pipeline incorporates 10 pipeline stages. On the one hand, this is fewer than the 12 integer pipeline stages found in the Intel P6 processor core design (used in every generation of Intel x86 processor since the Pentium Pro, albeit with minor additions along the way for ISA extensions like MMX and SSE). On the other hand, it is noticeably more than the 7 integer pipeline stages found in many high end RISC designs. Everything else being equal, more pipeline stages lead to higher clock rates and somewhat higher levels of performance. But things are rarely equal.

Compare the Alpha 21264 (also known as the EV6) to the Intel P6 core. In similar CMOS semiconductor processes with 0.35 um drawn feature sizes, the 21264 reached clock rates over 500 MHz with its 7 integer pipeline stages, while the P6 Klamath was limited to a little over 300 MHz even with its 12 integer pipeline stages. This disparity exists even though the Alpha EV6 core is more sophisticated than the P6 and can actually perform more integer computational work per clock cycle. The answer to this paradox is that things are rarely equal. The P6 pays a large performance penalty because it implements an older computer design, x86, which is generally known as a complex instruction set computing (CISC) design. This is a complicated and somewhat controversial subject suitable for an entire discussion on its own. So I will simply state that the P6 has to perform extra functions in order to decode and execute instructions compared to a RISC design and leave it at that.
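The figures quoted above can be turned into a rough back-of-the-envelope comparison. The sketch below rounds the clock rates down to 500 MHz and 300 MHz (the text says "over" and "a little over"), and the helper names are made up for this example.

```python
def clock_period_ns(clock_mhz):
    """Length of one clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz


def pipeline_latency_ns(clock_mhz, stages):
    """Elapsed time for one instruction to traverse every integer pipeline stage."""
    return stages * clock_period_ns(clock_mhz)


# (clock in MHz, integer pipeline stages), rounded from the figures in the text
designs = {
    "Alpha 21264, 0.35 um": (500, 7),
    "P6 Klamath, 0.35 um": (300, 12),
}

if __name__ == "__main__":
    for name, (mhz, stages) in designs.items():
        print(name,
              round(clock_period_ns(mhz), 2), "ns per cycle,",
              round(pipeline_latency_ns(mhz, stages), 1), "ns through the pipe")
```

With these rounded inputs, an instruction spends roughly 14 ns in the 21264's integer pipeline against roughly 40 ns in the P6's, which gives a sense of how much additional work the P6 spreads across its extra stages.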

What Are the Extra Three Pipeline Stages For?

In high end RISC processors, which are found in the highest performance computers you can buy today, most designers have settled on 6 or 7 integer pipeline stages. The generally acknowledged fastest general purpose microprocessor design is the Alpha 21264A (or EV67), which is the previously mentioned 21264 design moved into a more advanced 0.25 um CMOS process. Not only does this RISC design, with its 7 integer pipeline stages, reach a clock rate of 750 MHz in a garden-variety semiconductor process, it can also initiate the execution of up to four instructions every clock cycle and sustain up to 80 integer instructions “in flight” (being overlapped in execution) at any given time. The current Compaq Alpha design and its descendants represent the toughest competition for the Merced and IA-64 follow-up chips like McKinley.

Intel and HP have promoted their new IA-64 instruction set as enabling the design of very powerful microprocessors capable of issuing many instructions every clock cycle while avoiding the complicated control logic necessary in a superscalar RISC processor like the 21264. Although the Merced can issue up to six instructions per clock, its integer and memory operation execution unit resources are quite comparable to those of the 21264, yet the Merced has three more integer pipeline stages than the 21264. Let’s compare what the Merced and 21264A accomplish in each pipeline stage to better understand these differences. In Figure 1 (next page) the basic pipeline stages of the Alpha 21264 and Merced are shown.

The most obvious difference between the two pipeline designs is the extra stages used in Merced at the start and at the end of the pipeline. An obvious question that arises is to what extent these extra pipe stages are needed strictly for IA-64 instruction execution. Or do they represent a continued x86 burden in the form of the IA-32 compatibility function supported in Merced’s microarchitecture?

Another interesting observation is that the Merced requires three pipe stages for register renaming (basically the addition of three 7-bit values) and bypassed register read. In the 21264, register renaming (which is a true table lookup) and bypassed register read can be accomplished in two (albeit disjointed) pipeline stages. Are physical access time limitations associated with a huge 128-entry, fourteen-ported register file coming back to haunt Intel? Fred Pollack of Intel did let slip a comment that Merced ended up with more pipeline stages than had been planned earlier in the project. The pipeline design of McKinley, the follow-up IA-64 implementation spearheaded by highly competent veteran PA-RISC designers from HP, should be quite revealing on some of these issues.
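To make the contrast between the two renaming styles concrete, here is a deliberately simplified sketch. It is not the actual Merced or 21264 logic: which three 7-bit values Merced adds is not stated here, so `base_a` and `base_b` are hypothetical placeholders, the wrap over all 128 entries is an assumption, and the table sizes in the usage lines are arbitrary.

```python
MASK_7BIT = 0x7F  # 7-bit register numbers address 128 entries


def rename_by_addition(virtual_reg, base_a, base_b):
    """Adder-style rename: sum three 7-bit values and wrap to 7 bits.

    base_a and base_b are hypothetical offsets; the point is only that the
    physical register number comes out of arithmetic, not a stored mapping.
    """
    return (virtual_reg + base_a + base_b) & MASK_7BIT


class MapTableRenamer:
    """Table-lookup rename in the general style of an out-of-order machine.

    Simplified: registers are never returned to the free list, and
    allocate() raises IndexError once the free list is exhausted.
    """

    def __init__(self, num_arch_regs, num_phys_regs):
        # Initially architectural register i lives in physical register i.
        self.map_table = list(range(num_arch_regs))
        self.free_list = list(range(num_arch_regs, num_phys_regs))

    def read(self, arch_reg):
        """Source operand: look up the current physical register."""
        return self.map_table[arch_reg]

    def allocate(self, arch_reg):
        """Destination operand: take a free physical register and remap."""
        phys = self.free_list.pop(0)
        self.map_table[arch_reg] = phys
        return phys


if __name__ == "__main__":
    print(rename_by_addition(100, 20, 10))
    renamer = MapTableRenamer(num_arch_regs=4, num_phys_regs=8)
    print(renamer.read(2), renamer.allocate(2), renamer.read(2))
```

The adder version needs no stored state beyond the offsets, which is why it looks cheap on paper; the table version has to keep a mapping and a free list up to date for every instruction, which is why it is described above as a true table lookup.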
