Alpha EV8 (Part 3): Simultaneous Multi-Threat


TLP Superscalar versus EPIC: Who Owns the Future?

It is interesting to speculate how well the thread-level approaches to exploiting parallelism (SMT and CMP) will compare to instruction-level approaches like EPIC. It is widely acknowledged that the low-hanging fruit in ILP exploitation has already been picked, and further gains will be painful and slow. Even then, it is uncertain whether EPIC’s statically scheduled, VLIW-like approach will prove effective. Research into the architectural concept that later became known as EPIC started at HP Labs in 1989 [7]. At that time superscalar processors were just beginning to be widely developed, and the implementation mechanisms necessary for out-of-order execution were still immature. The reasoning behind EPIC was a projection that superscalar processors would eventually choke on their own complexity. Here is the rationale in the researchers’ own words:

“We came to two conclusions, one obvious, the other quite controversial (at least at that time). Firstly, it was quite evident from Moore’s Law that by 1998 or thereabouts it would be possible to fit an entire ILP processor, with a high level of parallelism, on a single die. Secondly, we believed that the ever increasing complexity of superscalar processors would have a negative impact on their clock rate, eventually leading to a leveling off of the rate of increase in microprocessor performance.

Although the latter claim is one that is contested even today by proponents of the superscalar approach, it was, nevertheless, what we believed back in 1989. And it was this conviction that gave us the impetus to look for an alternative style of architecture that would permit high levels of ILP with reduced hardware complexity. In particular we wanted to avoid having to resort to the use of out-of-order execution, an elegant but complex technique for achieving ILP that was first implemented commercially in the IBM System/360 Model 91 and which is almost universally employed by all high-end superscalar microprocessors today.”

Note the first sentence in the second paragraph. Perhaps Schlansker and Rau are no longer as convinced of the driving motivation behind EPIC as they once were. Since 1989 there have been extraordinary leaps in both the complexity and the clock rates of dynamically scheduled superscalar processors. Today we have the spectacle of the 42 million transistor Pentium 4 (supporting the overlapped, out-of-order execution of over a hundred in-flight instructions) shipping in volume at 1.5 GHz, while in the same 0.18 µm aluminum process the 25 million transistor Merced/Itanium struggles to meet its 733 MHz target frequency. The McKinley’s reported 1.2 GHz clock rate will likely be surpassed by the POWER4, EV68, and EV7, as well as the Pentium 4.

It appears that three divergent paths are being taken from the current four- and six-issue superscalar RISC processors that dominate the high-end MPU market. The first leads towards CMP and is best illustrated by the impressive POWER4. The second is towards ILP-focused designs like IA-64. The third path leads to SMT, whose archetype is the ambitious Alpha EV8. CMP can be viewed as an exercise in cost reduction of SMP systems through higher levels of integration, so I will focus on SMT versus EPIC. Figure 2 illustrates the individual contributions of the input factors to my estimate of the impact of moving to EPIC and SMT on clock frequency (ignoring process shrinks) and IPC, relative to current high-end superscalar MPUs.

Figure 2 Rough Estimate of Clock Rate and IPC Impact of SMT and EPIC

It is unlikely that SMT processors will clock as fast as contemporary superscalar designs in the same process technology, because of the added complexity of wider issue and of the logic needed to manage the overlapped selection, issue, and retirement of instructions from multiple threads, so I reduced the SMT clock rate potential by 10%. I assign EPIC a 10% frequency boost from in-order execution, which is then eroded by the complexity of GPR access, bundle decoding, and instruction dispersal in a 12-issue design, and by x86 compatibility in hardware. This also yields a small net frequency penalty compared to current superscalar processors. With or without x86 compatibility in hardware, it is likely that EPIC processors will eventually enjoy a small clock rate advantage over SMT processors, although this could easily be obscured by differences in inherent process speed or implementation quality (e.g. proprietary clock distribution techniques, CAD tools, etc.).

It is more difficult to estimate the IPC achievable by SMT and EPIC processors. This is partially due to the variability of SMT performance across different workload classes (multiprogrammed, multithreaded, parallelized single-threaded applications, etc.) and to EPIC’s reliance on highly sophisticated compiler technology that is still under development. I assigned the same 25% IPC improvement factor for wider issue to both the 8-issue SMT and the 12-issue EPIC, because the greater potential ILP of the latter must be balanced against greater slot usage by instructions that do not advance the program state. These include 1) NOPs introduced by bundle packing, template mismatches, and implementation-dependent structural hazards, 2) predicated instructions that are issued and later squashed, and 3) ‘meta’ instructions inserted into the code stream to explicitly invoke speculative execution. It is hard to estimate the effect of the speculative features built into the IA-64 architecture, because exploiting them effectively relies heavily on an intelligent compiler (as well as run-time profiling data).

Based on the overall estimated clock rate and IPC differences, I would expect a 12-issue wide IA-64 processor (Madison?) to achieve somewhat higher IPC on average (10 to 15%) than an EV8-like SMT processor on single-threaded applications. Combined with a clock rate advantage of 5 to 10%, that is a potential 15 to 25% performance advantage. This EPIC advantage will likely disappear for single-threaded applications that are amenable to automatic, compiler-driven parallelization techniques. In such cases the SMT might see a 30 to 50% IPC advantage and a 20 to 45% performance advantage. For multithreaded and multiprogrammed workloads the SMT will truly shine, offering perhaps 60 to 90% higher instruction throughput per cycle and 50 to 85% higher absolute instruction throughput.
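The arithmetic behind these combined figures is simply multiplicative: performance scales as (1 + IPC delta) × (1 + clock delta). A minimal sketch, using only the rough percentage estimates quoted above (these are speculative inputs, not measurements):

```python
def combined(ipc_delta: float, clock_delta: float) -> float:
    """Net performance change when relative IPC and clock factors multiply."""
    return (1 + ipc_delta) * (1 + clock_delta) - 1

# EPIC vs. SMT on single-threaded code: +10..15% IPC and +5..10% clock.
print(combined(0.10, 0.05))  # ~0.155 -> ~15% performance advantage
print(combined(0.15, 0.10))  # ~0.265 -> ~25% performance advantage

# SMT vs. EPIC on parallelizable single-threaded code: +30..50% IPC for
# the SMT, which gives back EPIC's 5..10% clock edge (so divide it out).
print(combined(0.30, 1 / 1.10 - 1))  # ~0.18 -> ~20% performance advantage
print(combined(0.50, 1 / 1.05 - 1))  # ~0.43 -> ~45% performance advantage
```

Note that the SMT ranges quoted in the text (20 to 45%, 50 to 85%) are the rounded endpoints of exactly this kind of calculation.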

