Over the last several years Intel and HP have heavily promoted the EPIC processor design approach, and IA-64 in particular, as the next great step in the evolution of high-end processor design. Yet it is readily apparent that relying on ever greater exploitation of ILP to drive processor performance is a difficult path that will offer meager, hard-fought gains. It is even debatable whether EPIC is the best way to increase ILP exploitation, since the 12-year-old postulate used to justify EPIC, the idea that superscalar processors would choke on their own complexity, is demonstrably no more true today than it was in 1989. Nor is there any reason that full predication, memory disambiguation, and data speculation techniques cannot be used by superscalar RISC architectures, which would obtain the benefit of these features without the need for static scheduling or code size expansion.
TLP is also a basic source of higher performance, and it is the reason we have single computer systems with 16, 32, 64, or more processors running large-scale applications such as database management, on-line transaction processing, and simulations of physical phenomena for scientific and technical applications. CMP is the obvious approach to using TLP to increase MPU performance. SMT, however, increases the TLP exploitation of a uniprocessor MPU by modestly building on the mechanisms of speculative out-of-order execution already in place in high-end processors. For a given set of execution resources (functional units, caches, TLBs, etc.), SMT provides better single-thread and multithread performance than CMP. The disadvantage is increased design effort and time to market. The relationship between EPIC, CMP, and SMT is shown in Figure 3.
Figure 3. Relationship between EPIC, CMP, and SMT
The SMT processor can be thought of as the multi-fuel engine of computer architectures. When high levels of ILP are present in the workload, an EV8-like SMT can use its wide issue width and deep, speculative out-of-order execution to exploit ILP nearly as well as an aggressive EPIC processor. When high levels of TLP are present, the SMT can exploit them more adroitly than a CMP. In contrast, CMP cannot generally exploit high ILP content in workloads, while EPIC cannot exploit high TLP content. SMT therefore seems to be the best approach for designing a general purpose microprocessor.
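The "multi-fuel" argument can be illustrated with a deliberately crude issue-slot model. This is only a sketch under stated assumptions, not a performance prediction: the functions `smt_ipc` and `cmp_ipc`, the 8-wide issue budget, and the per-thread ILP figures are all hypothetical, and the model ignores cache contention, clock rate differences, and scheduling overhead. It assumes an SMT shares one pool of issue slots among all threads, while a CMP splits the same slots evenly among cores and pins one thread to each core.

```python
def smt_ipc(issue_width, thread_ilp):
    """Toy SMT model: every thread draws from one shared pool of issue slots."""
    return min(issue_width, sum(thread_ilp))


def cmp_ipc(issue_width, cores, thread_ilp):
    """Toy CMP model: issue slots are split evenly among cores, one thread per core."""
    if len(thread_ilp) > cores:
        raise ValueError("this toy model assumes at most one thread per core")
    per_core = issue_width / cores
    # A thread can never use more slots than its own core provides.
    return sum(min(per_core, ilp) for ilp in thread_ilp)


# Hypothetical workloads: sustainable instructions per cycle for each thread.
workloads = {
    "one high-ILP thread": [6],
    "two low-ILP threads": [1.5, 1.5],
    "mixed threads": [5, 1],
}

for name, ilp in workloads.items():
    print(f"{name}: SMT {smt_ipc(8, ilp)}, dual-core CMP {cmp_ipc(8, 2, ilp)}")
```

In this model the shared-slot design never does worse than the partitioned one, and it pulls ahead whenever a single thread could use more slots than one core offers. That is the sense in which SMT adapts to whichever kind of parallelism the workload happens to contain.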
Intel and HP argue that CMP and SMT are techniques that can eventually be applied to IA-64 processors once the ILP well runs dry. A CMP IA-64 processor may appear relatively soon because ILP-based performance increases from wider issue fall off rapidly, especially for an in-order processor. A CMP with dual 6-issue-wide IA-64 processor cores might prove superior to a single 12-issue-wide design for many applications, especially if EPIC compiler technology development stalls. On the other hand, applying SMT techniques to IA-64 appears very difficult. Not only do IA-64 implementations deliberately avoid the superscalar implementation infrastructure that SMT builds on, but the huge architected state of IA-64 (128 GPRs, 128 FP registers, etc.) means that support for extra threads would greatly increase the size and/or number of physical register files, which could hurt clock rates. Other complex elements like the register stack engine would likely have to be replicated on a per-thread basis. It is ironic that an SMT-enabled EPIC MPU would need to accrue far more hardware complexity than that which Schlansker and Rau originally sought to avoid.
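A rough back-of-the-envelope calculation shows how quickly per-thread architected state grows. The sketch below counts only the general purpose and floating point registers mentioned above. The register widths are assumptions drawn from the published architectures (64-bit IA-64 GPRs plus one NaT bit each, 82-bit IA-64 FP registers, and 32 integer plus 32 FP Alpha registers of 64 bits each), hardwired zero and constant registers are not subtracted, and the thread count of four is purely illustrative. It also ignores predicate, branch, and application registers, as well as the rename registers an out-of-order design adds on top of the architected state.

```python
def architected_bits(register_files):
    """Sum the bits of architected state for a list of (count, width_in_bits)."""
    return sum(count * width for count, width in register_files)


# Assumed register file shapes; see the caveats in the text above.
IA64_REGS = [(128, 65), (128, 82)]   # GPRs (64 bits + NaT bit), FP registers
ALPHA_REGS = [(32, 64), (32, 64)]    # integer registers, FP registers

THREADS = 4  # illustrative hardware thread count, not a product specification

for name, regs in (("IA-64", IA64_REGS), ("Alpha", ALPHA_REGS)):
    per_thread = architected_bits(regs)
    print(f"{name}: {per_thread} bits per thread, "
          f"{per_thread * THREADS} bits for {THREADS} threads")
```

Under these assumptions each additional IA-64 thread context carries several times the register state of an additional Alpha thread context, before any renaming hardware is counted.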
The Alpha EV8 is an exciting new design for several reasons. It is by far the most aggressive speculative out-of-order execution superscalar RISC processor yet proposed. It will exploit SMT, arguably the most important new development in computer architecture in the last ten years, to double its sustained throughput to 8 to 10 billion instructions per second. When the EV8 first ships (2002?) it should drop easily into Compaq’s then-existing high performance computing platforms built around the EV7, with its on-chip dual 4-channel direct Rambus memory controllers and four 6.4 GB/s interprocessor communication link channels. It is hard to imagine what other architecture or platform could come close to challenging single or multiple processor EV8 systems in raw performance. But the onus is on Compaq to execute its high-end product strategy much more effectively than it has since acquiring DEC and Alpha if this technology is to have the impact in the marketplace that it deserves.
 J. Lo et al, ‘Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading’, ACM Transactions on Computer Systems, Vol. 15, No. 3, August 1997, pp. 322-354.