Itanium: Super Alloy or Toxic Waste?
It seems hard to believe but the most widely endorsed and adopted 64-bit architecture for future systems is an unproven and controversial design whose troubled first implementation is three years late to market. The Intel Merced/Itanium, the first impression of the enormously complex IA-64 instruction set architecture to be set down in silicon, is an example of how technological issues sometimes matter little in the face of powerful vested business interests and alliances.
The basic underlying idea of IA-64, which its creators call EPIC (Explicitly Parallel Instruction Computing), goes back nearly 11 years to a research project started at HP Labs. At the time, the first superscalar processors were being designed and a lot of effort was being expended to understand how to design out-of-order execution processors for the next generation to follow. It is quite ironic that the thinking that led to the hideously complex IA-64 architecture originated as a retreat to the keep-it-simple-stupid (KISS) design principles of the early RISC era in reaction to the daunting challenges faced by superscalar pioneers. EPIC proponents were seduced by the siren call of using Very Long Instruction Word (VLIW) like techniques to build very wide issue processors using minimal control logic. There is no free lunch, however, and the downside to EPIC is its reliance on the compiler to be practically clairvoyant in its ability to predict the optimal instruction scheduling strategy. No one has yet coded an algorithm to predict the future, so the general compiler strategy is actually to generate code that runs as fast as possible for the execution path that is predicted, at compile time, to be the most likely at run time. The compiler also has to generate code to check for when these assumptions made at compile time fail, and patch up the computational state sufficiently to generate the correct results, albeit more slowly.
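The speculate, check and patch-up strategy described above can be illustrated with a toy model. This is a minimal Python sketch, not IA-64 code: the `Memory` class, the `POISON` marker and the function names are all hypothetical, and the "fault" is assumed to be one that a normal load can service (for example, a page that is not yet resident). The compiler-style version hoists the load above the guard so it can overlap with other work, then checks whether the early load succeeded before using the value.

```python
# Toy model of compile-time speculation with a run-time check and recovery.
# Every name here is hypothetical; this is an illustration, not IA-64 code.

POISON = object()  # stands in for a "deferred fault" marker on a register


class Memory:
    def __init__(self, cells, resident):
        self.cells = dict(cells)
        self.resident = set(resident)  # addresses readable without a fault

    def load(self, address):
        """Normal load: takes the slow fault path if needed, then reads."""
        if address not in self.resident:
            self.resident.add(address)  # pretend the fault was serviced
        return self.cells[address]

    def speculative_load(self, address):
        """Early load: never faults; returns POISON if a fault would occur."""
        if address not in self.resident or address not in self.cells:
            return POISON
        return self.cells[address]


def original_order(memory, address, guard):
    """Source order: the load happens only if the guard holds."""
    if guard:
        return memory.load(address) + 1
    return 0


def speculated_order(memory, address, guard, log):
    """Compiler-scheduled order: load hoisted above the guard, then checked."""
    value = memory.speculative_load(address)  # issued before the branch
    if guard:
        if value is POISON:                   # check: did the guess fail?
            log.append("recovery")
            value = memory.load(address)      # recovery: redo the load
        return value + 1
    return 0                                  # a poisoned value is never used
```

Both versions return the same result for every input; the speculated version is simply faster when the guess holds and slower (it loads twice) when the recovery path runs. When the guard is false, a failed early load is silently discarded, which is what allows the compiler to move the load above the branch in the first place.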
The comparison between EPIC designs like IA-64 and dynamically scheduled superscalar processors (CISC or RISC) is in many ways similar to that between the centrally planned, command-driven economies of the old Soviet era and laissez-faire capitalism. With the self-assured arrogance of faceless central planners working on their next five-year plan, EPIC designers assumed that their clairvoyant compilers, combined with their wide issue, high clock rate but inflexible processor hardware, would be good enough to overcome the more dynamic and adaptive CPUs of their competitors. The hardware of dynamically scheduled processors may not have the time, resources or instruction search width available to an EPIC compiler to search out potential opportunities for instruction level parallelism (ILP). But it has one huge advantage – the ability and opportunity to adapt in real time to unexpected changes in program and data flow during execution arising from external factors (cache or TLB misses, interrupts etc.) or unusual program input combinations.
Just as a five-year economic plan cannot predict a massive crop failure in year four and be prepared to quickly take corrective measures, an EPIC processor cannot predict which load operation will miss in every level of the cache hierarchy and freeze the entire instruction execution pipeline for hundreds of clock cycles. A free market economy reacts to a crop failure by increasing the price of the commodity affected, which causes new suppliers or substitutes to be attracted by the opportunity. Similarly, a dynamically scheduled processor will react to a cache miss by initiating the necessary memory operation and using the opportunity to execute non-dependent instructions until either these run out or a re-ordering hardware resource, such as rename registers, is exhausted.
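A rough feel for this difference can be had from a toy cycle-count model. The Python sketch below is an illustration under loudly simplified assumptions, not a model of any real processor: both machines issue at most one instruction per cycle, the in-order machine stalls when an instruction needs a result that is not ready, and the out-of-order machine may pick any ready instruction among the oldest `window` unretired ones (the window stands in for finite re-ordering resources such as rename registers). The function names and the instruction encoding `(latency, [indices of producers])` are hypothetical.

```python
# Toy cycle-count model: in-order stall vs. out-of-order issue with a window.
# A program is a list of (latency, deps); deps are indices of earlier
# instructions whose results are needed. All names are illustrative.

def in_order_cycles(program):
    """Single-issue, in-order: wait whenever an operand is not ready yet."""
    finish = []
    clock = 0
    for latency, deps in program:
        start = max([clock] + [finish[d] for d in deps])
        finish.append(start + latency)
        clock = start + 1  # the next instruction cannot issue before this one
    return max(finish, default=0)


def out_of_order_cycles(program, window):
    """Single-issue, out-of-order: issue the oldest ready instruction among
    the `window` oldest unretired ones; retire strictly in program order."""
    n = len(program)
    finish = [None] * n
    head = 0   # oldest instruction not yet retired
    clock = 0
    while head < n:
        # Retire finished instructions from the head, in order.
        while head < n and finish[head] is not None and finish[head] <= clock:
            head += 1
        # Issue at most one ready instruction from inside the window.
        for i in range(head, min(head + window, n)):
            latency, deps = program[i]
            ready = all(finish[d] is not None and finish[d] <= clock
                        for d in deps)
            if finish[i] is None and ready:
                finish[i] = clock + latency
                break
        clock += 1
    return max(finish, default=0)


# A 200-cycle load miss, one instruction that uses the loaded value,
# then 50 independent single-cycle instructions.
program = [(200, []), (1, [0])] + [(1, []) for _ in range(50)]
print(in_order_cycles(program))           # 251
print(out_of_order_cycles(program, 64))   # 201
print(out_of_order_cycles(program, 8))    # 245
```

In this model the in-order machine pays for the miss and then for all the independent work, the wide-window machine hides the independent work entirely under the miss, and the narrow-window machine hides only the handful of instructions it can hold before its re-ordering resources are exhausted.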
To their credit, the creators of EPIC recognized the limitations of compile time prognostication and attempted to cover their assumptions with a variety of ad hoc architectural features that the compiler could employ to obtain some of the benefits of dynamically scheduled code execution under specific and limited circumstances. For example, rotating registers provide some of the benefit of true register renaming in avoiding the debilitating effects of false register dependencies in the code body of loops. Speculative loads provide some limited ability to overlap a potentially long latency memory access with other instruction execution by allowing the compiler to advance the load beyond control dependencies. EPIC designers also recognized that they would rely heavily on compiler optimization driven by run-time profiling data, and built in the ability for the compiler to flip ‘bias’ flags in individual instructions as a hint to the otherwise inflexible EPIC hardware as to the optimum execution strategy to follow. IA-64 compilers can control how individual conditional branch instructions will be handled by hardware: whether dynamic branch prediction resources should be expended trying to predict that branch, or whether the hardware should just statically assume the branch is always taken or never taken.
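The rotating register idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the IA-64 mechanism in detail: the `RotatingRegisters` class and `pipelined_double` function are hypothetical names, rotation is triggered by an explicit `rotate()` call at the end of each loop iteration, and a value written to logical register n is assumed to become visible as logical register n + 1 after one rotation. The point is that every iteration writes the same logical register yet lands in a fresh physical one, so a slow "load" from an earlier iteration is not overwritten before it is consumed.

```python
# Toy model of rotating registers in a software-pipelined loop.
# Names and details are illustrative assumptions, not the IA-64 definition.

class RotatingRegisters:
    def __init__(self, size):
        self.physical = [None] * size
        self.base = 0  # rotation offset, adjusted once per loop iteration

    def _slot(self, logical):
        return (logical + self.base) % len(self.physical)

    def write(self, logical, value):
        self.physical[self._slot(logical)] = value

    def read(self, logical):
        return self.physical[self._slot(logical)]

    def rotate(self):
        """After rotating, what was logical register n is read as n + 1."""
        self.base = (self.base - 1) % len(self.physical)


def pipelined_double(xs, latency=2):
    """Load in one iteration, consume the value `latency` iterations later."""
    regs = RotatingRegisters(size=latency + 1)
    out = []
    for i in range(len(xs) + latency):
        if i < len(xs):
            regs.write(0, xs[i])                # stage 1: "load" into r0
        if i >= latency:
            out.append(2 * regs.read(latency))  # stage 2: use the older load
        regs.rotate()                           # end of iteration
    return out


print(pipelined_double([1, 2, 3]))  # [2, 4, 6]
```

Without rotation, the compiler would have to unroll the loop and assign a different register name by hand to each in-flight load; with it, the loop body stays the same in every iteration, which is the limited form of renaming the paragraph above refers to.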
The performance vs. cost trade-off of EPIC processors, relative to dynamically scheduled superscalar RISC processors, is still not known with certainty and probably varies from application to application (e.g. embedded controller vs. technical workstation), and over time with the inexorable advancement of semiconductor technology. What is quite obvious is that one of the chief benefits of the EPIC design philosophy, hardware simplicity, is largely going to elude IA-64 implementations. The IA-64 ISA is a product of a joint design committee consisting of technical staff from both Intel and Hewlett Packard. And it shows.
One defining decision of this committee was to include the entire x86 instruction set within the architecture in the form of a hardware-based compatibility mode. The large disparity in complexity between IA-64 and existing 64-bit instruction set architectures is revealed by the implementation technology and characteristics of the first implementations of these architectures, shown in Table 2.
|0.18 um||>300 (est)||25.4|
Table 2: Characteristics of Initial Implementations of Various 64-bit ISAs