ILP Exploitation – EPIC Adventures on the Rough Road Ahead
Not all processor architects have given up on the idea of raising uniprocessor performance by significantly increasing the amount of ILP that can be extracted from a single-threaded program. The design approach called Explicitly Parallel Instruction Computing (EPIC), espoused by Intel and Hewlett-Packard in the form of the IA-64 instruction set architecture, combines two strategies to improve performance. The first is to increase clock rate by simplifying processor hardware, which is to be done by moving more responsibility for instruction scheduling and dependency checking from hardware to the compiler. The second is to increase achievable single-thread IPC by adding a mixture of new and old features to the instruction set and hoping that compiler technology will advance far enough to use them effectively in combination.
The best known EPIC architecture, IA-64, appears to be in trouble on both parts of this two-track approach. To simplify the hardware, an IA-64 compiler is expected to explicitly schedule code and organize independent instructions into variable-length, packetized instruction groups. To achieve the desired hardware simplification and exploit the explicit instruction dependency checking performed by the compiler, an IA-64 processor is an in-order, or statically scheduled, design. That means it gives up the ability to execute instructions out of order, unlike most modern high performance RISC and CISC microprocessors. The increase in IPC from out-of-order execution varies from program to program and processor to processor, but is generally at least 30%. For an architecture intended to increase IPC, in-order design immediately puts IA-64 at a significant disadvantage.
To first overcome this disadvantage and then potentially pull ahead of superscalar RISC designs, the architects of IA-64 added numerous features (rotating registers, predication, compile-time memory disambiguation, etc.) not found in most other architectures. Full predication (conditional execution support) of the instruction set allows EPIC compilers to reorganize branch-intensive code to eliminate up to about half of all conditional branches and increase the size of basic blocks (linear sequences of instructions amenable to code optimization). This has the effect of improving IPC by about 30% relative to architectures that are not fully predicated (i.e. those possessing only a conditional move instruction). Other features provide a tool kit that allows a sufficiently clever compiler to generate code sequences that approximate certain aspects of out-of-order execution and increase IPC. The drawback of this approach is that if the compiler gets carried away using these features, code size will explode and effective IPC will drop off, both because of an increased instruction cache miss rate and because of competition for instruction issue slots between ‘working’ (program state advancement) instructions and ‘meta’ (dynamic scheduling emulation) instructions.
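To make the idea of predication concrete, here is a minimal sketch in Python of how a compiler might convert a branch into straight-line guarded code. It is a toy model, not IA-64 code: the register names, the tuple-based instruction format, and both function names are hypothetical and chosen only for illustration.

```python
def abs_diff_branchy(a, b):
    # Conventional form: one conditional branch splits the code into
    # separate basic blocks.
    if a > b:
        return a - b
    return b - a


def abs_diff_predicated(a, b):
    # Toy register file; "p1" and "p2" stand in for predicate registers
    # (hypothetical names, not taken from the article).
    regs = {"a": a, "b": b, "r": 0, "p1": False, "p2": False}

    # A single compare writes a predicate and its complement.
    regs["p1"] = regs["a"] > regs["b"]
    regs["p2"] = not regs["p1"]

    # Straight-line program: (guard predicate, destination, operation).
    program = [
        ("p1", "r", lambda r: r["a"] - r["b"]),
        ("p2", "r", lambda r: r["b"] - r["a"]),
    ]

    for guard, dest, op in program:
        value = op(regs)      # every instruction still occupies an issue slot
        if regs[guard]:       # models the hardware discarding guarded-false results
            regs[dest] = value
    return regs["r"]
```

Both functions return the same value, but the predicated version has no branch in the modeled program and forms one larger basic block. The cost is also visible: both subtractions are issued even though only one result is kept, which is the issue slot competition described above.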
The clock rate advantage promised by the EPIC design approach may also prove illusory for IA-64. For example, something as simple as general purpose register (GPR) access is rather involved. General registers gr32 through gr127 are termed stacked because access is through a stack frame that is adjusted by an autonomous piece of logic known as the register stack engine, or RSE, when procedures are entered and exited. In addition to stack frame register mapping, a programmable number of stacked GPRs can be ‘rotated’ by a special branch instruction to facilitate software pipelining of code within loops and help compensate for the lack of register renaming. These two features work in tandem, so simply accessing an IA-64 GPR appears to require 0, 1, or 2 seven-bit addition operations (depending on the logical register value and the contents of four different control register fields) to calculate the physical address. It is hard to discern whether such an arrangement will have a shorter logic propagation delay than the content addressable memory (CAM) circuit typically used to implement true register renaming in superscalar processors.
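The 0, 1, or 2 additions can be sketched as follows. This is a simplified model under stated assumptions rather than a description of any real implementation: the function and parameter names are hypothetical, the rotating region is assumed to start at gr32, the physical stacked register file is assumed to hold 96 entries, and the control state is collapsed into three values instead of the four fields mentioned above.

```python
NUM_STATIC = 32          # gr0-gr31 are not stacked
NUM_STACKED_PHYS = 96    # assumed size of the physical stacked register file


def map_gpr(logical, frame_base, rot_size, rot_base):
    """Map a logical GPR number to (physical index, additions used).

    frame_base: offset of the current stack frame in the stacked file
    rot_size:   number of rotating registers, starting at gr32 (assumed)
    rot_base:   current rotation offset within the rotating region
    All three names are hypothetical simplifications.
    """
    if not 0 <= logical < 128:
        raise ValueError("logical register out of range")

    # Static registers map directly: no addition needed.
    if logical < NUM_STATIC:
        return logical, 0

    offset = logical - NUM_STATIC   # position within the stack frame
    additions = 0

    # Registers inside the rotating region get the rotation offset added,
    # wrapping within the region.
    if offset < rot_size:
        offset = (offset + rot_base) % rot_size
        additions += 1

    # Every stacked register gets the frame base added, wrapping within
    # the physical stacked file.
    physical = NUM_STATIC + (frame_base + offset) % NUM_STACKED_PHYS
    additions += 1
    return physical, additions
```

In this model a static register costs no additions, a stacked but non-rotating register costs one, and a rotating register costs two in series, which is the dependency chain whose delay is being compared against a renaming CAM.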
The initial IA-64 implementation, Itanium (code-named Merced), should not be relied on too heavily as an indication of the clock rate potential of EPIC processors compared to superscalar RISC and CISC processors implemented in similar technology. A more accurate indication may be seen in the advance program for the 2001 IEEE ISSCC conference next month. Two presentations that appear to disclose details of McKinley (the 2nd generation IA-64 processor) and the Alpha EV7 both describe their subject as 1.2 GHz processors. The fact that McKinley appears to operate with more than a 50% faster clock rate than Merced in the same 0.18 µm process, while offering at least one third higher IPC, demonstrates that IA-64 implementations are far from mature. It is also likely that the hardware-based x86 compatibility built into Merced and McKinley will eventually be simplified to emulator traps in future IA-64 designs, which could conceivably contribute to a further small increase in achievable clock rate.
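As a rough illustration of what those two figures imply together, the sketch below uses the first-order approximation that performance scales as clock rate times IPC. That approximation and the function name are assumptions made here for illustration; the inputs are the lower bounds quoted above, so the result is a floor rather than a measurement.

```python
def relative_performance(clock_ratio, ipc_ratio):
    # First-order model: performance is proportional to clock rate x IPC.
    return clock_ratio * ipc_ratio


# Lower bounds from the text: 50% faster clock, one third higher IPC.
speedup = relative_performance(1.5, 4 / 3)
print(f"{speedup:.2f}x")  # prints 2.00x
```

Under this model, McKinley would deliver at least about twice the single-thread performance of Merced in the same process.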