Softer Can Be Better
To improve the x86 performance of its current and future Itanium processors, Intel developed a software-based compatibility system called IA-32 Execution Layer (EL). It resembles FX!32 in that it sits above a native 64-bit operating system. But it is much closer to CMS in that it performs RAM-based translation and optimization in real time during program execution and does not retain translated code segments or program profile information between runs. A very important aspect of IA-32 EL is that it does not rely on the presence of any features associated with hardware-based x86 compatibility within an Itanium processor. This gives Intel the freedom to remove all vestiges of hardware-based x86 compatibility from future Itanium processors and instead rely on IA-32 EL to execute user-level x86 applications. Such a move on the hardware side could provide tangible benefits in terms of reduced die size, complexity, power consumption, and time to market, as well as a potential increase in processor clock frequency. The physical overhead of implementing hardware-based x86 compatibility in a 130nm (Madison) Itanium 2 is depicted below in Figure 5.
Figure 5 – Hardware x86 Compatibility Overhead in Itanium 2
IA-32 EL differs from FX!32 and CMS in that it does not use interpretation when it executes a section of x86 code for the first time. Instead it engages in what its designers call “cold code translation.” This is a low overhead, table and template driven translation of x86 code into equivalent IA-64 instruction sequences. Cold code translation is performed on one x86 basic block at a time. A basic block is a term used by compiler designers to denote a section of code with one entry point at the beginning and one exit point at the end. For typical programs, basic blocks are on average 4 to 5 x86 instructions long. In addition to whatever IA-64 instruction(s) are necessary to perform the equivalent computation, a cold translation of an x86 basic block includes extra instructions that collect information to be used for a more aggressive translation later on, so called “hot code translation”. Translated cold code is instrumented with code to implement a basic block execution counter, an edge counter for basic blocks ending with a conditional or indirect branch, and misaligned memory operation detection. The information cold code collects about its own execution is used to direct hot code translation in the same way as the execution profile information collected by the FX!32 emulator is used to direct the FX!32 translator.
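The bookkeeping that cold code performs on itself can be pictured with a small model. The following Python sketch is only an illustration of the three kinds of instrumentation described above; the class name `ColdBlockProfile`, its methods, and the natural-alignment rule used to flag misaligned accesses are assumptions made for this example, not details of Intel's implementation.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ColdBlockProfile:
    """Hypothetical profile record for one cold-translated x86 basic block."""
    x86_addr: int
    exec_count: int = 0                                   # basic block execution counter
    edge_counts: Counter = field(default_factory=Counter)  # taken-edge counter per target
    misaligned_seen: bool = False                         # set by misalignment detection

    def on_entry(self) -> None:
        # Instrumentation at the block's single entry point.
        self.exec_count += 1

    def on_memory_access(self, addr: int, size: int) -> None:
        # Assumption: an access is "misaligned" if it is not naturally aligned.
        if addr % size != 0:
            self.misaligned_seen = True

    def on_exit(self, target_addr: int) -> None:
        # Instrumentation at the block's single exit point; only meaningful
        # for blocks ending in a conditional or indirect branch.
        self.edge_counts[target_addr] += 1


# Simulate a loop body at 0x1000 that branches back to itself nine times
# and then falls through to 0x1040.
block = ColdBlockProfile(x86_addr=0x1000)
for i in range(10):
    block.on_entry()
    block.on_memory_access(addr=0x2000 + 4 * i, size=4)  # aligned 4-byte loads
    block.on_exit(0x1000 if i < 9 else 0x1040)
```

In this model the execution counter says how hot the block is, while the edge counters say which successor it usually reaches, which is the information a later trace selection step needs.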
When a basic block’s execution counter reaches a certain threshold, the instrumentation code within that block branches to a special entry point in the translator to register that block as a potential candidate for hot code translation. When either 1) the total number of cold code blocks registered reaches a certain value or 2) any given block has registered twice, the translator initiates an optimization session during which hot code translation techniques are employed. In practice about 5 to 10% of cold blocks reach the “heating threshold” and register. Hot block translation begins with the identification of a trace of x86 basic blocks that can be combined into a hyper block, a section of code with one entry point and multiple exit points. The selection of blocks is governed by the value of their execution and edge counters. Sets of blocks that are candidates for IA-64 predicate based if-conversion and loop unrolling are also identified. After a hyper block is selected, the translator goes back to the original x86 code and converts it into an intermediate language (IL) representation. The IL code then undergoes a series of x86 specific optimizations (redundant flag and effective address generation elimination, FP/SIMD register usage tracking, etc.). The translator then reorganizes the hyper block code, employing such techniques as register renaming, control speculation, and data speculation.
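The registration rule and the counter-driven choice of blocks can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual IA-32 EL algorithm: the threshold values are invented, a block is assumed to re-register each time its counter passes another multiple of the heating threshold, and the greedy "follow the hottest edge" policy in `select_trace` is just one plausible way to let edge counters govern selection. All names are hypothetical.

```python
from collections import Counter

HEAT_THRESHOLD = 1000      # assumed value; the article gives no number
SESSION_BLOCK_LIMIT = 4    # assumed value; the article gives no number


class HotCandidateRegistry:
    """Hypothetical model of how cold blocks register for hot translation."""

    def __init__(self, heat_threshold=HEAT_THRESHOLD, block_limit=SESSION_BLOCK_LIMIT):
        self.heat_threshold = heat_threshold
        self.block_limit = block_limit
        self.exec_counts = Counter()     # per-block execution counters
        self.registrations = Counter()   # how many times each block has registered

    def execute(self, block_addr):
        """Count one execution; return True if an optimization session should start."""
        self.exec_counts[block_addr] += 1
        # Assumption: a block registers whenever its counter hits a multiple
        # of the heating threshold, so a very hot block can register twice.
        if self.exec_counts[block_addr] % self.heat_threshold != 0:
            return False
        self.registrations[block_addr] += 1
        enough_blocks = len(self.registrations) >= self.block_limit   # condition 1
        registered_twice = self.registrations[block_addr] >= 2        # condition 2
        return enough_blocks or registered_twice


def select_trace(seed, edge_counts, max_blocks=8):
    """Greedily follow the most frequently taken edge, starting from `seed`.

    `edge_counts` maps a block address to a dict of {successor: taken count}.
    The trace stops at a block with no recorded successors, at a block that
    is already in the trace, or after `max_blocks` blocks.
    """
    trace = [seed]
    seen = {seed}
    current = seed
    while len(trace) < max_blocks:
        successors = edge_counts.get(current)
        if not successors:
            break
        hottest = max(successors, key=successors.get)
        if hottest in seen:
            break
        trace.append(hottest)
        seen.add(hottest)
        current = hottest
    return trace
```

With `heat_threshold=3`, for example, a single block executed six times registers twice and triggers a session through the second condition, even though no other block has registered. Side exits from the selected trace (the less frequently taken edges) are what give the resulting hyper block its multiple exit points.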
The computational overhead of hot code translation is about 20 times higher per x86 instruction than that of cold code translation, but in practice the vast majority of the processor cycles consumed running an x86 program using IA-32 EL are spent running translated code, not generating it. For example, when running the 26 applications comprising the SPEC CPU 2000 benchmark suite, about 95% of execution time is spent running hot code while only 3% is consumed by translator overhead. Using IA-32 EL, a 1 GHz, 3 MB Itanium 2 runs x86 versions of the 12 SPECint2k component benchmark applications with a performance averaging 65% of that achieved by the native IA-64 versions. In another cited example, a 1.5 GHz Itanium 2 achieves 105%, 99%, and 133% of the x86 performance of a 1.6 GHz Xeon processor when running SPECint2k, SPECfp2k, and the Internet content creation portion of Sysmark 2002 respectively. That performance level is approximately 3 times better than can be achieved using the Itanium 2’s hardware-based x86 compatibility mode. The “better than Xeon” result is simple to communicate, easy to grasp, and may be an important psychological milestone in helping to dispel the image of poor x86 compatibility mode performance that has surrounded IPF since the introduction of the first Itanium processor (Merced) in 2001.