Implications For Merced Integer Performance
In one clock cycle Merced can issue up to two complex integer instructions and two memory or simple integer instructions. This is similar to the capabilities of the Alpha 21264 design. In addition to four integer instructions, Merced can also issue up to two branches, which fill the remaining two slots of the two three-instruction bundles it issues per cycle. In the 21264, by contrast, a branch instruction displaces a potential integer instruction from one of the two complex integer units. A real difficulty in estimating Merced integer performance is that the diagrams Intel released don’t necessarily show all the bypass paths incorporated into the microarchitecture.
Figure 1 seems to imply a minimum three-cycle latency for instructions in distinct groups that have producer-consumer dependencies. If this is true, it would be very difficult for compilers to schedule code for the Merced in a way that avoids pipeline bubbles (dead cycles). Much will also depend on the performance lost to branch mispredictions and memory operations.
In branch prediction the 21264 seems to hold an advantage over Merced. The 21264 has a two-level tournament branch predictor that implements both a local and a global prediction scheme and dynamically chooses whichever gives the best results for the current program. The local scheme uses a 1024-entry by 10-bit history table to index a 1024-entry by 3-bit branch prediction table, while the global predictor uses a 4096-entry by 2-bit prediction table. In contrast, the Merced employs only a 512-entry by 2-bit dynamic predictor. Intel is evidently betting heavily that profile-guided setting of hint bits in branches, together with if-conversion to eliminate conditional branches, will make up for Merced’s rather unimpressive branch prediction logic.
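To make the tournament idea concrete, here is a minimal Python sketch of a predictor sized with the table dimensions quoted above. It is a toy model, not the 21264's actual logic: the chooser table, the 12-bit global history register, the initial counter values, and the way the branch address selects a local history entry are all assumptions made for illustration, since the text does not describe them.

```python
# Table sizes in bits, taken from the figures quoted in the text.
LOCAL_HISTORY_BITS = 1024 * 10    # 21264 local history table
LOCAL_COUNTER_BITS = 1024 * 3     # 21264 local prediction counters
GLOBAL_COUNTER_BITS = 4096 * 2    # 21264 global prediction counters
MERCED_BITS = 512 * 2             # Merced dynamic predictor


def bump(counter, up, top):
    """Saturating increment or decrement in the range 0..top."""
    return min(counter + 1, top) if up else max(counter - 1, 0)


class TournamentPredictor:
    """Toy tournament predictor: local and global schemes plus a chooser."""

    def __init__(self):
        self.local_history = [0] * 1024     # 10 bits of history per entry
        self.local_counters = [3] * 1024    # 3-bit counters, taken if >= 4
        self.global_counters = [1] * 4096   # 2-bit counters, taken if >= 2
        # ASSUMPTION: chooser table and 12-bit global history are not in the text.
        self.choice = [1] * 4096            # >= 2 means "trust the global scheme"
        self.global_history = 0

    def _indices(self, pc):
        slot = pc % 1024                    # ASSUMPTION: simple modulo indexing
        return slot, self.local_history[slot], self.global_history

    def predict(self, pc):
        _, local_idx, global_idx = self._indices(pc)
        local_taken = self.local_counters[local_idx] >= 4
        global_taken = self.global_counters[global_idx] >= 2
        use_global = self.choice[global_idx] >= 2
        return global_taken if use_global else local_taken

    def update(self, pc, taken):
        taken = bool(taken)
        slot, local_idx, global_idx = self._indices(pc)
        local_ok = (self.local_counters[local_idx] >= 4) == taken
        global_ok = (self.global_counters[global_idx] >= 2) == taken
        if local_ok != global_ok:           # train the chooser only on disagreement
            self.choice[global_idx] = bump(self.choice[global_idx], global_ok, 3)
        self.local_counters[local_idx] = bump(self.local_counters[local_idx], taken, 7)
        self.global_counters[global_idx] = bump(self.global_counters[global_idx], taken, 3)
        self.local_history[slot] = ((self.local_history[slot] << 1) | int(taken)) & 0x3FF
        self.global_history = ((self.global_history << 1) | int(taken)) & 0xFFF


def run(predictor, pc, outcomes):
    """Predict then train on each outcome; return the number of correct predictions."""
    correct = 0
    for outcome in outcomes:
        correct += predictor.predict(pc) == bool(outcome)
        predictor.update(pc, outcome)
    return correct
```

Counting only the tables described in the text, the 21264 predictor holds 21,504 bits of state against 1,024 bits for Merced, a factor of 21. In this toy model a branch that repeats the pattern taken, taken, taken, not taken is predicted perfectly once the history tables have warmed up, which a plain table of 2-bit counters indexed by branch address cannot do.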
For most computer programs, memory performance will depend strongly on the characteristics of the cache hierarchy. The Merced employs two levels of on-chip cache and a third level of cache off-chip. The usual rule of thumb about size ratios between levels in a hierarchical cache design suggests that the Merced’s first level caches are much smaller than the 64 Kbyte level one instruction cache and 64 Kbyte level one data cache in the 21264, while the total amount of on-chip cache in the two chips is comparable. In some ways the Merced cache scheme is similar to that used in the previous generation Alpha 21164 design, and it is known that, unsurprisingly, the 21264 enjoys an advantage over the 21164 in cache subsystem performance.
Another small but not insignificant consideration relates to addressing modes. Unlike most RISC designs, IA-64 cannot perform, as a single operation, a memory access to an effective address formed by offsetting a register pointer by a small (16 bit or less) signed constant. In IA-64 code this memory operation requires a separate add immediate instruction in an earlier instruction group to generate the effective address and store it into a temporary register. This increases code size and possibly lengthens critical paths in the code. A partially compensating factor is the ability of IA-64 memory access instructions to update a base register after use by adding a small constant to it. This feature would require a separate add immediate instruction on most RISCs.
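The trade-off can be illustrated with a rough instruction count. The Python sketch below is a deliberately simplified model rather than a description of any real compiler: the access kinds "offset" and "post_update" and both counting functions are hypothetical names introduced here, and the model ignores scheduling, register pressure, and reuse of a computed address.

```python
def risc_ops(accesses):
    """Instruction count on a generic RISC with base+displacement addressing."""
    ops = 0
    for kind, _ in accesses:
        ops += 1                  # the load or store itself; the displacement is free
        if kind == "post_update":
            ops += 1              # separate add immediate to advance the pointer
    return ops


def ia64_ops(accesses):
    """Instruction count on IA-64 as characterized in the text."""
    ops = 0
    for kind, offset in accesses:
        ops += 1                  # the load or store; post-update of the base is free
        if kind == "offset" and offset != 0:
            ops += 1              # add immediate in an earlier instruction group
    return ops


# Reading three fields of a structure at byte offsets 0, 8 and 16.
struct_fields = [("offset", 0), ("offset", 8), ("offset", 16)]

# Walking four consecutive 8-byte array elements through one pointer.
array_walk = [("post_update", 8)] * 4
```

Under these assumptions the structure access costs 3 instructions on the RISC and 5 on IA-64, while the array walk costs 8 on the RISC and 4 on IA-64. Which effect dominates depends on how much of a program's memory traffic is pointer-plus-offset access rather than sequential traversal.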
Intel is obviously very sensitive about releasing information about the Merced that would be helpful for competitive analysis. From what little was revealed at Microprocessor Forum last week, a reasonable conclusion is that the Merced is unlikely to outperform the Alpha 21264 on integer code on a clock normalized basis. This would be fine if the Merced enjoyed a large clock frequency advantage by virtue of its longer execution pipeline. But most independent analysts peg the Merced’s clock rate at 750 to 800 MHz in Intel’s 0.18 um P858 CMOS process. When Merced, a.k.a. Itanium, reaches the market late next year, Compaq’s Alpha group will already have the 21264 core shrunk into a comparable or possibly superior 0.18 um process in the form of EV68. EV67 can achieve at least 750 MHz in a 0.25 um process and it is likely that EV68 will exceed 1.0 GHz. Thus, there is very little chance that Merced will achieve Intel’s announced goal of performance supremacy for IA-64, at least for integer code.
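The arithmetic behind this conclusion is simple enough to sketch. The Python snippet below assumes the common first-order model in which delivered performance is proportional to work per clock multiplied by clock frequency; the function name is hypothetical, and the clock rates are the estimates quoted above, not measured figures.

```python
def required_per_clock_advantage(own_mhz, rival_mhz):
    """Per-clock performance ratio needed to match a rival at a different clock.

    Assumes performance is proportional to (work per clock) x (clock frequency).
    """
    return rival_mhz / own_mhz


# Merced at the high and low end of the analysts' range versus a 1.0 GHz EV68.
best_case = required_per_clock_advantage(800, 1000)
worst_case = required_per_clock_advantage(750, 1000)
```

On these figures Merced would need to do 25 to 33 percent more integer work per clock than EV68 just to draw level, whereas the analysis above suggests it is unlikely to match the 21264 core per clock at all.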