IA64 Processor Taxonomy
The natural unit of instruction fetch, issue, and execution in an IA64 processor is the instruction bundle. Some imaginative designers within the IA64 community refer to potential classes of IA64 processors by “banger” notation, a colloquialism taken from the reciprocating internal combustion engine. An IA64 processor capable of fetching, issuing, and executing two instruction bundles (up to 6 instructions) in parallel (i.e. per clock) is termed a “two banger”, the EPIC analog of a two-cylinder engine. A four banger IA64 processor could execute up to 4 bundles or 12 instructions per cycle and so on.
The Merced/Itanium is nominally a two banger. That is, under ideal conditions it can fetch, issue, and execute two IA64 instruction bundles every clock cycle, for a potential peak instructions per clock (IPC) value of 6. In reality, Itanium falls well short of this, especially when running integer codes. One reason is a paucity of execution units. As shown in Figure 2, the Itanium processor has 2 I-units (I0, I1), 2 M-units (M0, M1), 2 F-units (F0, F1), and 3 B-units (B0, B1, B2). The six vertical lines in Figure 2 represent the six slot positions of the two instruction bundles ready for issue. The horizontal lines represent instruction feeds into each functional unit. The circles represent possible issue paths. For example, the Merced can issue an instruction from slot 1 or 2 of the first instruction bundle or slot 1 of the second instruction bundle to integer unit I0.
Figure 2 Itanium Execution Resources
If you compare the set of defined IA64 instruction bundle formats in Figure 1 with the execution resources of the Itanium processor in Figure 2, then it is obvious that most of the possible pairings of bundles cannot be issued in a single cycle. This has a major impact on Itanium performance because 1) even “no operation” (NOP) instructions of a given class must be issued to the appropriate execution unit and will stall if none are available, and 2) instructions are fetched and trundled through the Itanium instruction decoders in integral bundles. The execution capability of the Itanium is represented in Table 2 and is based on rules outlined in . It shows the number of instructions (starting at slot 0 in the first bundle) that can be issued given a specific pair of first bundle format (row) and second bundle (column).
Table 2 Itanium Instruction Issue Capability
Table 2 clearly shows that the Itanium can dual issue instruction bundles (shown by the entry “6”) for only a very limited set of format pairings. In most cases the result is a situation called a split issue. Consider the combination of an MMI format first bundle and an MII format second bundle (we will assume there is no stop at the end of the first bundle that would force a split issue on any IA64 processor). The Itanium can issue the two M type instructions and the one I type instruction of the first bundle to execution units M0, M1, and I0. But the Itanium has no third M-unit to which it could dispatch the M instruction in slot 0 of the second bundle (remember, even M-type NOPs must still be issued to an M-unit). This situation is called resource oversubscription and results in the Itanium stalling after issuing the three instructions of the first bundle. The instruction decoder can only shift out, or rotate, the first bundle and bring in one new bundle from the eight-entry prefetch queue. Because the Itanium handles instructions in bundles, falling even one instruction short of a dual bundle issue causes an instantaneous 3 IPC shortfall.
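The oversubscription argument above can be illustrated with a rough counting model. The sketch below is a simplification rather than the Itanium's actual dispersal logic: it only counts free execution units of each type (2 M, 2 I, 2 F, 3 B, as stated in the text) and ignores the port-specific issue paths shown by the circles in Figure 2, so it should not be expected to reproduce Table 2 entry for entry. The names `UNITS`, `ALIAS`, and `issued_count` are made up for this illustration.

```python
# Execution-unit counts as given in the text: 2 M-units, 2 I-units,
# 2 F-units, and 3 B-units.
UNITS = {"M": 2, "I": 2, "F": 2, "B": 3}

# The text states that an MLX bundle is handled like an MFI bundle, so this
# sketch charges the L slot to an F-unit and the X slot to an I-unit.
ALIAS = {"L": "F", "X": "I"}


def issued_count(first, second, units=UNITS):
    """Count how many slots issue, in order, before a unit type runs out.

    `first` and `second` are bundle formats such as "MMI" or "MII".
    This is a counting model only (an assumption of this sketch); the real
    processor also restricts which slot can feed which specific unit.
    """
    free = dict(units)          # remaining units of each type this cycle
    issued = 0
    for slot_type in first + second:
        unit = ALIAS.get(slot_type, slot_type)
        if free.get(unit, 0) == 0:
            break               # resource oversubscription: issue stops here
        free[unit] -= 1
        issued += 1
    return issued


if __name__ == "__main__":
    # The MMI + MII example from the text: the third M instruction finds
    # no free M-unit, so only the first bundle issues.
    print("MMI + MII:", issued_count("MMI", "MII"))
    print("MFI + MFI:", issued_count("MFI", "MFI"))
```

In this simplified model the MMI + MII pairing stops after three instructions, which matches the split issue described above, while a pairing such as MFI + MFI fits within the available units.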
Notice that an IA64 instruction bundle can have a maximum of one F type instruction. The Itanium’s two floating-point units, F0 and F1, are fully pipelined, which means it is impossible for Itanium to split issue from F-unit oversubscription (that is not to say that the whole machine won’t stall in certain cases to wait for an operand to become available). It is much more difficult to schedule integer code to dual issue on the Itanium. In fact, the Itanium handles the MLX bundle format like an MFI bundle to avoid tying up both integer units to execute a single bundle. The inherent structural imbalance in the Itanium microarchitecture is likely a major reason why the Itanium can achieve respectable performance on FP workloads, but not on integer workloads.
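The claim about F-unit oversubscription can be checked by brute force under the same kind of simplified counting model. The sketch below enumerates every ordered pair of bundle formats and records which unit type, if any, runs out first. Two things here are assumptions: the `FORMATS` list is the set of IA64 template types as commonly documented (Figure 1 is not reproduced in this text), and the model only counts units per type, ignoring the slot-to-port restrictions of Figure 2. The names `FORMATS`, `blocking_unit`, and `blockers` are hypothetical.

```python
from itertools import product

# Execution-unit counts as given in the text.
UNITS = {"M": 2, "I": 2, "F": 2, "B": 3}

# Per the text, MLX is handled like MFI: L is charged to F, X to I.
ALIAS = {"L": "F", "X": "I"}

# Assumed list of bundle formats (ignoring stop-bit variants); check it
# against Figure 1 before relying on it.
FORMATS = ["MII", "MLX", "MMI", "MFI", "MMF", "MIB", "MBB", "BBB", "MMB", "MFB"]


def blocking_unit(first, second):
    """Return the unit type that is oversubscribed first, or None.

    None means all six slots found a free unit in this counting model.
    """
    free = dict(UNITS)
    for slot_type in first + second:
        unit = ALIAS.get(slot_type, slot_type)
        if free[unit] == 0:
            return unit
        free[unit] -= 1
    return None


# Tally the first-oversubscribed unit type across all ordered pairs.
blockers = {}
for first, second in product(FORMATS, repeat=2):
    unit = blocking_unit(first, second)
    blockers[unit] = blockers.get(unit, 0) + 1

if __name__ == "__main__":
    for unit, count in blockers.items():
        label = unit if unit is not None else "no oversubscription"
        print(f"{label}: {count} pairings")
```

Because each format contains at most one F (or L) slot and there are two F-units, "F" never appears as a blocking unit in this model, whereas M, I, and B all do. That is consistent with the structural imbalance described above, though the real issue rules are stricter than this count-only sketch.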