Alpha EV8 (Part 1): Simultaneous Multi-Threat


Wider Issue Superscalar: Big Pain

Although an eight-issue wide processor like the EV8 sounds like a straightforward logical progression from earlier designs, that description conceals the geometric increase in instruction issue logic complexity that accompanies increasing issue width. For example, the four-issue, in-order execution EV5 required 82 separate 31-bit comparators to perform stall and bypass control signal calculations [2]. The HP four-issue, out-of-order execution PA-8000 core uses a different control scheme, which forgoes encoding register dependencies as bit vectors and requires an incredible 3360 five-bit comparators in the instruction dispatch logic [3]. The six-issue wide EV6 core takes a divide-and-conquer approach: it decodes and register-renames fetched instructions early in the execution pipeline, then dispatches them to either an integer or a floating point (FP) instruction issue queue, as shown in Figure 1. Instructions are issued independently from the two queues to the associated integer and FP execution units.
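To get a feel for why issue logic grows faster than issue width, here is a deliberately simplified sketch. It is an assumed textbook-style model, not the actual EV5, PA-8000, or EV6 logic: it counts only the comparators needed to check each instruction's source registers against the destination registers of earlier instructions in the same issue group. The function name and the two-sources-per-instruction default are illustrative assumptions.

```python
def intra_group_comparators(issue_width, sources_per_instruction=2):
    """Count register-number comparators for dependency checks within one
    issue group, under a simplified (assumed) model.

    Instruction i (0-based) must compare each of its source registers
    against the destination register of each of the i earlier instructions.
    """
    return sum(sources_per_instruction * i for i in range(issue_width))


if __name__ == "__main__":
    for width in (2, 4, 6, 8):
        print(f"{width}-issue: {intra_group_comparators(width)} comparators")
```

In this model the count is N(N-1) for an N-wide group with two sources per instruction, so doubling the width from four to eight more than quadruples the comparators. Real designs also need bypass, stall, and queue-wakeup comparisons, which is why the figures quoted above are far larger.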


Figure 1. Organization of the EV6 Core

Thus the four-issue wide integer instruction issue queue control logic doesn’t have to worry about what the two-issue wide FP instruction issue queue logic is up to. Even with this simplification, the 20-entry integer instruction queue required 141k transistors, which occupied 10 mm² of chip area in the 0.35 µm EV6 processor [4]. To simplify the control logic further and reduce the potential for a critical timing path, the EV6 integer instruction queue generates overly conservative queue-full stalls whenever there are fewer than four free entries in the queue [5]. Compared with an exact solution (i.e., comparing the number of new instructions requesting to enqueue against the number of free entries to generate the stall condition), this simplification increases queue-full stalls by 20% but reduces integer performance by less than 1%. The benefit of the simplification (sometimes called inexact stalls) is the elimination of about 800 transistors and, more importantly, a faster implementation of a critical piece of logic that might otherwise have limited processor clock frequency. This sort of tradeoff between circuit design and microarchitectural efficiency is common in advanced processor design and no doubt is even more important in the EV8.
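The difference between the exact and inexact stall conditions can be sketched in a few lines. This is only an illustrative model under stated assumptions: the 20-entry queue size and the threshold of four come from the text, while the function names and the assumption that at most four instructions request to enqueue per cycle are mine.

```python
QUEUE_ENTRIES = 20   # integer queue size quoted in the text
MAX_ENQUEUE = 4      # assumed maximum instructions enqueued per cycle


def exact_stall(free_entries, requested):
    """Stall only when the requesting instructions truly do not fit."""
    return requested > free_entries


def inexact_stall(free_entries):
    """Conservative rule: stall whenever fewer than four entries are free,
    without looking at how many instructions are actually requesting."""
    return free_entries < MAX_ENQUEUE


def spurious_stalls():
    """List (free_entries, requested) cases where the inexact rule stalls
    although the instructions would have fit."""
    cases = []
    for free in range(QUEUE_ENTRIES + 1):
        for requested in range(1, MAX_ENQUEUE + 1):
            if inexact_stall(free) and not exact_stall(free, requested):
                cases.append((free, requested))
    return cases


if __name__ == "__main__":
    print(spurious_stalls())
```

The inexact rule never misses a real queue-full condition, because it needs only a comparison of the free count against a constant. The price is a handful of unnecessary stalls when the queue is nearly full but the incoming group is small. How often those cases occur in practice depends on the workload, which this sketch does not model.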

A quick and dirty approach to the EV8 would be to expand the EV6’s FP instruction queue issue capability from two instructions per clock to four, using the techniques employed in the EV6’s integer instruction queue. With an accompanying doubling of FP functional unit pipelines from two to four, the EV8 could fulfill the advertised capability of eight-issue width with only a modest increase in control logic. However, such an approach would do nothing to increase the integer architectural performance of the EV8 over the EV6. In addition, the extra two FP pipelines would be starved for lack of a concomitant increase in FP load/store instruction processing capability (which is mostly implemented in the integer data path). In the past, Alpha architects have always taken care to balance integer and FP performance in their designs.

For these reasons it is unlikely the EV8 designers took the easy way out, and the EV8 integer instruction issue queue will in all likelihood have greater issue capability than that of the EV6. I think a good clue to what the Alpha architects may be planning is the SMT processor described in [6]. Compared to the EV6, that design increases the number of integer units from 4 (with 2 also acting as load/store units) to 6 (with 4 also acting as load/store units) and doubles the number of FP pipelines from 2 to 4. This organization also suggests the EV8 may actually be a ten-issue wide machine with a sustained issue rate of 8 instructions per cycle (just as the EV6 is a six-issue wide machine with a sustained issue rate of 4).
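The issue-width arithmetic in the paragraph above can be written out as a small sketch. The unit counts and sustained rates are the ones quoted in the text; the EV8 entry is the speculative configuration based on [6], not a confirmed specification, and the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreConfig:
    name: str
    integer_units: int        # total integer units
    load_store_capable: int   # integer units that also act as load/store units
    fp_pipelines: int
    sustained_issue: int      # sustained instructions per cycle, as quoted

    @property
    def peak_issue_width(self):
        """Peak width: every integer unit and FP pipeline issues at once."""
        return self.integer_units + self.fp_pipelines


EV6 = CoreConfig("EV6", 4, 2, 2, sustained_issue=4)
# Speculative: modeled on the SMT processor of [6], not a published EV8 spec.
EV8_GUESS = CoreConfig("EV8 (speculative)", 6, 4, 4, sustained_issue=8)

if __name__ == "__main__":
    for core in (EV6, EV8_GUESS):
        print(f"{core.name}: peak {core.peak_issue_width}-issue, "
              f"sustained {core.sustained_issue} per cycle")
```

In both cases the peak width exceeds the sustained rate by two, which is the parallel the text draws between the six-issue EV6 and a possible ten-issue EV8.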


