Alpha EV8 (Part 2): Simultaneous Multi-Threat


Instruction Selection Strategies For SMT

I have described how the execution engine portion of an out-of-order superscalar processor implementing register renaming can be modified to support SMT operation. The big design issue with SMT is the algorithm that chooses among threads for the fetch and issue of instructions to that execution engine. A number of different schemes for 8-issue wide SMT RISC processor designs have been investigated and reported in the literature [7]. Some of these schemes are listed in Table 1.

Table 1. Example SMT Instruction Fetch Thread Priority Schemes

| Scheme | Max. Active Threads per Cycle | Max. Instructions Fetched per Thread per Cycle | Description |
|---------------|---|---|--------------------------------------------------------------------------------------------------|
| RR.1.8        | 1 | 8 | Round-robin, 1 active thread, 1 x 8 fetch                                                          |
| RR.2.4        | 2 | 4 | Round-robin, 2 active threads, 2 x 4 fetch                                                         |
| RR.2.8        | 2 | 8 | Round-robin, 2 active threads, 2 x 8 fetch                                                         |
| BRCOUNT.1.8   | 1 | 8 | Choose thread with fewest unresolved branches, 1 active thread, 1 x 8 fetch                        |
| BRCOUNT.2.8   | 2 | 8 | Choose thread with fewest unresolved branches, 2 active threads, 2 x 8 fetch                       |
| MISSCOUNT.1.8 | 1 | 8 | Choose thread with fewest outstanding Dcache misses, 1 active thread, 1 x 8 fetch                  |
| MISSCOUNT.2.8 | 2 | 8 | Choose thread with fewest outstanding Dcache misses, 2 active threads, 2 x 8 fetch                 |
| ICOUNT.1.8    | 1 | 8 | Choose thread with fewest instructions in DEC/REN/QUE pipe stages, 1 active thread, 1 x 8 fetch    |
| ICOUNT.2.8    | 2 | 8 | Choose thread with fewest instructions in DEC/REN/QUE pipe stages, 2 active threads, 2 x 8 fetch   |

The simplest scheme is termed RR.1.8: round-robin, one active thread, up to 8 instructions fetched. Each clock cycle, the processor selects, on a round-robin basis, one thread from those not currently experiencing an instruction cache (Icache) miss and uses its PC value to fetch up to 8 instructions for decoding, renaming, and entry into the integer and/or FP instruction issue queues. The Icache design is essentially unchanged from that of a conventional single-threaded 8-issue wide superscalar processor. Variants include RR.2.4 and RR.2.8, which require a dual-ported Icache to permit simultaneous access using two different thread PC values. In the latter case the Icache must also supply 16 instructions per cycle, twice the bandwidth of a single-threaded processor. The RR.2.8 scheme takes as many instructions as possible from the first thread and fills any remaining fetch slots with instructions from the second thread. The RR.1.8 scheme provides 12% better single-thread performance than RR.2.4, but RR.2.4 outperforms RR.1.8 with four active threads. Unsurprisingly, the expensive RR.2.8 scheme outperforms both RR.1.8 and RR.2.4 in both single-thread and four-thread operation.
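
The round-robin selection can be pictured as a rotating pointer over the hardware thread contexts that skips any thread stalled on an Icache miss. The following C sketch illustrates the idea for an RR.1.8-style policy; the structure, field, and function names (NUM_THREADS, thread_ctx, rr_select) are hypothetical illustrations, not actual EV8 internals.

```c
#include <stdbool.h>

#define NUM_THREADS 4   /* hypothetical number of hardware thread contexts */

struct thread_ctx {
    bool icache_miss_pending;   /* thread is stalled waiting on an Icache line fill */
    unsigned long pc;           /* next fetch address for this thread */
};

/*
 * RR.1.8-style selection: starting one past the thread chosen last cycle,
 * return the first thread not stalled on an Icache miss, or -1 if all are.
 * The chosen thread's PC is then used to fetch up to 8 instructions.
 */
int rr_select(const struct thread_ctx ctx[NUM_THREADS], int last_selected)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (last_selected + i) % NUM_THREADS;
        if (!ctx[t].icache_miss_pending)
            return t;
    }
    return -1;  /* every thread is waiting on the Icache this cycle */
}
```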

More sophisticated schemes have been devised to increase the throughput of the processor. The BRCOUNT scheme attempts to give priority to threads that are least likely to be wasting instruction slots on speculative execution. It does this by counting unresolved branch instructions in the decode (DEC) pipe stage, rename (REN) pipe stage, and instruction queues (QUE), and giving priority to the thread(s) with the smallest branch count. In practice, BRCOUNT.x.8 offers little performance advantage over RR.x.8. The MISSCOUNT scheme gives priority to the thread(s) with the fewest outstanding data cache (Dcache) misses. Like BRCOUNT, MISSCOUNT.x.8 offers little advantage over RR.x.8.
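
BRCOUNT and MISSCOUNT differ only in which per-thread event is counted before the minimum is taken. A minimal sketch of that selection step is shown below, assuming a simple per-thread counter array; the names (NUM_THREADS, min_count_select) and the tie-breaking rule are assumptions for illustration, not details of the published designs.

```c
#define NUM_THREADS 4   /* hypothetical number of hardware thread contexts */

/*
 * Generic "fewest outstanding events wins" selection, usable for both
 * BRCOUNT (count = unresolved branches in DEC/REN/QUE) and
 * MISSCOUNT (count = outstanding Dcache misses).  Ties resolve to the
 * lowest-numbered thread here; real hardware could rotate priority.
 */
int min_count_select(const unsigned count[NUM_THREADS])
{
    int best = 0;
    for (int t = 1; t < NUM_THREADS; t++) {
        if (count[t] < count[best])
            best = t;
    }
    return best;
}
```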

The ICOUNT scheme takes a more general approach to preventing the ‘clogging’ of the instruction issue queues. Priority is given to the thread(s) with the fewest instructions in the DEC, REN, and QUE pipe stages. ICOUNT keeps any one thread from filling the instruction queues and favors the threads that are moving instructions through the issue queues most efficiently. It turns out that the ICOUNT scheme is also highly effective at improving processor throughput: it outperforms the best round-robin scheme by 23% and raises throughput to as much as 5.3 IPC, compared to 2.5 for a non-SMT superscalar with similar resources (in this study: 32 KB direct-mapped Icache and Dcache, 256 KB 4-way L2 cache, and a 2 MB direct-mapped off-chip cache). In fact, ICOUNT.1.8 consistently outperforms RR.2.8.
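
ICOUNT uses the same minimum-selection step, but the quantity compared is the total number of a thread's instructions sitting in the front end and issue queues. The sketch below totals hypothetical per-stage occupancy counters for each thread; the stage names follow the DEC/REN/QUE terminology above, while the structure and function names are assumptions made for illustration.

```c
#define NUM_THREADS 4   /* hypothetical number of hardware thread contexts */

/* Hypothetical per-thread occupancy counters for the front-end stages. */
struct stage_occupancy {
    unsigned dec;   /* instructions in the decode stage         */
    unsigned ren;   /* instructions in the rename stage         */
    unsigned que;   /* instructions waiting in the issue queues */
};

/*
 * ICOUNT selection: fetch for the thread with the fewest instructions
 * in DEC + REN + QUE, i.e. the thread least likely to clog the queues.
 */
int icount_select(const struct stage_occupancy occ[NUM_THREADS])
{
    int best = 0;
    unsigned best_total = occ[0].dec + occ[0].ren + occ[0].que;

    for (int t = 1; t < NUM_THREADS; t++) {
        unsigned total = occ[t].dec + occ[t].ren + occ[t].que;
        if (total < best_total) {
            best_total = total;
            best = t;
        }
    }
    return best;
}
```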

The performance difference between ICOUNT.1.8 and ICOUNT.2.8 doesn’t appear to be significant. Given the choice between them, the EV8 designers would likely choose ICOUNT.1.8 to halve the Icache fetch bandwidth requirement and reduce the associated power consumption. Interestingly, in a more recent paper [6], Alpha architect Joel Emer and his collaborators seem to favor an ICOUNT.2.4 scheme (2 active threads, up to 4 instructions fetched per thread per cycle). At first glance this choice, to the extent that it foretells the actual EV8 fetch heuristic, seems contrary to previous claims by Compaq that the SMT capabilities of EV8 would not hurt its single-thread performance relative to an equivalent single-threaded processor. One possible explanation for this apparent contradiction is that the ICOUNT.2.4 scheme, as hypothetically implemented in EV8, could use a single thread’s PC value to access both Icache ports, permitting 8-wide instruction fetch for a single thread when appropriate. The processor organization of this hypothetical ICOUNT.2.4-based EV8 design is shown in Figure 4.


Figure 4 Hypothetical EV8 CPU Organization

Compaq claims the overall impact of adding SMT capability will be to increase the die area of the processor portion of the EV8 device by less than 10% [8]. It is harder to gauge the extra burden SMT imposes on the already considerable design and verification effort for an eight-issue wide superscalar processor, even one implementing a streamlined and prescient RISC architecture like the Alpha ISA. The potential for EV6-like schedule slips in the EV8 project seems ominously tangible if Compaq’s Alpha managers and engineers haven’t taken to heart the lessons of that unfortunate period.
