Hardware Requirements for SMT
Compared to a conventional out-of-order execution superscalar processor like the EV6, the following hardware changes are necessary to support SMT operation:
- Multiple program counters (PCs), and the capacity to select one or more of them to direct instruction fetch each clock cycle.
- Association of a thread identifier with each instruction fetched to distinguish different threads for the purpose of branch prediction, branch target buffering, and register renaming.
- A per-thread capacity to retire, flush, and trap instructions.
- A per-thread stack for prediction of subroutine return addresses.
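As a rough illustration of the per-thread state listed above, the sketch below models each thread context with its own PC and return-address stack, and tags every fetched instruction with a thread identifier. The names (`ThreadContext`, `select_fetch_thread`, `fetch`) and the round-robin selection policy are assumptions made for illustration only; the article does not say how the EV8 chooses which PC to fetch from.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadContext:
    thread_id: int
    pc: int
    # Per-thread stack used to predict subroutine return addresses.
    return_stack: list = field(default_factory=list)


def select_fetch_thread(contexts, cycle):
    # Hypothetical policy: plain round-robin over the thread contexts.
    return contexts[cycle % len(contexts)]


def fetch(contexts, cycle, width=4, insn_bytes=4):
    """Fetch `width` sequential instructions from one thread this cycle.

    Each fetched instruction is returned as (thread_id, address) so that
    later stages can tell the threads apart.
    """
    ctx = select_fetch_thread(contexts, cycle)
    fetched = [(ctx.thread_id, ctx.pc + i * insn_bytes) for i in range(width)]
    ctx.pc += width * insn_bytes
    return fetched


contexts = [ThreadContext(0, 0x1000), ThreadContext(1, 0x2000)]
cycle0 = fetch(contexts, 0)  # instructions tagged with thread 0
cycle1 = fetch(contexts, 1)  # instructions tagged with thread 1
cycle2 = fetch(contexts, 2)  # back to thread 0, continuing from its own PC
```

A real fetch unit would also handle taken branches and cache misses; the point here is only that the PC and return stack are per thread, while the fetch hardware is shared.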
One of the most remarkable aspects of SMT is that it takes relatively little extra logic to add the capability to the execution portion of an out-of-order superscalar processor that employs register renaming and issue queues. Register renaming is a scheme in which the logical registers in an instruction set architecture (ISA) are mapped to a subset of a larger pool of physical hardware registers. Each time an instruction is decoded, the logical register that will be overwritten with the instruction's result (i.e. the destination register) is assigned a mapping to a new physical register, i.e. it is renamed. When the instruction completes execution and retires, its physical destination register becomes officially bound to the logical destination register within the processor state, i.e. the result is committed. Register renaming permits out-of-order execution of instructions to proceed even in the presence of false dependencies, as shown in Figure 2.
Figure 2. Data Dependencies and Register Renaming
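The renaming step can be sketched in a few lines. This is a toy model, not the EV6 mapper: the `Renamer` class and its register counts are hypothetical, and the return of physical registers to the free list at retirement is omitted.

```python
class Renamer:
    """Toy register renamer: maps logical registers to physical ones."""

    def __init__(self, num_logical, num_physical):
        # Initially logical register i lives in physical register i.
        self.map = list(range(num_logical))
        # The remaining physical registers are free for renaming.
        self.free = list(range(num_logical, num_physical))

    def rename(self, dest, srcs):
        # Sources read the current mappings, so true dependencies are kept.
        phys_srcs = [self.map[s] for s in srcs]
        # The destination always gets a fresh physical register, which
        # removes write-after-read and write-after-write (false) dependencies.
        phys_dest = self.free.pop(0)
        self.map[dest] = phys_dest
        return phys_dest, phys_srcs


r = Renamer(num_logical=4, num_physical=8)
# Program order:  r1 = r2 + r3 ;  r2 = r1 + r1 ;  r1 = r3 + r3
first = r.rename(1, [2, 3])
second = r.rename(2, [1, 1])
third = r.rename(1, [3, 3])
```

In this example the first and third instructions both write logical register r1 but receive different physical registers, so the third can execute before the second has read the first's result.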
Register renaming is also done to permit speculative execution beyond conditional branches since it allows the results of speculated instructions to be discarded and earlier processor state restored if the branch turns out to be mispredicted. In this case it is only necessary to restore an older mapping of logical to physical registers.
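Recovery from a mispredicted branch can be sketched the same way. The `CheckpointedMap` class below is a hypothetical simplification: it snapshots the whole map and free list at each predicted branch, which is only one way a real design might buffer older mapping state.

```python
class CheckpointedMap:
    """Toy logical-to-physical map with snapshots for branch recovery."""

    def __init__(self, num_logical, num_physical):
        self.map = list(range(num_logical))
        self.free = list(range(num_logical, num_physical))
        self.checkpoints = []

    def rename_dest(self, dest):
        # Give the destination a fresh physical register.
        phys = self.free.pop(0)
        self.map[dest] = phys
        return phys

    def checkpoint(self):
        # Save a copy of the mapping state when a branch is predicted.
        self.checkpoints.append((list(self.map), list(self.free)))

    def restore(self):
        # Misprediction: discard speculative mappings by restoring the
        # most recent snapshot. No register contents need to be copied.
        self.map, self.free = self.checkpoints.pop()


m = CheckpointedMap(num_logical=4, num_physical=8)
m.rename_dest(1)   # non-speculative write to r1
m.checkpoint()     # conditional branch predicted here
m.rename_dest(1)   # speculative write to r1
m.rename_dest(2)   # speculative write to r2
m.restore()        # branch was mispredicted
```

After `restore()`, r1 again points at the physical register written before the branch, and the registers claimed by the speculative instructions are back on the free list.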
The beauty of register renaming is that it allows an SMT processor to contain multiple thread contexts without the need for multiple physical register sets or additional complicated tracking logic to ensure that results from instructions in different threads are written to the appropriate thread context. For example, the Alpha EV6 has 80 physical integer registers (there are actually 160 integer registers in the EV6 device but these are really two duplicate sets of 80 for reasons I won’t go into) and 72 physical FP registers. At any given time, 31 of the 80 physical integer registers contain the contents of the 31 logical general purpose registers that appear to the programmer in the Alpha ISA (there are actually 32 logical integer registers but one of them always reads as zero, as is customary for RISC architectures). The remaining physical registers are available for renaming. The EV6 uses two separate twelve-port register mappers for integer and FP register renaming, and each can rename up to four instructions per clock. Content-addressable memory (CAM)-based tables are used to hold the register mapping state. The map tables are also buffered so that an older state can be saved and later restored, if necessary, to recover from branch mispredictions and exceptions.
At first glance, implementing a four-way SMT like the EV8 would seem to require four separate and independent register mapping tables, one for each thread. However, this could be physically realized with a single map table if the logical register specifiers used by the mapper are expanded to 7 bits by prepending the two-bit thread identifier associated with a fetched instruction to the 5-bit logical register specifier extracted from the instruction itself. Thread context 0 would then use mapper logical registers 0 through 31, thread 1 would use mapper logical registers 32 through 63, and so on. In this scheme each quadrant of the mapper CAM would be able to be independently backed up in buffers and restored as needed to maintain the illusion of serial, in-order execution of each thread.
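The 7-bit specifier described above amounts to placing the thread identifier in the two high-order bits. A minimal sketch, with the hypothetical helper name `mapper_index`:

```python
LOGICAL_BITS = 5  # logical register specifier taken from the instruction
THREAD_BITS = 2   # thread identifier attached at fetch (four contexts)


def mapper_index(thread_id, logical_reg):
    """Combine thread ID and logical register into one 7-bit mapper index."""
    if not 0 <= thread_id < (1 << THREAD_BITS):
        raise ValueError("thread_id out of range")
    if not 0 <= logical_reg < (1 << LOGICAL_BITS):
        raise ValueError("logical_reg out of range")
    # Thread ID occupies the high bits, so each thread owns one
    # contiguous quadrant of 32 entries in the shared map table.
    return (thread_id << LOGICAL_BITS) | logical_reg


thread0_range = (mapper_index(0, 0), mapper_index(0, 31))
thread1_range = (mapper_index(1, 0), mapper_index(1, 31))
thread3_range = (mapper_index(3, 0), mapper_index(3, 31))
```

With this layout, thread 0 covers entries 0 through 31, thread 1 covers 32 through 63, and thread 3 ends at entry 127, matching the ranges given in the text.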
Early research into 8-issue wide superscalar out-of-order processors suggests that with a 64 entry dispatch queue at least 96, and preferably 128, physical registers are needed to limit the fraction of time the processor is out of free registers to 15% and 10% respectively. It is known that the EV8 supports four thread contexts in hardware. This suggests that the EV8 needs an additional 96 physical integer registers (three extra thread contexts of 32 logical registers each) above and beyond a conventional 8-issue wide superscalar. That places the number of physical integer registers in the EV8 in the range of 192 to 224 for optimal performance. It should be noted that this exceeds even the 128 logical/physical integer registers required in implementations of Intel/HP’s IA-64 instruction set architecture. Such a large, highly ported register file has the potential to seriously limit EV8’s clock rate even with the use of an advanced 0.13 µm process. The best solution to this problem is to spread register read and write access across two pipe stages each instead of one. This has the effect of lengthening the basic execution pipeline from EV6’s seven stages to nine stages, as shown in Figure 3. One study suggests the two extra pipeline stages in the hypothetical EV8 will degrade single thread performance by less than 2%.
Figure 3. Comparison of EV6 and Hypothetical EV8 Execution Pipeline
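The register-count estimate above can be reproduced with a few lines of arithmetic. The breakdown of the extra 96 registers as three additional contexts of 32 logical registers each is an assumed reading of the article's figures, and the variable names are illustrative.

```python
LOGICAL_REGS_PER_THREAD = 32
THREAD_CONTEXTS = 4

# Physical registers suggested for a single-threaded 8-issue design.
BASE_MINIMUM = 96
BASE_PREFERRED = 128

# Assumption: each thread context beyond the first adds one full set
# of logical registers that must be held in the physical register file.
extra = (THREAD_CONTEXTS - 1) * LOGICAL_REGS_PER_THREAD

low_estimate = BASE_MINIMUM + extra
high_estimate = BASE_PREFERRED + extra

print(f"extra registers for SMT: {extra}")
print(f"estimated physical integer registers: {low_estimate} to {high_estimate}")
```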