This is the second installment in a three-part article about the futuristic Alpha EV8, the most powerful and ambitious microprocessor yet proposed. It examines the technical challenge and promise of a powerful new technology called Simultaneous Multithreading, or SMT.
What is Simultaneous Multithreading?
Generally speaking, there are two types of parallelism that can be exploited by modern computing machinery to achieve higher performance. The Instruction Level Parallelism (ILP) approach attempts to reduce program runtime by overlapping the execution of as many instructions as possible, to as great a degree as possible. The EV8 will achieve higher performance than earlier Alpha designs through the enhanced exploitation of ILP made possible by its eight-instruction issue width. But gains from higher ILP come at a high and ever-increasing price. Building wider machines runs into the problem of geometrically increasing complexity in control logic, while data and control dependencies within the program code limit performance increases. John Hennessy of Stanford University has likened the difficulty of exploiting ever more ILP for greater performance to the task of pushing a boulder up a mountain whose slopes grow steeper the further processor architects progress [1].
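To make the dependency limit concrete, here is a minimal sketch, in Python, of a perfect scheduler constrained only by data dependencies and issue width. The five-instruction program, its dependence edges, and the unit latency are illustrative assumptions, not anything from the EV8; the point is simply that widening the machine beyond the length of the dependence chain buys nothing.

```python
# Hypothetical five-instruction program: each entry maps an instruction
# to the instructions whose results it needs (all names are made up).
program = {
    "i1": [],            # load  r1 <- a
    "i2": [],            # load  r2 <- b
    "i3": ["i1", "i2"],  # add   r3 <- r1 + r2
    "i4": ["i3"],        # mul   r4 <- r3 * r3
    "i5": ["i4"],        # store c  <- r4
}

def min_cycles(deps, issue_width):
    """Earliest finish time for an ideal scheduler limited only by
    dependencies and issue width (unit instruction latency assumed)."""
    issue_cycle = {}
    slots_used = {}
    for instr, preds in deps.items():   # dict preserves program order
        earliest = max((issue_cycle[p] + 1 for p in preds), default=0)
        while slots_used.get(earliest, 0) >= issue_width:
            earliest += 1               # no free slot this cycle: wait
        issue_cycle[instr] = earliest
        slots_used[earliest] = slots_used.get(earliest, 0) + 1
    return max(issue_cycle.values()) + 1

print(min_cycles(program, issue_width=4))  # 4 cycles
print(min_cycles(program, issue_width=8))  # still 4: the chain is the limit
```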
The second form of parallelism is called Thread Level Parallelism, or TLP. This simply means the ability to execute independent programs, or independent parts of a single program, simultaneously using different flows of execution, called threads. The illusion of multiple thread execution is often achieved on a single conventional processor through the use of multitasking. Multitasking relies on the ability of an operating system (OS) to overlap the execution of multiple threads or programs on a single processor by running each thread successively for short intervals. This is shown in Figure 1A. In this diagram, rectangles repeated in the horizontal direction represent consecutive clock cycles, while the squares stacked vertically in each rectangle represent the per-cycle utilization of instruction issue slots in a four-way superscalar processor (unused slots are left as white squares).

Figure 1. Multithreaded Execution with Increasing Levels of TLP Hardware Support
Each thread runs for a short interval that ends when the program experiences an exception like a page fault, calls an operating system function, or is interrupted by an interval timer. When a thread is interrupted, a short segment of OS code (shown in Figure 1A as gray instructions in issue slots) runs, performs a context switch, and hands execution to a new thread. Multitasking provides the illusion of simultaneous execution of multiple threads but does nothing to enhance the overall computational capability of the processor. In fact, excessive context switching causes processor cycles that could have been spent running user code to be wasted in the OS.
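The Figure 1A scenario can be modeled in a few lines of Python. This is a rough sketch under made-up assumptions: the quantum and context-switch costs (QUANTUM_CYCLES, SWITCH_CYCLES) are illustrative numbers, and whole cycles, rather than individual issue slots, are assigned to one owner at a time, since multitasking never shares the processor within a cycle.

```python
# Sketch of OS time-slice multitasking on a single processor (Figure 1A).
QUANTUM_CYCLES = 10   # cycles a thread runs before the timer interrupt
SWITCH_CYCLES = 2     # cycles of OS context-switch code (the gray slots)

def multitask(threads, total_cycles):
    """Label each cycle with the thread (or the OS) owning all issue slots."""
    timeline, t = [], 0
    while len(timeline) < total_cycles:
        timeline += [threads[t]] * QUANTUM_CYCLES  # one thread runs alone
        timeline += ["OS"] * SWITCH_CYCLES         # then the OS switches
        t = (t + 1) % len(threads)
    return timeline[:total_cycles]

timeline = multitask(["T1", "T2"], total_cycles=24)
print(timeline)
print(f"cycles lost to the OS: {timeline.count('OS') / len(timeline):.0%}")
```

Shrinking the quantum makes the interleaving finer, but the OS overhead fraction grows with it, which is exactly the waste described above.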
The most basic type of TLP exploitation that can be incorporated into processor hardware is coarse-grained multithreading (CMT), shown in Figure 1B. The processor incorporates two or more thread contexts (general purpose registers, program counter (PC), process status word (PSW), etc.) in hardware. One thread context is active at a time and runs until an exception occurs or, more likely, a high-latency operation such as a cache miss on a load instruction is encountered. When this happens, the processor hardware automatically flushes the pipeline, changes the active thread context, and switches execution to a new thread.
For contemporary MPUs, a memory operation initiated in response to a cache miss can take over a hundred clock cycles, which represents the potential execution of hundreds of instructions. A conventional in-order processor will simply stall and forever lose those hundreds of potential instruction slots while waiting for memory to respond with the needed data. A conventional out-of-order execution processor has the potential to continue executing other instructions that aren't dependent on the missed load data. However, independent instructions tend to be quickly exhausted in most programs, and the processor simply takes a little longer to stall.
But a coarse-grained multithreaded processor has the opportunity to quickly switch to another thread after a cache miss and perform useful work while the first thread awaits its data from memory. Because many programs spend considerable time waiting on memory operations, a coarse-grained multithreaded processor can increase overall system throughput compared to a conventional processor performing OS-based multitasking. The IBM PowerPC RS64, also known as Northstar, is rumored to incorporate two-way coarse-grained multithreading capability, although it is not utilized in some product lines.
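A back-of-the-envelope model shows why switch-on-miss pays off. In this sketch, the miss latency, the run length between misses, and the thread-switch cost (MISS_LATENCY, RUN_BEFORE_MISS, SWITCH_COST) are all invented round numbers chosen to echo the hundred-cycle misses discussed above, not measurements of any real CMT machine.

```python
# Sketch of coarse-grained multithreading (Figure 1B): a thread owns the
# pipeline until it misses in cache, then the hardware switches to another
# ready thread instead of stalling for memory.
MISS_LATENCY = 100     # cycles to service a cache miss (illustrative)
RUN_BEFORE_MISS = 25   # cycles of useful work between misses (illustrative)
SWITCH_COST = 3        # pipeline flush/refill cost of a thread switch

def throughput(num_threads, total_cycles):
    """Fraction of cycles spent on useful work under switch-on-miss CMT."""
    ready_at = [0] * num_threads   # cycle when each thread's data returns
    busy = cycle = 0
    t = 0
    while cycle < total_cycles:
        if ready_at[t] <= cycle:               # thread has its data: run it
            busy += RUN_BEFORE_MISS
            cycle += RUN_BEFORE_MISS
            ready_at[t] = cycle + MISS_LATENCY  # then it misses again
            cycle += SWITCH_COST               # hardware switches threads
        else:
            cycle += 1                         # everyone waiting: stall
        t = (t + 1) % num_threads
    return busy / cycle

for n in (1, 2, 4):
    print(f"{n} thread(s): ~{throughput(n, 100_000):.0%} of cycles busy")
```

With one thread the processor idles through most of each miss; with a few threads the miss latencies of different threads overlap and the busy fraction climbs accordingly.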
A more comprehensive way to exploit TLP in hardware is the fine-grained multithreaded (FMT) processor. The operation of one variant of this class of machine is shown in Figure 1C. In this type of design there are N thread contexts in the processor, and instructions from each thread are allocated every Nth processor clock cycle to advance through the processor's execution pipeline by one stage. Figure 1C shows the operation of a four-way fine-grained multithreaded processor, i.e. N = 4. At first glance it seems like each thread has only 1/Nth the performance potential of a conventional processor. In practice it is much better than this, simply because the execution pipeline can be made much shorter from the logical viewpoint of a single thread. This reduces instruction latencies, simplifies compiler code scheduling, and increases the instructions per clock (IPC) component of performance.
For example, a four-way fine-grained multithreaded processor might provide single cycle latency floating point (FP) addition, while conventional processors typically require three or four cycles of latency. That is possible because the FP adder has four physical processor clock cycles to advance a thread's FP add instruction through what is, from the thread's viewpoint, a single logical execution pipeline stage. In a similar fashion, memory latency appears to be 1/Nth the number of processor clock cycles from the viewpoint of individual threads. The hardware cost of fine-grained multithreading is relatively modest: N thread contexts, plus control logic and multiplexors to cyclically commutate instructions and data from N different threads into and out of the execution units. The drawback of this approach is that its performance running any single thread is still appreciably less than that of a conventional processor, although system throughput is increased. An example of a fine-grained multithreaded processor is the five-threaded MicroUnity MediaProcessor [2].
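The latency-hiding arithmetic behind the FP add example is simple enough to check directly. The sketch below models the strict round-robin rotation of Figure 1C with N = 4 and a four-cycle physical adder; the assertion verifies that each result is ready before its thread's next turn comes around, so the thread observes single-cycle latency.

```python
# Sketch of fine-grained multithreading (Figure 1C): N contexts issue in
# strict rotation, one thread per cycle.
N = 4                # hardware thread contexts
FP_ADD_LATENCY = 4   # physical pipeline cycles for an FP add

for cycle in range(12):
    thread = cycle % N                     # strict round-robin issue
    result_ready = cycle + FP_ADD_LATENCY  # when the adder produces a result
    next_turn = cycle + N                  # when this thread issues again
    assert result_ready <= next_turn       # latency hidden by the rotation
    print(f"cycle {cycle:2d}: thread {thread} issues; "
          f"result at cycle {result_ready}, next slot at cycle {next_turn}")
```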
The EV8 uses a more powerful mechanism than either coarse- or fine-grained multithreading to exploit TLP. Called Simultaneous Multithreading (SMT), it allows instructions from two or more threads to be issued to the execution units in the same cycle. This process is illustrated conceptually in Figure 1D. The advantage of SMT is that it permits TLP to be exploited all the way down to the most fundamental level of hardware operation: individual instruction issue slots within a given clock period. This allows instructions from alternate threads to take advantage of the individual issue opportunities left open by the normal ILP inefficiencies of single thread program execution. SMT can be thought of as equivalent to the airline practice of using standby passengers to fill seats that would otherwise have flown empty.
Consider a single thread executing on a superscalar processor. Conventional superscalar processors such as the Alpha EV6 fall well short of utilizing all the available instruction issue slots. This is caused by execution inefficiencies including data dependency stalls, the cycle-by-cycle shortfall between the ILP a thread offers and the issue resources the processor provides (given its limited re-ordering capability), and memory accesses that miss in cache. The big advantage of SMT over other approaches is its inherent flexibility in providing good performance over a wide spectrum of workloads. Programs with a lot of extractable ILP can enjoy nearly the full benefit of the processor's wide issue capability, while programs with poor ILP can share instruction issue slots and execution resources with other threads, filling slots that would otherwise have gone unused. A rough model of this slot-filling effect is sketched below.
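In this toy model, each thread offers a random handful of ready instructions per cycle, standing in for the fluctuating ILP described above; the distribution in ready_instructions, the eight-wide issue width, and the thread counts are all illustrative assumptions rather than EV8 specifics. SMT packs slots left empty by one thread with ready instructions from the next, the standby-passenger effect in code.

```python
# Sketch of SMT issue-slot filling (Figure 1D).
import random

ISSUE_WIDTH = 8    # eight-wide issue, echoing the EV8 (illustrative model)
CYCLES = 100_000

def ready_instructions():
    """Toy per-thread, per-cycle extractable ILP (averages 2.5)."""
    return random.choice([0, 1, 2, 3, 4, 5])

def utilization(num_threads):
    """Fraction of issue slots filled when threads share each cycle."""
    used = 0
    for _ in range(CYCLES):
        slots = ISSUE_WIDTH
        for _ in range(num_threads):
            take = min(slots, ready_instructions())  # fill leftover slots
            used += take                             # from the next thread
            slots -= take
    return used / (CYCLES * ISSUE_WIDTH)

random.seed(1)
for n in (1, 2, 4):
    print(f"{n} thread(s): ~{utilization(n):.0%} of issue slots filled")
```

One thread leaves most of an eight-wide machine idle; adding threads pushes slot utilization toward full, without requiring any single program to supply more ILP.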