Alpha EV8 (Part 3): Simultaneous Multi-Threat

SMT Can Make Programs Run Faster

Using SMT to double the instruction throughput of an EV8 is a major win for throughput-dominated applications like database management systems running on server-class machines. The performance of such a system is judged by the number of jobs or transactions that can be executed per unit time. But many important high-end applications are judged by how fast they can run to completion and provide the answer to some problem. If you are trying to predict tomorrow’s weather, the ability to run one full simulation to completion in 8 hours is obviously qualitatively different from the ability to run four separate simulations to completion in 32 hours, even though the instruction execution rate over that time may be the same.

Much research in programming language and compiler design has gone into automatically parallelizing large-scale, end-result-oriented programs like weather prediction, that is, into designing a compiler that can divide up the computational task so that two or more processors can be harnessed to run the program to completion in less time than is possible with a single processor. A limited degree of success can already be seen in multiprocessor SPECfp95 submissions such as those shown in Figure 1 for the AlphaServer ES40 Model 6/667 and AlphaServer 8400 6/575.


Figure 1. SPECfp95 Improvement From Compiler Parallelized Code

Do not confuse this with SPECrate improvement for increasing number of processors. SPECrate is a throughput benchmark and will generally increase with more processors without any special effort. The result shown in Figure 1 demonstrates the ability of Compaq’s compiler to automatically identify program elements that can be run in parallel on multiple processors and generate the necessary code to distribute and synchronize the computations across those processors, and achieve a net decrease in run time. Theoretically, N processors should be able to run a program in 1/Nth the time of a single processor. Unfortunately, real programs are composed of sections that can be broken down and run in parallel and others that are serial in nature. Amdahl’s law tells us that the maximum speedup we can achieve from parallel processing is limited by the fraction of a program that is serial in nature. In addition, there is program overhead involved in synchronizing separate pieces of code running on different processors that further limits potential speedup.
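To make the Amdahl's law limit concrete, here is a minimal Python sketch of the standard formula, speedup = 1 / (s + (1 - s) / N), where s is the serial fraction and N the number of processors. The function name and the sample serial fractions are illustrative choices, not figures from the article, and the synchronization overhead mentioned above is deliberately left out, so real speedups would be lower.

```python
def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    """Upper bound on speedup from Amdahl's law.

    serial_fraction: share of single-processor run time that cannot be
    parallelized (0.0 to 1.0). Synchronization overhead is ignored here.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


if __name__ == "__main__":
    # Illustrative serial fractions only; not measured values.
    for s in (0.0, 0.1, 0.25, 0.5):
        row = ", ".join(
            f"N={n}: {amdahl_speedup(s, n):.2f}x" for n in (2, 4, 8)
        )
        print(f"serial fraction {s:.2f} -> {row}")
```

With a serial fraction of zero the sketch returns the ideal N-fold speedup; as the serial fraction grows, adding processors yields rapidly diminishing returns, and the speedup can never exceed 1/s.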

The SMT capabilities of the EV8 are presented to programmers in the form of four CPU-like processing elements called TPUs (thread processing units). As far as software is concerned, an EV8 is a 4-way CMP (chip-level multiprocessor) even though there is only one physical processor. The same compiler technology that can break down a program for parallel execution on a multiprocessor system can also decompose a program into threads for accelerated execution on an SMT processor. In fact, running four threads of a single decomposed program can be more efficient on an EV8 than executing four different programs. This is because the threads of a single parallelized program have better locality than the threads of four different programs, and thus make better use of the available TLB and cache resources due to lower miss rates. This effect can be seen in the parallel-portion figures of Table 1, which gives simulation data for a hypothetical 8-issue SMT running a mixture of scientific and technical applications [2].

Table 1. Simulated IPC for Parallelized and Multiprogrammed Workloads

| Number of Threads | Multiprogrammed Workload | Parallel Decomposed Program | Parallel Portion Only of Decomposed Program |
|---|---|---|---|
| 1 | 2.2 | 2.4 | 2.8 |
| 2 | 4.2 | 3.5 | 4.6 |
| 4 | 4.7 | 4.2 | 5.5 |
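The IPC figures in Table 1 are easier to compare as relative gains over the single-thread case. The short Python sketch below does that arithmetic; the dictionary keys and the helper name `relative_gain` are illustrative, and the numbers are the simulated values for the hypothetical 8-issue SMT in Table 1, not measurements of an EV8.

```python
# Simulated IPC copied from Table 1, keyed by number of threads.
TABLE_1_IPC = {
    "multiprogrammed": {1: 2.2, 2: 4.2, 4: 4.7},
    "parallel_decomposed": {1: 2.4, 2: 3.5, 4: 4.2},
    "parallel_portion_only": {1: 2.8, 2: 4.6, 4: 5.5},
}


def relative_gain(ipc_by_threads: dict, threads: int) -> float:
    """Fractional IPC improvement over the single-thread case (0.5 means +50%)."""
    return ipc_by_threads[threads] / ipc_by_threads[1] - 1.0


if __name__ == "__main__":
    for workload, ipc in TABLE_1_IPC.items():
        gains = ", ".join(
            f"{t} threads: {relative_gain(ipc, t):+.0%}" for t in (2, 4)
        )
        print(f"{workload}: {gains}")
```

Going from one to four threads, for example, raises the IPC of the whole parallel decomposed program from 2.4 to 4.2, a gain of 75%.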

These results also seem to be consistent with data that EV8 architect Joel Emer presented at the 1999 Microprocessor Forum (unsurprisingly, since Mr. Emer was a co-author of [2]). The Compaq EV8 presentation claimed SMT will improve EV8 multiprogrammed workload throughput by between 50% and 120%, and performance of three parallel decomposed SPEC95 benchmark components by between 30% and 100%. For applications that are multithreaded in nature to start with, the speedup was expected to be between about 35% and 150%.

The primary drawback to implementing SMT is the extra complexity that it adds to an already very complex out-of-order superscalar processor. Even though Compaq estimates that SMT adds less than 10% to the EV8 processor core area, the bigger costs may be harder to quantify. One is the extra time it takes to design, debug, verify, and characterize an SMT device and put it into production. Another is the impact of SMT on processor clock rate. As was described in part two of this article, the EV8 execution pipeline will likely be stretched by at least two extra stages compared to the EV6/EV7 processor core in order to access much larger and highly ported register files without affecting the maximum clock frequency. This change will likely keep the largest and most obvious cost of SMT on a RISC MPU, the large increase in rename registers needed to hold per-thread architected state, off the critical path. While I expect that SMT holds other hidden difficulties, the Alpha design team has repeatedly demonstrated the ability to tackle complex microarchitectural implementation challenges using a combination of innovative logic, circuit, and physical design techniques without allowing clock rate to suffer unduly.

