Alpha EV8 (Part 3): Simultaneous Multi-Threat

TLP Wars: SMT versus CMP

SMT is not the only approach to MPU design that can exploit thread-level parallelism (TLP). The most obvious approach, and the one most appealing to chip designers, is chip-level multiprocessing (CMP). To see why, note that transistor count is not a measure of the design complexity of a chip. Indeed, an SRAM chip might require a design effort two orders of magnitude smaller than a microprocessor with the same transistor count. The difference is the number of unique circuits and structures that must be designed, laid out, placed, and verified. Although a structure like an adder might be replicated a half dozen times or more in an MPU, the control portion contains very little logic that can be re-used elsewhere in the design.

The appeal of CMP is simple. If you have a trusted 3- or 4-issue-wide superscalar processor core shipping in production devices, then that core represents a huge investment in design, debug, and verification effort, in both time to market and dollars. Next generation processors typically target the next generation CMOS process, which offers higher clock rates, lower power, and room for greater design complexity. Why expend the huge effort of developing a new 6- or even 8-issue-wide processor core, a design effort of exponentially increasing complexity and diminishing returns, when you could simply take advantage of the new process to plunk down two or more exact copies of your current processor core along with a large shared L2 cache? A large engineering effort is still needed to port the existing core to the new process, but it is far less costly, time-consuming, and error-prone than designing and implementing an all new microarchitecture. Another advantage of CMP is that each replicated CPU is smaller than a single, more complex processor core and can therefore likely be clocked faster. CMP also offers the option of targeting a wider spread of price/performance points than simple frequency binning allows, by selling devices with one processor disabled, either by deliberate choice or because of a manufacturing defect in that CPU.

The implementation advantages of CMP are very attractive, and they are likely why IBM chose it for POWER4 and why AMD has apparently chosen it for ‘Sledgehammer’, the server-class implementation of its next generation K8 processor core. However, in both of these cases the replicated core is not taken from an existing product but is instead a new, or substantially new, core. So the appeal of CMP in these two instances would seem to be the ability to achieve high performance on multithreaded applications with a new core of moderate complexity. The strategy of higher performance through ‘cut and paste’ also seems evident in the Sun MAJC-5200 device [3]. The MAJC-5200 contains two copies of a 4-issue-wide VLIW processor core that share a 16 KB data cache, a DRDRAM memory controller, and other functions.

The drawback of CMP is that the execution resources contained in the extra processor(s) cannot be used to harness the extra instruction-level parallelism (ILP) present in a single threaded program and thus increase its IPC. This effect can be seen in Table 2, which compares the instruction throughput of an EV8-like SMT processor to that of a CMP design with dual 4-issue-wide superscalar processor cores, each with half the execution units of the SMT in this scenario [2]. For single threaded applications, SMT can exploit its 2x larger instruction issue width to capture more ILP and yield about 14% higher IPC. For multithreaded workloads with four or more threads, the SMT provides 27% higher throughput with the same execution resources.

Table 2: Comparison of Instruction Throughput for CMP and SMT

Processor Type   Threads = 1   Threads = 2   Threads = 4
CMP              2.1           3.3
SMT              2.4           3.5           4.2
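
The percentages quoted above can be checked directly against the Table 2 figures. The following is only a minimal Python sketch: the variable and function names are illustrative, and it assumes that the 27% figure compares the SMT result at four threads (4.2) with the CMP multithreaded figure (3.3).

```python
# Throughput figures copied from Table 2.
SMT_IPC = {1: 2.4, 2: 3.5, 4: 4.2}   # keyed by thread count
CMP_SINGLE_THREAD_IPC = 2.1
# Assumption: 3.3 is the CMP figure that SMT's four-thread result
# is compared against in the text.
CMP_MULTITHREAD_IPC = 3.3


def percent_gain(new: float, baseline: float) -> float:
    """Return how much higher `new` is than `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


single_thread_gain = percent_gain(SMT_IPC[1], CMP_SINGLE_THREAD_IPC)
multithread_gain = percent_gain(SMT_IPC[4], CMP_MULTITHREAD_IPC)

print(f"Single thread: SMT ahead of CMP by {single_thread_gain:.0f}%")
print(f"Four threads:  SMT ahead of CMP by {multithread_gain:.0f}%")
```

Under that assumption the two ratios round to the 14% and 27% advantages cited in the text.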

It should also be pointed out that CMP and SMT are not necessarily mutually exclusive technologies. It is conceivable that in the future Compaq might integrate two copies of the EV8 design on a single integrated circuit as an alternative to increasing thread support and/or issue width beyond EV8. As far as software is concerned, there is no significant difference between an 8-thread SMT MPU and a CMP device with two 4-thread SMT MPUs.

