Two Approaches to Multithreading
As previously mentioned, both the Montecito and POWER5 microprocessors exploit thread-level parallelism (TLP) in two ways: each device integrates two processor cores, i.e. two-way chip-level multiprocessing (CMP), and each core incorporates two-way multithreading. Each device thus has four separate instances of architected processor state to support the execution of four different threads. However, Montecito and POWER5 use very different multithreading techniques within their CPUs.
The Montecito uses coarse-grained multithreading (CMT), which is sometimes called switch-on-event multithreading (SOEMT). Unlike simultaneous multithreading (SMT), which is used by the POWER5 and some Intel NetBurst MPUs, CMT doesn't improve instruction throughput by overlapping instruction execution from two threads to exploit momentarily idle issue slots and functional units. Instead, CMT has the more modest goal of improving instruction throughput by automatically switching back and forth between threads whenever a thread encounters a long-latency event such as an L3 cache miss. This difference in operating principle is shown in Figure 7.
Figure 7 – Throughput Increase from Multithreading
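The difference in operating principle can also be sketched with a toy cycle-by-cycle model. This is only an illustration under made-up assumptions, not a model of either processor: the issue width, per-thread issue rate, burst length and stall length are all hypothetical, thread switches cost nothing, and each thread simply alternates between a compute burst and a fixed memory stall. In "cmt" mode one thread owns the pipeline until it stalls; in "smt" mode every ready thread competes for issue slots in the same cycle.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    left: int          # instructions left in the current compute burst
    ready_at: int = 0  # first cycle at which the thread is not stalled


def simulate(mode, n_threads, cycles=10_000, width=3, ipc=2,
             burst=180, stall=10):
    """Return instructions issued in `cycles` cycles (toy model).

    All parameters are hypothetical: `width` is the issue width of the
    core, `ipc` the most a single thread can issue per cycle, `burst`
    the instructions between memory stalls, `stall` the stall length.
    """
    threads = [Thread(left=burst) for _ in range(n_threads)]
    active = 0   # thread that currently owns the pipeline (CMT only)
    issued = 0
    for now in range(cycles):
        ready = [i for i, t in enumerate(threads) if t.ready_at <= now]
        if mode == "smt":
            owners = ready            # all ready threads share the slots
        else:
            # CMT: switch only when the owning thread is stalled and
            # another thread is ready (zero switch penalty assumed).
            if active not in ready and ready:
                active = ready[0]
            owners = [active] if active in ready else []
        slots = width
        for i in owners:
            t = threads[i]
            n = min(ipc, t.left, slots)
            t.left -= n
            slots -= n
            issued += n
            if t.left == 0:           # burst finished: long-latency miss
                t.left = burst
                t.ready_at = now + 1 + stall
    return issued


if __name__ == "__main__":
    print("single thread:", simulate("cmt", 1))
    print("two-way CMT:  ", simulate("cmt", 2))
    print("two-way SMT:  ", simulate("smt", 2))
```

With these invented parameters a lone thread is stalled 10% of the time, so CMT can at best recover those idle cycles, while SMT additionally fills the issue slots a single thread leaves empty in every cycle. The absolute numbers mean nothing; only the ordering of the three results reflects the argument in the text.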
Although CMT is less effective than SMT, it is a much better match for the in-order, statically scheduled Montecito processors. For a very small increase in processor complexity and area, CMT can be expected to give Montecito a performance increase on the order of 20% or more on memory-intensive, data-sharing-intensive, cache-unfriendly workloads such as on-line transaction processing (OLTP). But in most cases the benefit will be much less. For example, published data shows that a McKinley running SPECint2k and SPECfp2k is stalled on memory only about 5% and 11% of the time respectively. Given compiler improvements since 2002 and the quadrupled L3 capacity in Montecito, this suggests that CMT will increase SPECrate2k throughput by only roughly 2% to 5%.
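A back-of-the-envelope bound helps put the stall figures in context. The sketch below is an assumption of this rewrite, not the method behind the estimate above: it supposes that every memory stall of one thread is perfectly overlapped with useful work from the other, with no switch penalty and no extra cache contention. The function name `cmt_gain_ceiling` is hypothetical.

```python
def cmt_gain_ceiling(stall_fraction: float) -> float:
    """Idealized upper bound on the throughput gain of two-way CMT.

    If a lone thread is stalled a fraction s of the time, it keeps the
    core busy for (1 - s). Two such threads can keep it busy for at
    most min(1, 2 * (1 - s)), so the speedup is min(1 / (1 - s), 2).
    Assumes free thread switches and unchanged stall behaviour.
    """
    if not 0.0 <= stall_fraction < 1.0:
        raise ValueError("stall_fraction must be in [0, 1)")
    speedup = min(1.0 / (1.0 - stall_fraction), 2.0)
    return speedup - 1.0


if __name__ == "__main__":
    # Stall fractions quoted in the text for McKinley.
    for name, s in (("SPECint2k", 0.05), ("SPECfp2k", 0.11)):
        print(f"{name}: ceiling {cmt_gain_ceiling(s):.1%}")
```

Under these idealized assumptions the ceilings come out a little above 5% and 12% for the two McKinley stall fractions. A lower real-world figure is what one would expect once the stall fraction itself shrinks with better compilers and a larger L3, and once switch overhead is counted.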
SMT can be implemented as a direct extension of the dynamic scheduling hardware in an aggressively out-of-order (OOO) execution MPU design, with a complexity/area penalty of under 10% in a RISC design. However, IBM chose to significantly increase processor resources in POWER5 to maximize the throughput boost from SMT. This raised the cost of introducing SMT to a 24% growth in CPU area. Although this is an order of magnitude greater than what Intel incurred in adding CMT to Montecito, IBM claims substantial performance improvements from SMT: ~35% in database transaction processing and WebSphere workloads, ~28% in SAP, and ~45% in Domino R6 Mail. IBM also claims SMT increases throughput for SPECint_rate2k and SPECfp_rate2k by about 21% and 10% respectively. This is significantly better than what CMT can do for Montecito because SMT overlaps the instruction execution of both threads, rather than only covering long thread stalls.