Software Implications of SMT
An obvious question is how an SMT processor exposes its multithreading capabilities to software. In the case of the EV8, it does so through an abstraction called a thread processing unit, or TPU. A TPU is essentially a single-threaded virtual processor that is presented to the lowest level of the operating system’s hardware abstraction layer (HAL). The EV8’s four-way SMT capabilities are represented by four separate TPUs, as shown in Figure 5.
Figure 5. Software View of the EV8
Essentially, the EV8 appears to software as four separate processors that share a single set of translation lookaside buffers (TLBs) and caches. The advantages of SMT over a true four-way chip-level multiprocessor (CMP) are that only one physical processor occupies die area and that cache coherency among the threads requires no extra logic or overhead.
Can the EV8 execute threads from different processes (i.e., threads with different address spaces) simultaneously? That hasn’t been disclosed, but the simple answer is that it would probably be easy to permit yet undesirable in practice, because it could thrash the TLBs. It would be easy to permit by means of a mechanism called an address space number (ASN) or address space identifier (ASID). In conventional processors, an ASN is a small hardware register (typically 6 to 8 bits in size) containing a unique value that is appended to virtual addresses prior to translation. Its purpose is to speed up context switches in a multitasking operating system by avoiding the need to flush and reload the TLB state and to flush and/or invalidate the caches. By simply changing the value in the ASN register during a context switch, the OS prevents a virtual address from one process from accidentally matching the same virtual address from a previous process in the TLB and/or cache. In an SMT processor it would seem natural to provide a separate ASN register within each thread’s hardware context.
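To make the ASN idea concrete, here is a minimal sketch of a TLB whose entries are tagged with an address space number. It is a toy model, not a description of the EV8: the names `TaggedTLB` and `ThreadContext` are hypothetical, and the per-thread ASN register is the assumption discussed above rather than a disclosed feature.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class TaggedTLB:
    """Toy TLB whose entries are keyed by (ASN, virtual page number)."""
    entries: Dict[Tuple[int, int], int] = field(default_factory=dict)

    def insert(self, asn: int, vpn: int, pfn: int) -> None:
        self.entries[(asn, vpn)] = pfn

    def lookup(self, asn: int, vpn: int) -> Optional[int]:
        # A hit requires both the ASN tag and the virtual page number to match.
        return self.entries.get((asn, vpn))


@dataclass
class ThreadContext:
    """Hypothetical per-thread hardware context with its own ASN register."""
    asn: int

    def translate(self, tlb: TaggedTLB, vpn: int) -> Optional[int]:
        return tlb.lookup(self.asn, vpn)


# Two processes map the same virtual page to different physical frames.
tlb = TaggedTLB()
tlb.insert(asn=1, vpn=0x10, pfn=0xA0)
tlb.insert(asn=2, vpn=0x10, pfn=0xB0)

tpu0 = ThreadContext(asn=1)
tpu1 = ThreadContext(asn=2)
```

Both contexts can translate virtual page `0x10` at the same time without interfering, and a context switch is just an assignment to `asn`; no TLB flush is needed. The sketch also hints at the thrashing concern: every additional address space adds its own entries to the one shared TLB.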
Another important issue is software’s ability to synchronize threads. The Alpha uses a synchronization mechanism based on the load-locked/store-conditional model. This scheme, commonly used by RISC architectures, uses a software-based spin loop to set or wait on a semaphore. In a conventional uniprocessor or multiprocessor system this works well. But on an SMT processor a spin loop is horrendously wasteful, because the spinning thread consumes execution resources that the other threads sharing the processor could otherwise use. To solve this problem Compaq invented a spin loop quiescing feature that allows the TPU associated with a thread executing a spin loop to be put to sleep until the associated semaphore memory location is modified. While asleep, the thread does not consume any processor resources. This feature adds relatively little extra logic to the EV8 because it piggybacks on existing cache coherency mechanisms.
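The load-locked/store-conditional pattern can be illustrated with a small single-threaded simulation. This is a sketch under stated assumptions, not Alpha code: the `Memory` class, its per-processor reservation table, and the `try_acquire` and `release` helpers are all hypothetical names invented for the example.

```python
class Memory:
    """Toy shared memory with per-processor load-locked reservations."""

    def __init__(self):
        self.cells = {}
        self.reservations = {}  # cpu id -> address it has reserved

    def load_locked(self, cpu, addr):
        # Read the location and remember that this cpu is watching it.
        self.reservations[cpu] = addr
        return self.cells.get(addr, 0)

    def store(self, cpu, addr, value):
        # An ordinary store breaks every other cpu's reservation on addr.
        self.cells[addr] = value
        for other, reserved in list(self.reservations.items()):
            if other != cpu and reserved == addr:
                del self.reservations[other]

    def store_conditional(self, cpu, addr, value):
        # Succeeds only if this cpu's reservation on addr is still intact.
        if self.reservations.get(cpu) != addr:
            return False
        del self.reservations[cpu]
        self.store(cpu, addr, value)
        return True


def try_acquire(mem, cpu, lock_addr):
    """One pass of the spin loop; returns True if the lock was taken."""
    if mem.load_locked(cpu, lock_addr) != 0:
        return False  # lock is held; the caller would spin again
    return mem.store_conditional(cpu, lock_addr, 1)


def release(mem, cpu, lock_addr):
    mem.store(cpu, lock_addr, 0)
```

A waiting thread would call `try_acquire` in a loop until it returns True. Note that a store to the watched location is already a visible event in this model (it clears the waiter's reservation), which is the kind of existing signal a quiescing mechanism could plausibly use as its wake-up trigger instead of having the thread spin.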