While the Intel TSX documentation is fairly precise in describing the semantics for both RTM and HLE, critical details are not available for the underlying TM and the specific implementation in Haswell. However, there is sufficient information to speculate about the nature of Haswell’s TM. To review what is known, Haswell implements an unordered, single version, nested TM with strong isolation. The TM tracks the read-set and write-set at a fixed 64B cache line granularity.
As a starting assumption, Haswell’s memory hierarchy probably resembles Sandy Bridge in broad strokes, with roughly 32KB L1 and 256KB L2 caches per core (shared by two threads), and an L3 cache shared by all cores. Haswell almost certainly extends Intel’s existing cache behavior and MESIF coherency protocol to support transactional memory semantics. It is far too risky to totally redefine the coherency protocol, which would also vastly increase the challenge of maintaining full compatibility. It is also entirely out of character for Intel, which tends to favor incremental enhancements rather than radical redesigns for mainstream x86 products.
Intel’s server teams take the existing CPU core and build customized blocks including the L3 cache, QPI logic, I/O, memory controllers and power management, tied together with a high performance interconnect fabric. The Haswell coherency changes are very likely restricted to the L1 and L2 caches, strictly within the core. This has the advantage that Intel’s server group can leverage the work in Haswell with minimal additional investment and effort. Most importantly, restricting changes to the core avoids extra validation, which is one of the largest contributors to cost and schedule for server processors.
Haswell’s Transactional Memory
Based on the assumptions above, we can sketch out the most likely implementation details for Haswell’s transactional memory. The most important missing details relate to TM version management, conflict handling and the commit and abort operations.
Haswell’s transactional memory is most likely a deferred update system using the per-core caches for transactional data and register checkpoints. Specifically, the tags for each cache line in the L1D and L2 probably contain bits that indicate whether the line belongs to the read-set (RS) or write-set (WS) for the two threads that can execute on a Haswell core. A store during a transaction will simply write to the cache line, and the shared L3 cache holds the original data (which may require an update to the L3 for the first write in a transaction to a Modified line). This is consistent with the shared and inclusive L3 cache that has been used since Nehalem. To avoid data corruption, any transactional data (i.e. lines in the RS or WS) must stay in the L1D or L2 and not be evicted to the L3 or memory. The architectural registers are checkpointed and stored on-chip, and the transaction can read and write from registers freely.
Alternatively, it is possible that transactions are restricted to Haswell’s L1D cache. This would simplify the design, but make the TM much less robust. The L1D is really not associative enough for modest sized transactions; associativy evictions can occur with as few as 9 cache lines, even though there about 256 lines available per thread. Using the L2 pretty much eliminates most associativity evictions and each thread will have about 2.2K lines on average and a hard maximum of around 4.5K lines. Again, Intel is likely to be using both the L1D and the L2 for transactional data, to avoid the potential associativity problems.
The advantage of the cache-based system I outlined above is that Haswell’s conflict detection can be handled through the existing coherency protocol with only minor enhancements to the memory ordering buffers, L1D and L2 caches. A conflict can occur in exactly three ways, all of which can be easily detected.
The first case to consider is an RS cache line that is written by another core. In Intel’s MESIF protocol, writing to a potentially shared cache line requires invalidating any copies in other private caches. The other core will send a read-for-ownership request to the shared L3 cache, which will check the snoop filter and transmit an invalidate to any cache with a copy of the line. If an L1D or L2 cache receives an invalidate for an RS cache line, this indicates that another thread is attempting to write and a conflict has occurred.
The second case to consider is a WS cache line that is accessed (read or write) by another core. For the transaction to write to the cache line in the first place, no other core can have a copy, which is standard semantics for the Modified coherency state. However, the inclusive L3 cache will retain the out-of-date, original version of the line prior to the transaction. Thus if another core attempts to access the WS cache line, it will miss in the local L1D and L2 caches, and probe the L3. However, the L3 will access the snoop filter and then send the read (or read-for-ownership) request to the core that has the cache line in the WS. If an L1D or L2 cache receives any read request (or an invalidate) for a line in the WS, then a conflict has occurred.
Those two cases handle a different core interfering with a transaction. The third possibility is that another thread on the same Haswell core, which shares the L1D and L2 caches, interferes with the transaction. In this scenario, the other thread will hit in the shared L1D or L2 cache, rather than triggering any coherency traffic through the L3 cache. However, this can be handled in the load pipelines by checking whether the cache line is in the WS for the other thread; and in the store pipeline by checking whether the cache line is in the RS or WS for the other thread.
The last precaution is that if any RS or WS cache line is evicted to the L3 or memory, the transaction must be aborted. In the case of a WS line, this is to prevent the transactional data from overwriting the original copy that is held in the L3 cache. Strictly speaking, RS lines could be held in the L3 cache, but this would complicate matters considerably and it is much easier to simply abort the transaction. It is highly likely that Intel has modified the cache line replacement and eviction policies in the L2 to avoid evicting transactional data. Another advantage of using the L2 cache for transactions is that the L2 eviction algorithms can tolerate significantly more latency than the L1D cache, where they are on the critical path for any L1D miss.
Commit and Abort
To commit a transaction in the system outlined above is a straight forward matter. The L1D and L2 cache controllers make sure that any cache line in the WS is in the Modified state and zero out the WS bits. Similarly, any cache line that is in the RS (but not the WS) must be in the Shared state and the RS bits are zeroed. The old register file checkpoint is removed, and the existing contents of the register file become the architectural state.
The coherency-based conflict detection scheme can identify conflicts early, as soon as they occur (as opposed to at commit time). Once the conflict has been detected early, it is most likely that the transaction is aborted immediately. However, other problems might cause deferred aborts in some implementations.
Aborting transactions is even simpler than committing. The cache controllers change all the WS lines to the Invalid state and zero out the WS bits. The controllers must also zero out the RS bits, but it is not necessary to invalidate the cache lines (although Haswell might do this). Last, the architectural register file is restored from the old checkpoint.
Intel’s Restricted Transactional Memory more or less provides a direct interface to the underlying TM in Haswell using the appropriate instructions. Hardware Lock Elision probably uses the same basic techniques, but requires a little extra buffering. Any locks which are elided are probably pinned in Haswell’s store buffer (or perhaps a small dedicated structure that is accessed in parallel). Thus any access within an HLE region reads the new lock value, while the old value is kept in the cache so that other threads can simultaneously execute.
Discuss (8 comments)