Intel’s Transactional Synchronization Extensions
The TSX specification provides two interfaces for programmers to take advantage of transactional memory. The first is Hardware Lock Elision (HLE), which maps very closely to the previous work by Rajwar and Goodman. This should be no surprise, as Ravi Rajwar has worked at Intel for quite some time. The second mode is Restricted Transactional Memory (RTM), which resembles classical TM proposals. Both use new instructions to take advantage of the underlying TM hardware, but the objectives are fairly different.
The underlying TM tracks the read-set and write-set of a transaction, at a 64B cache line granularity. The read-set and write-set are respectively all the cache lines that the transaction has read from, or written to during execution. A transaction encounters a conflict if a cache line in its read-set is written by another thread, or if a cache line in its write-set is read or written by another thread. For those familiar with TM terminology, this is known as strong isolation, since non-transactional memory accesses can cause a transaction to abort. Conflicts typically cause the transaction to abort, and false conflicts within a cache line can occur.
TSX can have transactions nested inside each other, which is conceptually handled by flattening the nest into a single transaction. However, there is an implementation specific limit to the amount of nesting, exceeding this limit will cause an abort. Any abort inside a nested transaction will abort all the nested transactions.
Transactions can only be used with write-back cacheable memory operations, but are available at all privilege levels. Not all x86 instructions can be used safely inside of a transaction. There are several x86 instructions that will cause any transaction (for HLE or RTM) to abort, in particular CPUID and PAUSE.
In addition, there are a number of instructions that may cause an abort on specific implementations. These instructions include x87 and MMX, mixed access to XMM and YMM registers, updates to non-status parts of EFLAGs, updating segment, debug or control registers, ring transitions, cache and TLB control instructions, any non-writeback memory type accesses, processor state save, interrupts, I/O, virtualization (VMX), trusted execution (SMX) and several miscellaneous types. Note that this means that a context switch, which typically uses state saving instructions, will almost always abort a transaction. Generally, TSX implementations are intended to be upwards compatible with respect to instructions. For example, if one implementation adds support for VZEROUPPER in a transaction, that capability will not be removed in any future versions.
There are also run-time behaviors that may cause a transaction to abort. Most faults and traps will cause an abort, while synchronous exceptions and asynchronous events may cause an abort. Self-modifying code and accesses to non-write-back memory types may also cause aborts.
There are also implementation specific limits to the size of transactions, and various microarchitectural buffers can be limiting factors. Generally, Intel provides no guarantees about transactional execution. While this is frustrating to programmers because it requires a non-transactional fallback path, this avoids any backwards compatibility issues in the future. It also avoids deadlocking the system if the programmer writes a transaction which cannot possibly succeed.
Restricted Transactional Memory
Ironically, RTM is the more powerful and disruptive of the two techniques, but is conceptually simpler. RTM exposes nested transactional memory to the programmer, which is a very powerful model for expressing parallelism. The downside is that it does require developing separate code for transactional execution, rather than inserting hints into existing code.
RTM is an option for developers, but not a requirement. Code that is outside of a transaction does not have to obey the restrictions mentioned previously. This is necessary, given all the requirements for transactions. It also gives Intel the luxury of introducing new x86 features, without worrying about TM implementation details or overhead.
There are three new instructions, XBEGIN, XEND and XABORT. The XBEGIN instruction starts a transaction and also provides a 16-bit or 32-bit offset to a fallback address. If at any time, the transaction aborts, the thread will resume execution at the fallback address. Throughout the transactional execution, memory accesses are tracked using the read-set and write-set to detect conflicts.
If a conflict occurs during a transaction, it may trigger an abort. However, the actual outcome is implementation specific. For example, if two transactions conflict, both could be aborted. A more sensible approach is to abort only one of the transactions, but consistent rules to determine which one is successful would be helpful.
The XEND instruction indicates the end of a transaction. If the XEND instruction is the last in a nest of transactions (i.e. the outermost transaction), then it will attempt to atomically commit the changes. A successful commit will overwrite the old architectural state including both registers and memory, with new values generated during the transaction. From an ordering perspective, all the memory operations appear to have executed atomically. If the commit fails, then the changes to architectural state are aborted.
An abort simply rolls back any and all changes to architectural state that occurred during transactional execution, restoring the state prior to the first XBEGIN instruction (i.e. the outermost transaction). This includes any writes to the architectural registers or memory. Any abort within a nested set of transactions will cause the entire nest to abort. In addition, the abort updates the EAX register with an 8-bit value that potentially describes the cause of the abort. Last, the execution is redirected to the fallback point. If a nested set of transactions aborts, then the fallback address is taken from the first XBEGIN instruction (i.e. the outermost transaction).
The XABORT instruction immediately triggers an abort, just as if a commit had been unsuccessful. An explicit abort command is useful when the programmers can determine that a transaction is going to fail, without any help from the hardware. Aborting the transaction early can help reduce the performance and power penalty.
Hardware Lock Elision
HLE is much easier to integrated with existing x86 software and is also backwards compatible with x86 microprocessors that do not have TSX support. Like its namesake, the goal is to improve the performance of synchronization by enabling simultaneous non-conflicting accesses to shared data.
HLE introduces two new instruction hint prefixes, called XACQUIRE and XRELEASE that are used to denote the bounds of the lock elision. XACQUIRE is a prefix for instructions that acquire a lock. It indicates the start of a region for lock elision. The memory address of the lock instruction is added to the read-set of a transaction, but does not write new data to the lock address.
The thread then enters transactional execution and continues on to the instructions inside the HLE region, adding memory accesses to the read-set and write-set. Any read to the lock address from within the HLE region will return the new data, but any read from another thread will return the old data. This enables many threads to simultaneously acquire a lock and make non-conflicting memory accesses to shared data.
XRELEASE is a prefix that is used for the instruction that releases the lock address, and it marks the end of the HLE region. Since the lock address was not written to at the start of the region, no further writes are needed to restore it to the original value. At the outermost XRELEASE, the processor attempts to commit the transaction; if successful then the HLE region was executed without acquiring or releasing the lock.
If a conflict occurs, then the processor will restore the architectural register state prior to XACQUIRE and discard any writes to memory from the HLE region. The thread will execute the region again, without HLE and with the standard pessimistic locking behavior. Any lock which crosses a cache line cannot be elided and will automatically trigger re-execution without HLE.
There is also an implementation specific limit to the number of locks that can be elided simultaneously; any further locks will be executed normally. While nested HLE is available, recursive locking is not supported. Writing to the lock address from within the HLE region will cause an HLE abort.
Another potential source of HLE aborts is locks that reside on the same cache line. Since Intel’s TM can only track at cache line granularity, two locks on the same cache line will cause aliasing and unnecessary HLE aborts.
Hardware without HLE support will simply ignore the hint prefixes and also execute the region with standard behavior. One of the key benefits is that HLE is compatible with existing lock-based programming and takes little developer effort to implement.
Discuss (8 comments)