Existing x86 Behavior
One alternative to modifying the caches to support TSX is extending the functionality of the Memory Ordering Buffer (MOB) and Re-Order Buffer (ROB) that are found in modern x86 microprocessors. x86 processors have a relatively strong ordering model; generally, loads and stores must appear to execute in program order. However, modern microprocessors can substantially enhance performance by re-ordering loads and stores during out-of-order execution.
The load and store buffers that make up the MOB are critical structures, responsible for maintaining the x86 memory ordering model and handling related functions. Conceptually, the MOB is akin to the ROB, but specialized for loads and stores. The entries in the load and store buffers are kept in program order for correctness, even if the actual memory accesses occur out-of-order to achieve higher performance. The MOB also handles store-to-load forwarding, where a load can get data from a recent store without actually accessing the cache, which saves power and reduces latency.
For specifics, consider the Sandy Bridge core. Sandy Bridge can have 168 total instructions in-flight in the ROB, with up to 64 loads and 36 stores in the load and store buffers.
When a store is issued to the out-of-order core for renaming and scheduling, an entry in the store buffer is allocated (in-order) for the address and the data. The store buffer will hold the address and data until the instruction has retired and the data has been written to the L1D cache.
Analogously, when a load is issued, an entry in the load buffer is reserved for the address. However, loads must also compare the load address against the contents of the entire store buffer to check for aliasing with older stores. If the load address matches an older store, then the load must wait for the older store to complete to preserve the dependency. Most x86 processors optimize this further by allowing the store to forward data to the load without accessing the cache. The load buffer entry can be released once the instruction has retired and the load data is written into the register file.
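The store buffer check described above can be sketched as a toy model. This is only an illustration: the names `StoreEntry` and `lookup_load` are hypothetical, addresses are compared for exact equality, and real hardware must also handle partial overlaps and differing access sizes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class StoreEntry:
    address: int
    data: Optional[int]  # None until the store data is available


def lookup_load(store_buffer: List[StoreEntry], load_address: int,
                cache: Dict[int, int]) -> Tuple[str, Optional[int]]:
    """Resolve a load against older stores, then fall back to the cache.

    store_buffer is kept in program order (oldest first) and is assumed
    to contain only stores older than the load.
    """
    # Scan youngest-to-oldest so the most recent matching store wins.
    for entry in reversed(store_buffer):
        if entry.address == load_address:
            if entry.data is None:
                return ("wait", None)          # dependency not yet resolved
            return ("forwarded", entry.data)   # no cache access needed
    return ("cache", cache.get(load_address, 0))
```

In this sketch a load that aliases an older store either waits or receives forwarded data, and only a load with no match touches the cache.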
Because of the strong x86 ordering model, the load buffer is snooped by coherency traffic. A remote store must invalidate all other copies of a cache line. If a cache line is read by a load, and then invalidated by a remote store, the load must be cancelled, since it potentially read invalid data. The x86 memory model does not require snooping the store buffer.
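As a rough illustration of the snoop, the following sketch finds the loads that a remote store would force to be cancelled. The function name `loads_to_cancel`, the tuple layout, and the 64-byte line size are assumptions made for the example, not details taken from any Intel documentation.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size


def loads_to_cancel(load_buffer, remote_store_address):
    """Return indices of loads hit by a remote store's invalidation.

    load_buffer is a list of (address, has_executed) tuples in program
    order. Only loads that already read their data can hold a stale value.
    """
    line = remote_store_address // CACHE_LINE_BYTES
    return [
        index
        for index, (address, has_executed) in enumerate(load_buffer)
        if has_executed and address // CACHE_LINE_BYTES == line
    ]
```

Loads that have not yet executed are left alone in this model, since they will read the updated line when they eventually access the cache.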
Minor Changes Needed
As Table 2 shows, the behavior of the load and store buffers and register renaming in modern x86 microprocessors is very close to providing transactional memory semantics. The rows in Table 2 are colored according to the changes required. White indicates practically no changes are necessary, grey denotes minor changes, while light purple means major changes are needed.
Haswell could implement TSX with a few modifications to the pipeline. The overall idea is to handle relatively small transactions using out-of-order execution. Before a transaction starts, the old OOO window must be retired and cleared. Transaction commits are handled by retiring the speculative OOO window, and aborts are treated like a pipeline clear (e.g. due to branch misprediction or an exception).
The load buffer essentially tracks the RS already. The only necessary change is avoiding any buffer overflow. The buffer is locked at the start of a transaction, so loads do not leave the RS; additionally, a transaction must be aborted if an entry is not available for allocation. The load buffer is already snooped; if a remote store hits in the buffer, that indicates a conflict has occurred and the transaction must be aborted. Similarly, the ROB must be locked to prevent instructions from retiring and potentially overwriting the saved architectural registers.
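The three load buffer rules above (lock entries, abort on overflow, abort on a snoop hit) can be modeled in a few lines. This is a speculative sketch of the scheme the article proposes, not a description of Haswell's actual implementation; `TxLoadBuffer`, `TransactionAbort`, and the default capacity are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumed granularity for conflict detection


class TransactionAbort(Exception):
    """Raised when the modeled transaction cannot continue."""


class TxLoadBuffer:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.lines = []              # cache lines read, in program order
        self.in_transaction = False

    def begin(self):
        # The old out-of-order window is assumed to be drained already.
        self.lines.clear()
        self.in_transaction = True

    def load(self, address):
        """Allocate an entry; return False if a normal load must stall."""
        if len(self.lines) >= self.capacity:
            if self.in_transaction:
                raise TransactionAbort("load buffer overflow")
            return False
        self.lines.append(address // CACHE_LINE_BYTES)
        return True

    def retire_oldest(self):
        # Entries are locked during a transaction so they stay in the RS.
        if not self.in_transaction and self.lines:
            self.lines.pop(0)

    def snoop_remote_store(self, address):
        if self.in_transaction and address // CACHE_LINE_BYTES in self.lines:
            raise TransactionAbort("read-set conflict")
```

The read set in this model is bounded by the buffer capacity, which is why the approach only suits relatively small transactions.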
Rolling back the register state for a transaction is quite simple for out-of-order microprocessors with branch predictors. Speculative execution already requires undoing changes to the register file (e.g. loading a data value from memory to a register) when a branch is mispredicted. The Re-Order Buffer naturally tracks the old values of the architectural registers. If the processor forces all instructions to retire before starting a transaction, that creates a clean copy of the architectural registers, which can be reloaded after a pipeline clear.
Handling store instructions requires a few more changes. As with the load buffer, the store buffer must be emptied before a transaction begins, and an overflow must cause an abort. Stores are locked in the buffer and are not allowed to write back to the L1D until the end of a transaction. The store buffer can safely hold the WS, and if an abort occurs, the contents are easily discarded.
However, the store buffer must be snooped by remote loads and stores to track the WS and detect conflicts. This is a substantial change and involves a fair bit of overhead. The store buffer is already on the critical path in Sandy Bridge, because it must be checked by every load (potentially twice per cycle). Adding coherency traffic to the buffer would cost additional power and area.
Fortunately, committing or aborting a transaction is relatively easy. The x86 LOCK prefix forces all earlier memory accesses to become globally visible. The store buffer writes the WS data to the L1D cache, and the load and store buffers can be safely emptied. Additionally, the commit would unlock the ROB and could retire any instructions in the pipeline to update the register file with data from the RS. Aborting a transaction is equivalent to handling a mispredicted branch by a machine clear, which flushes all the out-of-order buffers.
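The store-side behavior described above can be sketched the same way: stores are held in the buffer, remote accesses are checked against it, and the contents are either drained to the L1D on commit or discarded on abort. The class `TxStoreBuffer`, its method names, the dictionary standing in for the L1D, and the default capacity are all assumptions made for illustration.

```python
CACHE_LINE_BYTES = 64  # assumed granularity for conflict detection


class TxStoreBuffer:
    def __init__(self, capacity=36):
        self.capacity = capacity
        self.pending = []  # (address, data) pairs in program order

    def store(self, address, data):
        """Buffer a transactional store; False means overflow, so abort."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append((address, data))
        return True

    def conflicts_with(self, remote_address):
        """Model a snoop from a remote load or store against the WS."""
        line = remote_address // CACHE_LINE_BYTES
        return any(addr // CACHE_LINE_BYTES == line
                   for addr, _ in self.pending)

    def commit(self, l1d):
        # Drain oldest first so a later store to the same address wins.
        for address, data in self.pending:
            l1d[address] = data
        self.pending.clear()

    def abort(self):
        # Nothing reached the cache, so discarding the entries is enough.
        self.pending.clear()
```

A short usage example: after `sb.store(0x40, 5)`, calling `sb.abort()` leaves the `l1d` dictionary untouched, while `sb.commit(l1d)` makes the value visible in one step.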