Weighing the Options
The alternative MOB-based TSX implementation is quite different from a cache-based scheme, with a variety of advantages and disadvantages.
In some sense, it is simpler. Rather than adding new capabilities to the caches, only a few changes are needed to the MOB and ROB, confining the hardware changes and added complexity to the processor core.
From a programmer's standpoint, it is also much easier to reason about. Cache-based systems have potential limits due to associativity, and certain access patterns will cause transaction aborts due to cache contention. In a MOB-based design, as long as transactions are kept smaller than the capacity of the ROB and MOB, the data access patterns and contention do not matter at all. Realistically, the limits on the size of MOB-based transactions might be smaller (i.e. fewer total loads and stores) than in a cache-based TM, but the predictability is a huge benefit.
The downside is that while the MOB-based TM might be less complex, that complexity is in the worst possible place. The ROB and MOB are power-hungry, timing-critical structures. Since they can be accessed multiple times every cycle, latency is a critical issue, and adding extra ports to the MOB to handle coherency traffic is quite likely to be expensive. In contrast, the L1D and especially the L2 are less timing sensitive. Moreover, if the TM system has a bug, it is easier to disable extra cache functionality than to fix the MOB and ROB.
Another factor is that the MOB-based TM would yield relatively bursty write traffic to the L1D. At the end of each transaction, the processor would need to write the entire WS to the cache, potentially draining 48 or more stores. This would keep the cache busy and add latency to every transaction commit.
Overall, Haswell is more likely to use the cache-based TM system. It is a much less risky implementation choice. Transactional memory is just as complex as simultaneous multi-threading, and Intel's SMT was specifically designed so that it could be disabled in early versions (e.g. 180nm and early 130nm variants of the P4). Intel's architects would want a similar sort of 'off switch' on TSX, and that is much easier when using the caches. It is also possible that some of the cache enhancements could synergize with the new AVX gather instructions. Finally, two of Intel's patent applications match up nicely with a cache-based system, even using similar instruction mnemonics to TSX.
However, it is important to note that the MOB-based and cache-based TM systems are largely orthogonal. Either option is perfectly valid on its own, and the two could be combined. A combined system would be ideal, using the MOB to eliminate associativity issues for small transactions while letting large transactions spill into the cache, avoiding bursty writebacks at the commit point.
Intel’s x86 design team favors incremental improvements over revolutionary changes that might flop (e.g. Itanium, Pentium 4). In all likelihood, a cache-based implementation of TSX for Haswell is a starting point on the roadmap, with Skylake or another Haswell successor probably slated to use a combined TM system.
Simply providing a minimal implementation in Haswell will give software developers a huge impetus to explore the TM design and take advantage of TSX. Then Intel’s architects can incorporate feedback from the software community into future versions. This is highly consistent with Intel’s business needs (low risk for the mainstream x86 products) and design philosophy (continuous evolution).
In summary, a variety of factors make a MOB-based implementation of TSX relatively unlikely for Haswell. However, the trade-offs are worth understanding: it is a viable option, and those same factors suggest that a combined TM system would offer the best of both worlds and may show up in future x86 processors.