Two of my personal areas of interest and expertise are speculative multithreading (SpMT) and transactional memory (or TM). Both are techniques designed to make multi-core processors and parallel programming more amenable to developers. For several years, I was the co-founder of Strandera, a start up that was developing speculative multithreading, based on transactional memory and dynamic binary translation. Strandera was a wonderful opportunity for me to meet many of the leading academic and commercial researchers in these areas and a number of x86 microprocessor architects over the last 5 years.
A while back, my friend and colleague Andreas Stiller reported a rumor that Intel’s upcoming Haswell microarchitecture would feature transactional memory. This rumor proved to be true recently, when Intel confirmed that Haswell included Transactional Synchronization Extensions, or TSX. This was incredibly exciting for me, and I am happy to take advantage of my expertise and experience to discuss these innovations.
One of the most profound challenges for modern software developers is the emergence of multi-core processors. Over the last 10 years, individual CPU cores have grown faster, but applications must be optimized for parallel execution to take full advantage of the power and performance benefits of Moore’s Law. Previously, parallel programming techniques were mostly used in scientific computing or mission critical and performance sensitive workloads like databases and ERP systems. The multi-core trend meant that developers working on consumer applications were thrust into a world of multithreaded programming and concurrency for the first time.
Transactional memory is a software technique that simplifies writing concurrent programs. TM draws on concepts first developed and established in the database community, which has been dealing with concurrency for roughly 30 years. The idea is to declare a region of code as a transaction. A transaction executes and atomically commits all the results to memory (when the transaction succeeds) or aborts and cancels all the results (if the transaction fails). The key for TM is to provide the Atomicity, Consistency and Isolation qualities that make databases and SQL accessible to ordinary developers. These transactions can safely execute in parallel, which replaces existing painful and bug-prone techniques such as locks and semaphores. There is also a potential performance benefit. Locks are pessimistic and assume that the locking thread will write to the data, so the progress of other threads is blocked. Two transactions which access locked value can proceed in parallel, and a rollback only occurs if one of the transactions writes to the data.
Speculative multithreading (SpMT) focuses on using multiple hardware threads (or multiple cores) to work together and accelerate a single software thread. In essence, the single software thread is speculatively split into multiple threads that can be executed in parallel. Transactions are a natural fit for speculative threads, since it offers an easy way to rollback incorrect speculation. The key advantage of SpMT is that it enables existing single threaded code (which is the vast majority of all software) to reap the benefit of multi-core processors.
Both SpMT and especially TM have a long and distinguished history in academia, which I am exquisitely familiar with from my work at Strandera. TM was first popularized in a paper by Maurice Herlihy and Eliot Moss in 1993, well before multi-core processors. SpMT dates back to Guri Sohi’s Multiscalar project in 1995. Over time both of these areas become very popular for research, and nearly every computer architecture department has projects looking at these two topics. Around 2000, researchers began to look at TM as a technique for multi-core programming and many realized that it was a natural fit with SpMT.
Software TM has been available for a number of years, and many programmers have played around with TM libraries. However, the performance overhead is crippling large, 2-10× slowdowns are common, which precludes widespread use. For TM to be more than a research toy, hardware acceleration and support is necessary.
A technique related to transactional memory, called lock elision, was pioneered by Ravi Rajwar and James Goodman in 2001. Rather than enabling transactions, the idea was to design hardware to handle locks more efficiently. If two threads are predicted to only read locked data, the lock is removed (or elided) and the two threads can execute in parallel. Similar to TM, if a thread writes to the data, the processor must rollback and re-execute using locks for correctness. Lock elision has the advantage of being easily integrated with legacy code, whereas TM requires substantial changes to software.
One of the earliest implementations of transactional memory was the gated store buffer used in Transmeta’s Crusoe and Efficeon processors. However, this was only used to facilitate speculative optimizations for binary translation, rather than any form of SpMT or exposing it directly to programmers. Azul Systems also implemented hardware TM to accelerate their Java appliances, but this was similarly hidden from outsiders. For all intents and purposes, TM mostly languished as merely another interesting idea in academia.
Fortunately, Sun Microsystems decided to implement hardware TM and a limited form of SpMT in the high-end Rock microprocessor. Marc Tremblay, the chief architect, discussed this in papers and talks at numerous conferences. The TM was fairly modest and Sun’s developers showed that it could be used for lock elision and more complex hybrid TM systems, where transactions are handled with a combination of hardware and software. Unfortunately, Rock was cancelled by Sun in 2009. While products never made it out, a number of prototype systems were available to researchers, which was a tremendous help to the community.
In 2009, AMD proposed the Advanced Synchronization Facility, a set of x86 extensions that provide a very limited form of hardware TM support. The goal was to provide hardware primitives that could be used for higher level synchronization such as software TM or lock-free algorithms. However, AMD has not announced whether ASF will be used in products, and if so, in what timeframe.
More recently, IBM announced in 2011 that Blue Gene/Q had hardware support for both TM and SpMT. The TM could be configured in two modes. The first is an unordered and single version mode where a write from one transaction causes a conflict with any transactions reading the same memory address. The second mode is for SpMT and is an ordered, multi-versioned TM. Speculative threads can have different versions of the same memory address, and hardware keep tracks of the age of each thread. The younger threads can access data from older threads (but not the other way around), and writes to the same address are based on thread order. In some cases, dependencies between threads can cause the younger versions to abort.
The most recent development is of course, Intel’s TSX and the implementation in Haswell. Haswell is the first x86 processor to feature hardware transactional memory. Intel’s TSX specification describes how the TM is exposed to programmers, but withholds details on the actual TM implementation. The first section of this article discusses the software interfaces for Intel’s TM. The second section builds on this to analyze the likely implementation details of Intel’s TM, based on my experience.
Discuss (8 comments)