TM granularity and versioning

Article: Haswell Transactional Memory Alternatives
By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), November 21, 2012 9:27 am
Room: Moderated Discussions
Maynard Handley (name99.delete@this.redheron.com) on November 20, 2012 10:52 pm wrote:
> "Intel’s TSX tracks the read-set and write-set at cache line (64B) granularity during a transaction.
> An RS conflict occurs if a cache line in the read-set is written by another thread."
>
> Is it really necessary to do things this way?

No, but it is convenient. Using cache line granularity allows transactional read-set/write-set tracking to be layered on top of the existing cache coherence protocol with relatively little modification.

Fine-grained coherence would also tend to increase coherence traffic, though I suspect that complexity factors are more significant in discouraging the use of fine-grained coherence.
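To make the granularity point concrete, here is a rough C sketch (purely illustrative; the names and the tiny list-based read set are my own, not anything Intel has described) of why 64B-line tracking falls out of the coherence machinery almost for free: conflict detection reduces to comparing line addresses, the same unit that invalidations already carry.

/* Illustrative sketch only: line-granularity read-set conflict check. */
#include <stdint.h>
#include <stdbool.h>

#define LINE_SHIFT 6                       /* 64-byte cache lines */

static inline uintptr_t line_of(const void *p)
{
    return (uintptr_t)p >> LINE_SHIFT;
}

struct read_set {
    uintptr_t lines[64];                   /* line addresses read so far (tiny, illustrative) */
    int       count;
};

/* An incoming invalidation (another core's write) conflicts if it hits
   any line in the read set, regardless of which bytes were written. */
static bool rs_conflict(const struct read_set *rs, const void *written)
{
    uintptr_t line = line_of(written);
    for (int i = 0; i < rs->count; i++)
        if (rs->lines[i] == line)
            return true;
    return false;
}

Byte- or word-granularity tracking would instead have to carry finer address information than the coherence protocol normally exchanges, which is part of the complexity cost mentioned above.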

Memory versioning would be more complex than the simple implementation presented in the article. If version numbers are assigned strictly at the beginning of a transaction, transactions would have to commit in the order they started, which could substantially delay the commit of simple transactions that are completely unrelated. (One could abort and retry a lagging transaction to allow other transactions to commit, but that would add complexity, especially if the implementation were optimized to allow longer transactions to work well--e.g., by using runahead to prefetch data [and code].)
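As a toy software illustration of that ordering constraint (invented names, not a description of any hardware scheme): if versions are handed out at begin time and commits must retire in version order, a short transaction is stuck behind every longer one that started earlier, even when their data sets are disjoint.

/* Illustrative sketch only: begin-time versions force in-order commit. */
#include <stdatomic.h>

static atomic_uint next_version  = 1;   /* handed out at transaction begin */
static atomic_uint commit_cursor = 1;   /* next version allowed to commit  */

static unsigned tx_begin(void)
{
    return atomic_fetch_add(&next_version, 1);
}

static void tx_commit(unsigned my_version)
{
    /* Even a tiny, unrelated transaction waits here until every
       earlier-numbered transaction has committed (or been aborted). */
    while (atomic_load(&commit_cursor) != my_version)
        ;                                   /* spin */
    /* ... make this transaction's writes visible ... */
    atomic_fetch_add(&commit_cursor, 1);
}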

Resolving transactions into a single consistent sequence is also difficult. If a later transaction speculatively uses a result from an earlier transaction that is then aborted, the later transaction would have to be rolled back at least to the point of that speculation (which might force other dependent transactions to partially or fully roll back).

Memory versioning would also work more effectively with at least some write-update coherence (to allow later-version transactions with dependences to make progress). I suspect there are good reasons why write-update is not common in processors even though it can improve performance. (Predicting when and where to send updates is no doubt challenging. Matching the timing of updates with the memory consistency model might also add complexity.)

I suspect later implementations of TSX might break the initial specification by allowing fine-grained tracking at least for small transactions. (I doubt that much software will be written that depends on cache-line granularity. [I think that Intel should have specified TSX differently, but I suspect that a technically incompatible change would not be problematic.]) I also suspect that some form of versioning/transaction ordering will be developed in later implementations.

For now, getting an implementation out and in use is more important than providing an implementation with all the bells and whistles. It may turn out that the most cost-effective microarchitectural changes are different from what current knowledge would indicate.

> What I have in mind is something like this:
> - to save memory most code packs a variety of related variables together (ie nearby in the same cache line)
> - these sorts of schemes (track by the cache line) mean that the optimistic viewpoint (transaction goes through
> without having to roll back) holds IF either no-one else wants to touch this data at the same time OR someone else
> does touch the data at the same time, but everything they touch lives in a different cache block.
>
> These strike me as unsatisfactory conditions. The first requires the lock not to be too busy, the second
> requires that some thought was applied (along with coaxing the compiler in some way) to try to segregate
> different blocks of data into different cache lines. Both of these are inimical to the primary goal
> here, which is that we want to be able to write highly threaded code with just a single BGL (big giant
> lock) which protects pretty much everything, and still have it run fast. (Ie we require our programmers
> to know enough to know that they should protect shared data with the BGL or equivalent, but we don't
> require that they have to micromanage tons of small locks for efficiency's sake.)

It seems you are misunderstanding how BGLs would work in a TM system. The lock itself is placed in the read set (not the write set), so it can be broadly shared without conflict. (Admittedly, when there is a conflict, the BGL will probably be written [conflict back-off strategies might be attempted first] and everything is serialized.)
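For concreteness, here is roughly how eliding a big giant lock looks through the RTM intrinsics (a simplified sketch; the spinlock representation and the fall-back-on-first-abort policy are my own simplifications, real elision code retries and counts aborts): the lock word is only read inside the transaction, so it lands in the read set and any number of threads can elide it concurrently; only the fallback path ever writes it.

/* Illustrative sketch only; compile with -mrtm. */
#include <immintrin.h>          /* _xbegin, _xend, _xabort */
#include <stdatomic.h>

static atomic_int bgl = 0;      /* 0 = free, 1 = held (illustrative spinlock) */

static void bgl_acquire(void)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(&bgl, &expected, 1))
        expected = 0;
}

static void bgl_release(void)
{
    atomic_store(&bgl, 0);
}

void critical_section(void (*work)(void))
{
    if (_xbegin() == _XBEGIN_STARTED) {
        if (atomic_load(&bgl) != 0)     /* someone holds the lock for real */
            _xabort(0xff);
        work();                         /* runs transactionally; lock stays in the read set */
        _xend();
        return;
    }
    /* Transaction aborted (conflict, capacity, ...): take the lock for real.
       This write to the lock word is what serializes everyone, as noted above. */
    bgl_acquire();
    work();
    bgl_release();
}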

Cache line granularity would seem to be fairly close to object-level granularity. The probability of conflicting false sharing at the object level would seem to be relatively low: most cache lines would include data from no more than a few different objects, and the chance that a conflict occurs in those specific objects might be relatively small given a largish set of data, modest parallelism, and generally short transactions.

Optimizations for avoiding coherence false sharing (i.e., avoiding forced reacquisition of a cache line when another thread writes an unrelated memory location that happens to share the line) would apply equally to avoiding false conflicts in TM, so cache-line granularity of transaction tracking may not be that problematic.
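The same layout discipline serves both purposes: anything already padded or aligned to dodge coherence false sharing also dodges spurious transactional conflicts. A trivial sketch (assuming 64-byte lines, as on current x86 parts):

/* Illustrative sketch only: per-thread data padded to its own line. */
#include <stdalign.h>

struct per_thread_counter {
    alignas(64) unsigned long value;    /* each counter gets its own 64B line */
};

struct counters {
    struct per_thread_counter a;        /* written by thread 0 */
    struct per_thread_counter b;        /* written by thread 1: different line, no false conflict */
};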