By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), August 22, 2012 9:41 am
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on August 22, 2012 10:20 am wrote:
>
> Many critical sections are small--intentionally so because a long
> critical section is likely to reduce parallelism. Without additional (complex)
> hardware to order transactions so that they do not conflict, most large
> transactions might be likely to fail.
Many critical sections are really small, because they may be implementing things like atomic hash chain updates. So the lock may be protecting literally just a small handful of instructions - yet the sequence touches more than just one or two words, so ll/sc wouldn't work.
That's where people who try to do scaling at any cost will have per-hashchain locks etc, but that eats memory and can cause other complexity problems (locking when moving things between hashchains atomically etc). So many normal programmers will rely on some generic hash library that probably has just a single lock for all the accesses.
That's a wonderful use-case for hardware lock elision. Keep the lock cacheline shared across CPUs, no need for the complexities of multiple locks, things "just work". And I think it's a fairly common case.
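To make the single-lock case concrete, here is a minimal sketch of a chained hash table guarded by one mutex. The names (`struct table`, `table_insert`, `table_lookup`, `NBUCKETS`) are hypothetical and not taken from any real library. The point is that the critical section is only a few instructions long, yet it updates several words (the node's `next` pointer, the bucket head, and a count), which is more than a single ll/sc pair covers.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NBUCKETS 64              /* hypothetical table size */

struct node {
    unsigned key;
    int value;
    struct node *next;
};

struct table {
    pthread_mutex_t lock;        /* one lock for every bucket */
    struct node *bucket[NBUCKETS];
    size_t count;
};

static void table_init(struct table *t)
{
    pthread_mutex_init(&t->lock, NULL);
    for (size_t i = 0; i < NBUCKETS; i++)
        t->bucket[i] = NULL;
    t->count = 0;
}

/* Push a caller-allocated node onto the front of its hash chain. */
static void table_insert(struct table *t, struct node *n)
{
    assert(t != NULL && n != NULL);
    struct node **head = &t->bucket[n->key % NBUCKETS];

    pthread_mutex_lock(&t->lock);
    n->next = *head;             /* word 1: the new node's link */
    *head = n;                   /* word 2: the bucket head     */
    t->count++;                  /* word 3: the element count   */
    pthread_mutex_unlock(&t->lock);
}

/* Return the first node with a matching key, or NULL if none exists. */
static struct node *table_lookup(struct table *t, unsigned key)
{
    struct node *n;

    pthread_mutex_lock(&t->lock);
    for (n = t->bucket[key % NBUCKETS]; n != NULL; n = n->next)
        if (n->key == key)
            break;
    pthread_mutex_unlock(&t->lock);
    return n;
}
```

Without elision, every caller serializes on `t->lock`. With elision, lookups and inserts that touch different chains could in principle run concurrently while the source stays exactly as written. One caveat of this particular sketch: the shared `count` is written by every insert, so concurrent inserts would still conflict on it.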
More importantly, even if ll/sc could work in some particular case, realistically people wouldn't actually use ll/sc, because it's basically impossible to do portably, and it's not generally amenable to the compiler doing it automatically. In contrast, lock elision with small transactions is amenable to doing automatically from portable source code.
Note: "portable" here doesn't necessarily mean "works across many architectures". It can mean "works together with code that is written to work for a previous generation of the CPU that didn't have the transactional capability". So when I say "portable", I don't mean the Linux kind of "portable to the 30 different architectures we support", I mean the Windows or OSX kind of "we can use the same codebase without major pain for different versions of Intel chips".
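As one illustration of "same codebase, different CPU generations", here is a hedged sketch of a spinlock that carries elision hints. It assumes GCC's x86-specific `__ATOMIC_HLE_ACQUIRE` and `__ATOMIC_HLE_RELEASE` memory-model flags. The names `elided_lock`, `ELIDE_ACQ` and `ELIDE_REL` are hypothetical. Where the compiler does not define the flags, the hints compile away to nothing. Where the CPU does not implement elision, the emitted prefixes are ignored and this behaves as an ordinary spinlock.

```c
#include <assert.h>
#include <stddef.h>

/* Use GCC's x86 elision hints when the compiler provides them. */
#if defined(__ATOMIC_HLE_ACQUIRE) && defined(__ATOMIC_HLE_RELEASE)
#define ELIDE_ACQ __ATOMIC_HLE_ACQUIRE
#define ELIDE_REL __ATOMIC_HLE_RELEASE
#else
#define ELIDE_ACQ 0              /* no hint available: plain spinlock */
#define ELIDE_REL 0
#endif

typedef struct {
    int held;                    /* 0 = free, 1 = taken */
} elided_lock;

static void elided_lock_acquire(elided_lock *l)
{
    assert(l != NULL);
    /* The exchange carries the elision hint when one is available. */
    while (__atomic_exchange_n(&l->held, 1, __ATOMIC_ACQUIRE | ELIDE_ACQ)) {
        /* Lock is really held: wait on plain loads before retrying. */
        while (__atomic_load_n(&l->held, __ATOMIC_RELAXED))
            ;
    }
}

static void elided_lock_release(elided_lock *l)
{
    assert(l != NULL);
    __atomic_store_n(&l->held, 0, __ATOMIC_RELEASE | ELIDE_REL);
}
```

Callers just bracket their critical section with `elided_lock_acquire` and `elided_lock_release`. Nothing in the calling code depends on whether the hardware actually elides the lock.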
Linus
Topic | Posted By | Date |
---|---|---|
Article: Haswell TM Alternatives | David Kanter | 2012/08/21 09:17 PM |
Article: Haswell TM Alternatives | Håkan Winbom | 2012/08/21 11:52 PM |
Article: Haswell TM Alternatives | David Kanter | 2012/08/22 01:06 AM |
Article: Haswell TM Alternatives | anon | 2012/08/22 08:46 AM |
Article: Haswell TM Alternatives | Linus Torvalds | 2012/08/22 09:16 AM |
Article: Haswell TM Alternatives | Doug S | 2012/08/24 08:34 AM |
AMD's ASF even more limited | Paul A. Clayton | 2012/08/22 09:20 AM |
AMD's ASF even more limited | Linus Torvalds | 2012/08/22 09:41 AM |
Compiler use of ll/sc? | Paul A. Clayton | 2012/08/28 09:28 AM |
Compiler use of ll/sc? | Linus Torvalds | 2012/09/08 12:58 PM |
Lock recognition? | Paul A. Clayton | 2012/09/10 01:17 PM |
Sorry, I was confused | Paul A. Clayton | 2012/09/13 10:56 AM |
Filter to detect store conflicts | Paul A. Clayton | 2012/08/22 09:19 AM |
Article: Haswell TM Alternatives | bakaneko | 2012/08/22 02:02 PM |
Article: Haswell TM Alternatives | David Kanter | 2012/08/22 02:45 PM |
Article: Haswell TM Alternatives | bakaneko | 2012/08/22 09:56 PM |
Cache line granularity? | Paul A. Clayton | 2012/08/28 09:28 AM |
Cache line granularity? | David Kanter | 2012/08/31 08:13 AM |
A looser definition might have advantages | Paul A. Clayton | 2012/09/01 06:29 AM |
Cache line granularity? | rwessel | 2012/08/31 07:54 PM |
Alpha load locked granularity | Paul A. Clayton | 2012/09/01 06:29 AM |
Alpha load locked granularity | anon | 2012/09/02 05:23 PM |
Alpha pages groups | Paul A. Clayton | 2012/09/03 04:16 AM |
An alternative implementation | Maynard Handley | 2012/11/20 09:52 PM |
An alternative implementation | bakaneko | 2012/11/21 05:52 AM |
Guarding unread values? | Paul A. Clayton | 2012/11/21 08:39 AM |
Guarding unread values? | bakaneko | 2012/11/21 11:25 AM |
TM granularity and versioning | Paul A. Clayton | 2012/11/21 08:27 AM |
TM granularity and versioning | Maynard Handley | 2012/11/21 10:52 AM |
Indeed, TM (and coherence) has devilish details (NT) | Paul A. Clayton | 2012/11/21 10:56 AM |