By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), December 4, 2014 2:34 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on December 4, 2014 12:05 pm wrote:
[snip]
> On x86 (which, again, is the only architecture you can actually compare these approaches),
> the numbers seem to be that if you update counters, the atomic RMW "add" model gets
> about twice the progress over a "load+add+cmpxchg" model. Twice.
That is an interesting datum, but it is not clear that it represents how even a "reasonable effort" LL/SC implementation would perform (much less one that provides strong guarantees). It might even understate the difference. (Of course, "reasonable effort" is not well defined. I suspect current implementations of LL/SC are generally "low effort", if only because optimizing other areas seems more profitable to the architects.)
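For concreteness, the two counter-update styles being compared can be sketched in portable C++ as below. This is only an illustration and not the benchmark behind the quoted numbers. The names `add_rmw`, `add_cas_loop`, and `hammer` are hypothetical, and the relaxed memory ordering is an assumption suited to a plain statistics counter. On x86, mainstream compilers typically lower `fetch_add` to a single locked add/xadd and the loop to a load plus `lock cmpxchg`; on LL/SC architectures without dedicated atomic instructions, both forms typically become LL/SC retry loops.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Atomic RMW model: one indivisible read-modify-write operation.
inline void add_rmw(std::atomic<std::uint64_t>& counter, std::uint64_t delta) {
    counter.fetch_add(delta, std::memory_order_relaxed);
}

// "load+add+cmpxchg" model: read, compute, then publish only if nobody
// else changed the value in between; otherwise retry.
inline void add_cas_loop(std::atomic<std::uint64_t>& counter, std::uint64_t delta) {
    std::uint64_t seen = counter.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak stores the current value into `seen`,
    // so the next attempt recomputes seen + delta from fresh data.
    while (!counter.compare_exchange_weak(seen, seen + delta,
                                          std::memory_order_relaxed,
                                          std::memory_order_relaxed)) {
    }
}

// Hypothetical helper: `threads` workers each apply `op` to a shared counter
// `iters` times; returns the final counter value.
template <typename Op>
std::uint64_t hammer(Op op, unsigned threads, std::uint64_t iters) {
    assert(threads > 0);
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, op, iters] {
            for (std::uint64_t i = 0; i < iters; ++i) op(counter, 1);
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

Calling `hammer(add_rmw, 4, 10000)` and `hammer(add_cas_loop, 4, 10000)` should both return 40000. The two forms differ in how much work is wasted under contention (the retry loop can fail and repeat), not in the result, and timing them on one machine says nothing about how a tuned LL/SC implementation would behave.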
[snip]
> And yes, you can in theory make LL/SC or load/cmpxchg work like a RMW by noticing the pattern and basically
> turning the bad LL/SC model into an RMW model by generating macro-instructions. And I actually think it's not
> a bad idea. But even if you do that, you basically have to first admit the superiority of the RMW model.
Using idiom recognition to give a LL/SC sequence the same behavior as a RMW instruction is not the same as admitting the architectural superiority of the RMW model. LL/SC facilitates simpler low-end implementations (at the cost of more expensive high-end implementations), allows more flexible use, and provides some ability to extend the architecture without adding instructions. The flexibility comes from not limiting use to a single operation, which is one basic advantage of the RISC bias toward "primitives not solutions". (Of course, this also tends to come with the RISC disadvantages of lower code density and less clearly communicated higher-level intent.) In theory, a carefully architected form of LL/SC could be extended to a reasonably full-featured transactional memory interface with fewer new instructions. (I suspect a simple single-cache-block reservation implementation could be extended to support multiple writes within the cache-block granule without extreme difficulty. Providing the additional reservation on any cache eviction/invalidation, as with Cliff Click's "I wanna bit", would seem to be a further modest extension; if only the reserved block is accessed by the thread, then the other reservation could be ignored.)
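The "more flexible use" point can be illustrated with an update that has no single RMW instruction on common ISAs, such as a saturating increment. C++ does not expose LL/SC directly, so the sketch below assumes a compare-exchange retry loop as a stand-in. On LL/SC machines `compare_exchange_weak` is typically built from an LL/SC pair, although a CAS loop compares values rather than holding a reservation and so is not an exact equivalent. The names `atomic_update` and `saturating_inc` are hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: atomically replace the cell's value v with f(v),
// for any side-effect-free f. Returns the value observed before the update.
template <typename F>
std::uint32_t atomic_update(std::atomic<std::uint32_t>& cell, F f) {
    std::uint32_t seen = cell.load(std::memory_order_relaxed);
    while (!cell.compare_exchange_weak(seen, f(seen),
                                       std::memory_order_acq_rel,
                                       std::memory_order_relaxed)) {
        // `seen` now holds the freshly observed value; f is re-applied on retry.
    }
    return seen;
}

// Example operation with no dedicated RMW instruction on common ISAs:
// increment, but never past `limit`.
inline std::uint32_t saturating_inc(std::atomic<std::uint32_t>& cell,
                                    std::uint32_t limit) {
    return atomic_update(cell, [limit](std::uint32_t v) {
        return v < limit ? v + 1 : v;
    });
}
```

With a cell holding 2 and a limit of 3, the first `saturating_inc` returns 2 and leaves 3 in the cell; a second call returns 3 and leaves the cell unchanged. An ISA built around fixed RMW operations would need a new instruction (or a fallback to a loop like this one) for each such variant, which is the trade-off the paragraph above describes.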
The tradeoffs seem to be a bit more complex than "RMW is clearly superior".