By: Carson (carson.delete@this.example.edu), April 2, 2021 1:11 am
Room: Moderated Discussions
anon2 (anon.delete@this.anon.com) on March 31, 2021 11:03 pm wrote:
> But that's not what TSX is up against. This has the same smell as those early stupid GPGPU benchmarks
> that tried to claim 10x, 100x, 1000x performance improvement over CPU despite not even a single component
> of the GPU having better than 10x advantage over the CPU (FLOPS, memory bandwidth, etc), and many things
> being worse (BW and latency to system memory, instruction latencies, branches, performance per thread
> etc). Most of their speedups came because they completely rewrote the algorithms for vectorization
> and parallelism for the GPU and did not re-test the modern baseline on the CPU.
This is what Linus recently called "optimizing bubblesort", and people have been using it for decades to sell flocks of their particular chickens, from Transputers to SPARC T-series.
It's easy to get great parallel speedup if you hobble the uniprocessor with a shitty algorithm.
Likewise, as you say, it's easy to make your hardware lock elision shine if you encumber the implementation with lots of elidable locks.
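For concreteness, here is roughly the shape of lock elision with Intel's RTM intrinsics; a minimal sketch only, with an invented spinlock and counter standing in for the real lock and critical section, and with the usual retry-a-few-times-before-fallback logic left out:

/* Minimal lock-elision sketch using Intel RTM intrinsics
   (immintrin.h, compile with -mrtm).  The spinlock and counter
   are hypothetical, just to show the shape of the pattern. */
#include <immintrin.h>
#include <stdatomic.h>

static atomic_int lock;   /* 0 = free, 1 = held (fallback path only) */
static long counter;      /* shared state protected by the lock */

static void lock_acquire(void)
{
    while (atomic_exchange_explicit(&lock, 1, memory_order_acquire))
        while (atomic_load_explicit(&lock, memory_order_relaxed))
            ;             /* spin until it looks free again */
}

static void lock_release(void)
{
    atomic_store_explicit(&lock, 0, memory_order_release);
}

void update(void)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Read (but never write) the lock word: if someone really
           takes the lock, the conflict aborts this transaction. */
        if (atomic_load_explicit(&lock, memory_order_relaxed))
            _xabort(0xff);
        counter++;        /* the elided critical section */
        _xend();
        return;
    }
    /* Transaction aborted (conflict, capacity, interrupt, ...):
       fall back to really taking the lock. */
    lock_acquire();
    counter++;
    lock_release();
}

Because the elided path only reads the lock word, independent critical sections can commit in parallel; and because every transaction does read it, a real acquisition on the fallback path aborts them all, so mutual exclusion is preserved.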
> The modern baseline for a central highly contended b-tree is using RCU (or similar) for persistence guarantees
> and fine grained (per node) sequence counters for validity/atomicity of lookup vs updates, and fine grained
> locking pushed down to the lowest common nodes to protect updates from one another (or possibly for simplicity
> a global lock for rebalancing and leaf-node locks for non-rebalancing updates, but b-trees are quite amenable
> to having fine-grained rebalance operations, because split and join frequency drops by an order of magnitude
> or two as you move up levels, if parameters are chosen correctly.)
>
> So now what the naive TSX implementation is up against is something with zero atomic operations on the lookup-side,
> compared with what is quite a heavy transaction begin/end in Intel implementations. So TSX is slower right out
> of the gate, which is a big problem (in theory you would think an implementation might be able to make it almost
> zero cost, although complexity and cost is obviously then hidden in the microarchitecture).
You're not wrong, but...
RCUing a non-trivial data structure is decidedly non-trivial. It took years to get lock-free dcache access in the Linux kernel and that complexity imposes an ongoing maintenance cost because it forces complex interfaces on all the surrounding code.
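To give a flavour of "non-trivial": even the simplest building block of that lockless fast path, a per-node sequence counter of the kind described in the quote above, already drags in memory-ordering subtleties and retry loops. A generic C11 sketch, not the actual kernel code, with the writer assumed to be serialized by something else (e.g. a per-node lock):

/* Generic seqcount sketch in C11 atomics -- not the dcache code.
   The reader issues only plain loads, never a locked RMW or store;
   writers are serialized externally, and readers retry if a writer
   raced with them. */
#include <stdatomic.h>

struct node {
    atomic_uint seq;      /* even = stable, odd = write in progress */
    atomic_int  value;    /* the payload a lookup wants */
};

void node_set(struct node *n, int value)
{
    unsigned s = atomic_load_explicit(&n->seq, memory_order_relaxed);
    atomic_store_explicit(&n->seq, s + 1, memory_order_relaxed); /* odd */
    atomic_thread_fence(memory_order_release);  /* odd before data write */
    atomic_store_explicit(&n->value, value, memory_order_relaxed);
    atomic_store_explicit(&n->seq, s + 2, memory_order_release); /* even */
}

int node_get(struct node *n)
{
    unsigned s1, s2;
    int v;
    do {
        s1 = atomic_load_explicit(&n->seq, memory_order_acquire);
        v  = atomic_load_explicit(&n->value, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire); /* data before re-check */
        s2 = atomic_load_explicit(&n->seq, memory_order_relaxed);
    } while ((s1 & 1) || s1 != s2);   /* writer active or raced: retry */
    return v;
}

And that is the easy part; the hard part is doing it for a whole tree with rebalancing, which is where the years of effort go.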
Remember, transactional memory started life as a software technique for writing complex locking code without multiplicative (exponential) complexity blowup.
A hardware technique that makes such code easier to write and maintain, even at the cost of ultimate performance, is valuable. It can even give you a net performance win by making it practical to parallelize software that otherwise wouldn't be worth the bother.
(This is a variant of the "why use compilers when hand-tuned assembler is faster?" argument.)
Second, not all data structures can be RCUed or otherwise given a lock-free fast path. This is the limit case of the preceding problem. Sometimes the semantics you need just don't lend themselves to RCU.
It's not at all clear that the SAP programmers are stupid or lazy; it's quite plausible that they tried quite hard and couldn't find a better alternative. It's certainly clear that they did not set out to create a star vehicle for TSX.
I don't think either of us really understands the size of the class of problems for which HTM is beneficial.
(Perhaps an intermediate case is a multiprocessor scheduler. The basic design of one priority queue which everyone fights over the head of runs out of steam pretty fast, so a lot of designs are approximations which sacrifice strict prioritization in order to limit contention. If HTM allows more accurate prioritization, that's a win.)
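Purely as a hypothetical sketch of that scheduler case, with the per-CPU queue layout, sizes and helper names all invented: with HTM, the "scan every queue and take the globally best task" step can be a single transaction on the fast path, with the global lock appearing only as the abort fallback:

/* Hypothetical: dequeue the globally highest-priority task across
   per-CPU run queues in one transaction, with a lock-protected
   fallback.  Layout and names are invented for illustration. */
#include <immintrin.h>   /* RTM intrinsics, compile with -mrtm */
#include <stdatomic.h>
#include <stddef.h>

#define NCPU 8

struct task { int prio; struct task *next; };

static struct task *rq_head[NCPU];   /* one run queue per CPU */
static atomic_int   sched_lock;      /* fallback lock, 0 = free */

/* Scan all queues and dequeue the best task.  Callers must make this
   atomic: inside a transaction, or while holding sched_lock. */
static struct task *pick_best(void)
{
    int best_cpu = -1;
    struct task *best = NULL;
    for (int c = 0; c < NCPU; c++)
        if (rq_head[c] && (best == NULL || rq_head[c]->prio > best->prio)) {
            best = rq_head[c];
            best_cpu = c;
        }
    if (best)
        rq_head[best_cpu] = best->next;
    return best;
}

struct task *schedule_next(void)
{
    struct task *t;
    for (int tries = 0; tries < 3; tries++) {
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Subscribe to the fallback lock so we stay mutually
               exclusive with lock holders. */
            if (atomic_load_explicit(&sched_lock, memory_order_relaxed))
                _xabort(0xff);
            t = pick_best();   /* whole scan + dequeue commits atomically */
            _xend();
            return t;
        }
        /* aborted (conflict, capacity, ...): retry a little, then give up */
    }
    while (atomic_exchange_explicit(&sched_lock, 1, memory_order_acquire))
        ;                      /* spin: fallback, global lock for everyone */
    t = pick_best();
    atomic_store_explicit(&sched_lock, 0, memory_order_release);
    return t;
}

Whether the conflict and capacity abort rates make that worthwhile in practice is exactly the open question.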
Jumping back to the anti-HTM side: HTM helps you handle more contention, but if your data structure suffers heavy contention, there are basic speed-of-light limits on the performance you can achieve. I don't see any way around thinking very carefully about locking and contention in any parallel application.
> I would say though that I have less confidence in Intel's implementations of such things
> particularly looking at what Apple have done with M1 atomic operations, so it's entirely
> possible a really good design team could come up with an implementation that's significantly
> better and opens up a more significant niche of legacy code that could benefit from it.
Agreed.