By: ⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com), July 15, 2015 12:54 pm
Room: Moderated Discussions
Dear Linus,
you persistently claim to know what you are talking about. Unfortunately, that's not the case.
In an SMP system with shared memory, one only needs the simple LOCK CAS (aka CMPXCHG) instruction:
LOCK CAS reg, mem
The semantics of this instruction is: if the LOCK CAS succeeds, it needs to be guaranteed that the CPU core performing the CAS has exclusive access to the contents of [mem] while executing the CAS and has sent all modified bytes from its private L1/L2 cache to all other cores in the system. The other cores must have either confirmed back to the sender that they accepted the bytes and will prevent their modification until the CAS completes, or discarded those addresses from their caches and will not repopulate them until the CAS completes. If multiple CAS instructions with the same "mem" are being executed by distinct CPU cores, only one of them will succeed.
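To make it concrete how software would build on this primitive, here is a minimal C11 sketch of a spinlock on top of the standard compare-exchange. The C11 atomics are only an approximation of the LOCK CAS described above (they drag in the memory-ordering machinery I am arguing against), and the names spinlock_t/lock_acquire/lock_release are just illustrative, not any existing API:

#include <stdatomic.h>

/* Minimal sketch: a spinlock built on compare-and-swap. C11
 * atomic_compare_exchange_weak only approximates the LOCK CAS
 * semantics described above. */
typedef struct { atomic_int locked; } spinlock_t;

static void lock_acquire(spinlock_t *l)
{
    int expected = 0;
    /* Spin until the CAS succeeds: 0 -> 1 means we own the lock.
     * On failure, 'expected' is overwritten with the current value,
     * so it has to be reset before retrying. */
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1))
        expected = 0;
}

static void lock_release(spinlock_t *l)
{
    atomic_store(&l->locked, 0);
}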
Implementations can vary in how they realize the LOCK CAS semantics, although most likely there is just one, or at most a couple of, efficient ways to implement it.
If the above synchronizes data only between cores and does not send it to memory as well, add another instruction:
LOCK MCAS reg, mem
This has the same semantics, except that additionally it sends the data to memory and the MCAS cannot complete before the data from the sending core has been written to memory.
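No existing hardware exposes exactly this MCAS, so purely as an illustration: on x86 CPUs that support CLWB, one could approximate "CAS plus write-back to memory" with a compare-exchange followed by a cache-line write-back and a fence. This is an approximation of the proposed semantics, not an implementation of it, and __approx_mcas is a made-up name:

#include <stdatomic.h>
#include <stdbool.h>
#include <immintrin.h>   /* _mm_clwb, _mm_sfence; x86 only, build with -mclwb */

/* Hypothetical sketch: approximate the proposed MCAS as "CAS, then
 * force the modified cache line out towards memory". */
static bool __approx_mcas(atomic_long *mem, long *expected, long desired)
{
    bool ok = atomic_compare_exchange_strong(mem, expected, desired);
    if (ok) {
        _mm_clwb((void *)mem);  /* write the modified line back to memory */
        _mm_sfence();           /* order the write-back before later stores */
    }
    return ok;
}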
A device with DMA access behaves like an extra CPU core adhering to the semantics of LOCK MCAS. Since communication with a device is much slower than memory access, there needs to be a way of narrowing the memory range mapped to the device. Ideally, an HDD/SSD with a maximum queue depth of 64 should be able to have up to 64 4K regions mapped at the same time (along with up to 64 distinct memory addresses referenced directly by the MCAS instruction, living outside of the 4K regions), so that DMA reads/writes can directly target the memory regions where the OS kernel puts, or expects, the 4K of data.
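A hypothetical per-command layout for such a mapping could look roughly like this (all names and field choices are made up, just to show the 64 data regions plus 64 completion addresses):

#include <stdint.h>

#define DEV_QUEUE_DEPTH 64
#define DEV_REGION_SIZE 4096

/* One entry per in-flight command: a 4K data region the device DMAs
 * into/out of, plus the address the device targets with its MCAS to
 * signal completion. */
struct dev_slot {
    uint64_t data_region;      /* physical address of the 4K DMA buffer */
    uint64_t completion_addr;  /* address the device updates via MCAS */
};

struct dev_mapping {
    struct dev_slot slot[DEV_QUEUE_DEPTH];
};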
Obviously, if an MCAS is executing (on a CPU or on a device) while one or more CAS instructions with the same address are executing, and the MCAS wins the race, then all the failed CAS instructions effectively behave as MCAS instructions. On the other hand, if a CAS wins the race, neither the successful CAS nor any of the failed CAS instructions knows for sure whether failed MCAS instructions were executing at the same time (one or more MCAS might have been executing and all of them failed, but none of the CAS instructions knows or cares about this).
Any semantics beyond that is completely superfluous. Weak vs. strong memory ordering discussions - seriously? Those are just unnecessary complications favored by people who prefer overly complex designs over simplicity.
Reads and writes can be reordered in any way, as long as per-core sequential semantics is preserved on every CPU core, and of course as long as the above LOCK CAS/MCAS semantics is preserved across the SMP system.
When distinct cores are doing CAS at non-overlapping memory addresses, there is absolutely no need to preserve total memory order. Picture 8 non-overlapping LOCK CAS instructions executing concurrently on 8 cores, running alongside up to 64 HDD/SSD reads and writes, not to mention ongoing GPU transfers and network data transfers.
Until a LOCK CAS executes on some CPU core, there is no need for modified cache lines to be exclusively held. Distinct CPU cores can have distinct values stored at the same memory address, until a LOCK CAS pseudo-randomly chooses which value survives. Obviously, such situations are programming errors.
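To spell out what such a programming error looks like, here is a trivial illustration of my own: two threads store different plain (non-CAS) values to the same variable, and under the model above each core may keep seeing its own value until something forces a single survivor:

#include <pthread.h>
#include <stdio.h>

/* Illustration only: the kind of racy program the relaxed model
 * deliberately tolerates. Two threads store to the same plain
 * (non-atomic) variable with no CAS involved. */
static long shared = 0;

static void *writer_a(void *arg) { (void)arg; shared = 1; return NULL; }
static void *writer_b(void *arg) { (void)arg; shared = 2; return NULL; }

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, writer_a, NULL);
    pthread_create(&b, NULL, writer_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Which value "survives" is unspecified -- that is the
     * programming error referred to above. */
    printf("shared = %ld\n", shared);
    return 0;
}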
This is the sanest and simplest global memory ordering for SMP. I like it. I say boooo to anything more complicated. So should you.
(For future discussion we can rename LOCK CAS to just CAS, provided that we always keep the above semantics in mind.)
And now for something completely different, but absolutely necessary: distributed atomic queue updates and wakeups. We just need the following function and instructions:
ENQUEUE(mutex, queue_ptr, queue_max, value)
LOCK SLEEP addr // suspend the current core until somebody performs WAKEUP addr
LOCK WAKEUP addr // wake up all cores suspended at addr
ENQUEUE has the following semantics:
again:
    acquire(mutex) // uses CAS/MCAS in a spin loop
    q := *queue_ptr
    if q >= queue_max { release(mutex); goto again; }
    *q := value
    q += 8
    *queue_ptr := q
    release(mutex) // uses CAS/MCAS in a spin loop
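The same thing written out as a C sketch, using a C11 atomic_flag as the mutex (which only approximates the CAS spin loop above; the flag must be initialized with ATOMIC_FLAG_INIT by the caller, and all names are mine):

#include <stdatomic.h>
#include <stdint.h>

/* C sketch of the ENQUEUE semantics above. The queue is a region of
 * 64-bit slots; *queue_ptr is the tail pointer and queue_max points
 * just past the last usable slot. */
static void enqueue(atomic_flag *mutex, uint64_t **queue_ptr,
                    uint64_t *queue_max, uint64_t value)
{
    for (;;) {
        while (atomic_flag_test_and_set(mutex))  /* acquire(mutex): spin */
            ;
        uint64_t *q = *queue_ptr;
        if (q >= queue_max) {                    /* queue full: back off, retry */
            atomic_flag_clear(mutex);            /* release(mutex) */
            continue;
        }
        *q = value;                              /* *q := value */
        *queue_ptr = q + 1;                      /* q += 8 (one 64-bit slot) */
        atomic_flag_clear(mutex);                /* release(mutex) */
        return;
    }
}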
Devices, not just CPU cores, need to be able to execute these as well. I am not sure whether ENQUEUE should be an instruction implemented in hardware; probably not.
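To show how this replaces IRQs, here is a rough sketch of the completion path; sleep_on()/wake_all() are stand-ins for the proposed LOCK SLEEP / LOCK WAKEUP instructions and dequeue() is assumed to exist, so none of this is a real API:

#include <stdint.h>

/* Stand-ins for the proposed instructions (not real functions). */
extern void sleep_on(void *addr);    /* LOCK SLEEP addr  */
extern void wake_all(void *addr);    /* LOCK WAKEUP addr */
/* Assumed counterpart to enqueue(); returns 0 when the queue is empty. */
extern int dequeue(uint64_t **queue_ptr, uint64_t *queue_base, uint64_t *value);

/* Device side (conceptually): on completion the device performs
 *   enqueue(&q_mutex, &q_tail, q_end, completion_tag);
 *   wake_all(&q_tail);
 * Kernel side: a consumer thread takes the place of the IRQ handler. */
void completion_thread(uint64_t **q_tail, uint64_t *q_base)
{
    uint64_t tag;
    for (;;) {
        while (!dequeue(q_tail, q_base, &tag))
            sleep_on(q_tail);        /* suspend until the device does WAKEUP */
        /* ... process completion 'tag' ... */
    }
}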
Well, it looks nice, clean and uniform. And we don't need IRQs.
-atom
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 13, 2015 1:46 pm wrote:
> dmcq (dmcq.delete@this.fano.co.uk) on July 13, 2015 12:20 pm wrote:
> >
> > You simply don't seem to be able to acknowledge that supporting that is causing more problems,
> > that Linux is part of a feedback loop leading to more buggy programs being produced and your
> > attitude is part of a problem not a solution.
>
> You seem to not be able to understand reality.
>
> First off, Linux actually _works_ on all the broken memory models. When you say that "Linux
> is part of a feedback loop", you're simply wrong, and full of sh*t. Linux is the most
> portable project I have ever even remotely heard of in reality. We actually do things right.
> Yes, we've had bugs, but we're pretty much the only project out there that does things like
> very fancy lockless algorithms across pretty much every architecture known to man.
>
> Put another way: I know what the f*ck I'm talking about. Weak memory models are
> objectively inferior to stronger ones, and I've given you the reasons for it.
>
> And my claim is that
>
> (a) weak memory ordering doesn't actually buy you anything but confusion and very
> subtle bugs despite (and sometimes due to) more complex code to handle it.
>
> And this is with people who actually understand memory ordering, and write scalable locking,
> and then occasionally get it wrong because the issues are too damn subtle. Code that looks right,
> and works fine in practice, turns out to have really really subtle issues sometimes.
>
> (b) weak memory ordering ends up being a big pain for software because testing is so inconclusive, and
> developers have a super-hard time to see bugs that people with different hardware may trigger easily.
>
> (c) to make matters worse, some particularly crap versions of weak memory ordering also confused
> the hell out of themselves when it came to the serialization. Power really is a complete disaster.
> The difference between lwsync, isync, sync, eieio, and how they actually interact is insane.
>
> (d) weak memory models do not even perform better! It's the one thing that people claim is their
> advantage, and I call BS on that claim. I claim that weak memory models actually perform worse, because
> they end up adding synchronization in places where none is needed, and it would be better to just speculate
> wildly (and in the very unusual case of a conflict, redo the operations in-order).
>
> What part of the above can you not acknowledge? You don't seem to really get that reality is
> tough, and that testing is important, and that bugs happen. You keep living in some fairy land
> where it's ok to say "tough, you shouldn't have bugs, especially in really subtle code that is
> complicated further by the memory ordering making it impossible to test or even think about".
>
> Linus