By: EduardoS (no.delete@this.spam.com), July 15, 2015 2:31 pm
Room: Moderated Discussions
⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on July 15, 2015 12:54 pm wrote:
> Dear Linus,
>
> you persistently keep claiming to know what you are talking about. Unfortunately, that's not the case.
>
> One only needs the simple LOCK CAS (aka CMPXCHG) instruction in SMP with shared memory:
>
> LOCK CAS reg, mem
>
> The semantics of this instruction is: if the LOCK CAS succeeds atomically, it needs to be guaranteed
> that the CPU core performing the CAS has exclusive access to the contents of [mem] while executing
> the CAS and has sent all modified bytes from its private L1/L2 cache to all other cores in the system,
> and that the other cores have confirmed back to the sender that they accepted the bytes and will prevent
> their modification until the CAS completes (or that they have discarded those addresses from their caches
> and will not repopulate them until the CAS completes). If multiple CAS instructions with the same "mem"
> are executed by distinct CPU cores, only one of them will succeed.
>
> Implementations can vary in how they realize these semantics, although most likely there is just
> one, or at most a couple of, efficient ways to do so.
>
> If the above synchronizes data only between cores and does
> not send it to memory as well, add another instruction:
>
> LOCK MCAS reg, mem
>
> This has the same semantics, except that additionally it sends the data to memory and the
> MCAS cannot complete before the data from the sending core has been written to memory.
>
> A device with DMA access behaves like an extra CPU core adhering to the semantics of LOCK MCAS. Since
> communication with a device is much slower than memory access, there needs to be a way of narrowing the
> memory range mapped to the device. Ideally, a HDD/SSD disk with max queue depth of 64 should be able
> to have up to 64 4K regions mapped at the same time (obviously together with up to 64 distinct memory
> addresses mentioned directly in the MCAS instruction, living outside of the 4K regions), so that DMA
> reads/writes can directly target the memory regions where the OS kernel put/expects the 4K of data.
>
> Obviously, if an MCAS is executing (on a CPU or on a device) while one or more CAS with the same address are executing,
> and the MCAS wins the race, all the failed CAS instructions effectively behave as MCAS instructions. On the
> other hand, if a CAS wins the race, neither the successful CAS nor any of the failed CAS instructions know
> for sure that there were failed MCAS executing at the same time (one or more MCAS might have been executing
> and all have failed, but none of the CAS instructions knows or cares about this).
>
> Any semantics beyond that is completely superfluous. Weak vs strong memory ordering discussions - seriously?
> Those are just unnecessary complications favored by people who prefer needlessly complex designs over simplicity.
>
> Reads and writes can be reordered in any kind of way, as long as per-core sequential semantics is preserved in
> every CPU core, and of course as long as the above LOCK CAS/MCAS semantics is preserved in the SMP system.
>
> When distinct cores are doing CAS at non-overlapping memory addresses, there is absolutely
> no need to preserve total memory order. 8 non-overlapping LOCK CAS instructions executing
> concurrently on 8 cores. Concurrently running with up to 64 HDD/SSD reads and writes. Not
> to mention the running GPU transfers and the running network data transfers.
>
> Until a LOCK CAS is executing on a CPU core, there is no need for modified cache memory to be exclusively-held
> cache memory. Distinct CPU cores can have distinct values stored at the same memory address, until a LOCK
> CAS pseudo-randomly chooses which value survives. (Obviously, such situations are programming errors.)
>
> This is the most sane and the most simple global memory ordering for SMP.
> I like it. I say boooo to anything more complicated. So should you.
>
> (For future discussion we can rename LOCK CAS to just CAS,
> provided that we always keep the above semantics in mind.)
>
> And now something completely different, but absolutely necessary: distributed atomic
> queue updates, and wakeups. We just need the following function/instructions:
>
> ENQUEUE(mutex, queue_ptr, queue_max, value)
>
> LOCK SLEEP addr // suspend the current core until somebody performs WAKEUP addr
>
> LOCK WAKEUP addr // wakeup all cores suspended at addr
>
> Enqueue has the following semantics:
>
> again:
> acquire(mutex) // uses CAS/MCAS in a spin loop
> q := *queue_ptr
> if q >= queue_max { release(mutex); goto again; }
> *q := value
> q += 8
> *queue_ptr := q
> release(mutex) // uses CAS/MCAS in a spin loop
>
> Devices, not just CPU cores, need to be able to execute these as well. Not sure
> whether ENQUEUE should be an instruction implemented in hardware, probably not.
>
> Well, it looks nice, clean and uniform. And we don't need IRQs.
>
> -atom
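For reference, the acquire(mutex)/release(mutex) pair that the ENQUEUE pseudocode above hand-waves as "uses CAS/MCAS in a spin loop" would look roughly like this. This is only a sketch using the GCC/Clang __atomic builtins; the spin_mutex type and the builtin-based implementation are my choices, not anything from the quoted post:

#include <stdint.h>

typedef uint64_t spin_mutex;          /* 0 = free, 1 = held */

static void acquire(spin_mutex *m)
{
    uint64_t expected;
    do {
        expected = 0;
        /* LOCK CMPXCHG: succeeds only if *m is still 0, then stores 1.
           A real lock would also pause/back off inside this loop. */
    } while (!__atomic_compare_exchange_n(m, &expected, 1, 0,
                                          __ATOMIC_ACQUIRE, __ATOMIC_RELAXED));
}

static void release(spin_mutex *m)
{
    /* A plain atomic store with release semantics is enough to unlock;
       the post's CAS-based release also works, it is just not required. */
    __atomic_store_n(m, 0, __ATOMIC_RELEASE);
}
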
Why so much complexity? All you need for locking is an instruction like x86 XCHG reg, mem, or even BTS or a locked OR; locking is very easy.
But for efficiency you may need something else.
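
For comparison, a test-and-set spin lock on top of an atomic exchange (which is what a locked XCHG gives you on x86) is about as small as locking gets. Again just a sketch with the GCC/Clang builtins, not code from the thread:

#include <stdint.h>

typedef uint32_t ts_lock;             /* 0 = free, 1 = held */

static void ts_acquire(ts_lock *l)
{
    /* __atomic_exchange_n compiles to XCHG on x86 (implicitly locked):
       unconditionally write 1 and check the old value to see whether we
       got the lock.  A real lock would pause and re-read while spinning. */
    while (__atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE) != 0)
        ;
}

static void ts_release(ts_lock *l)
{
    __atomic_store_n(l, 0, __ATOMIC_RELEASE);
}

BTS or a locked OR works the same way: set the bit unconditionally and test the old value.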