By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), August 24, 2022 11:56 am
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on August 24, 2022 3:50 am wrote:
>
> You have a good point there. I can't say I'm exactly enthused by the memory barriers implicit
> in the read or write of an address in RCU but I guess Linus would go on about the Alpha :-) Anyway
> I guess a load acquire and store release would fix that if it really caused problems.
Actually, RCU optimally wants something that is intentionally weaker than acquire and release.
IOW, for every architecture except for alpha, RCU doesn't need any memory barriers at all. It's ok with pure "sane weakly ordered accesses".
So RCU mostly works with just a model of "single word-sized access gets us a consistent value" (in the kernel we call this READ_ONCE() and WRITE_ONCE()) with no hw ordering requirements (we do have some compiler barriers to make sure that we do get that single access). So on arm and powerpc it turns into simple load/store instructions.
Now, RCU then has other synchronization requirements (ie writers do need proper exclusion from each other, and you have other synchronization that makes the whole unlocked readers possible in the first place), but for the cases RCU is good for (ie the common "99% readers" case with the kinds of data structures that work with RCU) you really don't need even acquire/release.
The only thing that RCU read-side synchronization needs is really that dependent loads give a consistent view: if you load a pointer from memory, and then load a value off that pointer, then there's an implied ordering between the two on the local CPU (ie the first pointer read has to happen before the read that the pointer points to).
That would seem to be something that is almost impossible to not get, but alpha did indeed manage to screw even that ordering up. Even if a remote CPU ordered its writes explicitly with a write barrier, and even when the local CPU does data-dependent reads, alpha could return a stale value for the second read because of how the caches were organized on some implementations.
Anyway, it just means that on alpha, we now do extra crazy things for READ_ONCE() in order to avoid having to do extra crazy things anywhere else.
(And yes, it means that CPU designers that think about doing things like value prediction etc in hardware need to be aware that they still have to follow memory ordering semantics).
In the kernel, we end up depending on other implied barriers too, because explicit barriers are often too expensive.
For example, we have this notion of a "write barrier due to conditional": a simple conditional acts as a barrier between a read and a write if the conditional depends on the value we read. The CPU can wildly speculate and do anything it really wants internally, but we know the write cannot possibly become visible on another CPU before the CPU has completed the read. If it could, other CPUs might observe values that were never actually written on the path taken, and we do not allow memory ordering that gives you values that have never been written. So even without any explicit memory ordering instruction, we know the accesses are ordered.
Would I encourage others to do the kinds of things the kernel does? Not in application code, no. It just requires people to care too much about the kinds of details that you simply shouldn't care about. But in core libraries and language runtimes, absolutely yes.
So if you are writing a memory allocation library or you have some very core hash table in your runtime, you'd probably end up doing a lot of the same things we do in the kernel. And in Linux we end up actually exporting some interfaces to make that possible (google "RCU in user space" and "rseq in user space").
Linus