By: anon (anon.delete@this.anon.com), December 3, 2014 5:08 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on December 3, 2014 11:15 am wrote:
> Andreas (kingmouf.delete@this.gmail.com) on December 3, 2014 6:51 am wrote:
> >
> > I think that we should realize that the world does not spin only around iDevices and Apple. In my point
> > of view this is more geared towards larger systems and servers rather than tablets and phones.
>
> Yes. From what I've seen, the advantage of atomic RMW ops is under heavy contention,
> when they are better at making progress than the equivalent "read/op/cmpxchg" loop.
>
> Obviously, I can compare mainly against x86, since that is the only other relevant architecture that
> has RMW operations. And x86 doesn't have the whole "load-linked" and "store-conditional" model (ARMv8
> calls it "load/store exclusive"), so that read/op/cmpxchg is the closest semi-equivalent sequence.
>
> I'm personally a fan of RMW instructions due to the guaranteed progress and the whole potential cache
> coherency protocol advantage (no need for write intent hints etc). So it makes sense to me.
We've talked about this here before, but LL/SC can guarantee forward progress too (when its use is restricted, the way POWER restricts it), and the LL of course always carries a load-for-store signal to the coherency protocol.
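To make that concrete, here's a rough C11 sketch of my own (nothing ARM-specific): the first function is the read/op/cmpxchg loop you describe, which is also more or less what an LL/SC retry loop looks like once compiled; the second is the single RMW that x86's lock xadd, and presumably the new ARMv8 instructions, give you directly.

#include <stdatomic.h>

/* read/op/cmpxchg: needs a retry loop, can spin under heavy contention */
static int add_cmpxchg(atomic_int *p, int v)
{
        int old = atomic_load_explicit(p, memory_order_relaxed);

        /* a failed CAS reloads 'old' with the current value, then retry */
        while (!atomic_compare_exchange_weak_explicit(p, &old, old + v,
                                                      memory_order_seq_cst,
                                                      memory_order_relaxed))
                ;
        return old;
}

/* single RMW: makes progress without a retry branch */
static int add_rmw(atomic_int *p, int v)
{
        return atomic_fetch_add_explicit(p, v, memory_order_seq_cst);
}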
>
> Of course, there might be a code density issue driving this too. It is not unreasonable
> to have reference count updates etc that you want to inline in JIT'ed code, and the
> whole "loop over load-locked/add/store-conditional" model is just damn painful for that.
> So there might certainly be reasons to do the atomics even in small devices.
Yes, I wonder. I would not have thought they are common enough to make a significant difference (i.e., well under 1% of dynamic instructions executed), but maybe my thinking is out of date now that atomic operations no longer cost a hundred cycles.
I wonder if power is another motivation. In theory, if the instructions have no side effects that depend on the returned value, you could export such operations to the remote owner of the cache line in your cache coherency protocol, or to a memory controller if the line is not owned by anyone, without blocking the core on the read.
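Concretely, what I have in mind is the difference between these two (C11 again; whether any hardware actually exploits it is pure speculation on my part):

#include <stdatomic.h>

/* No result consumed: nothing in the core depends on the old value, so in
 * principle the add could be pushed out to wherever the line currently
 * lives (remote owner, or the memory controller) without stalling. */
static void blind_add(atomic_long *p, long v)
{
        atomic_fetch_add_explicit(p, v, memory_order_relaxed);
}

/* Result consumed: the caller needs the old value before it can go on,
 * so the core has to wait for the data no matter who performs the add. */
static long fetch_then_add(atomic_long *p, long v)
{
        return atomic_fetch_add_explicit(p, v, memory_order_relaxed);
}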
>
> Which makes me wonder: the docs say that the instructions "also include controls associated
> with influencing the order properties" (good - the memory ordering requirement for an atomic
> that gets a reference count can be very different from the memory ordering of an atomic that
> just increments some statistics), but there are cases where you don't even care about SMP
> atomicity, you just want atomicity wrt interrupts or even just smaller code.
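Right - in C11 terms that difference is just the memory_order argument on the same operation; roughly this (my sketch, no claim about which orderings the actual encodings expose):

#include <stdatomic.h>
#include <stdbool.h>

/* taking a reference doesn't need to order anything by itself */
static void ref_get(atomic_int *refcount)
{
        atomic_fetch_add_explicit(refcount, 1, memory_order_relaxed);
}

/* dropping the last reference is the interesting one: release publishes
 * this thread's stores to whoever frees the object, acquire orders the
 * free after everybody else's stores */
static bool ref_put(atomic_int *refcount)
{
        return atomic_fetch_sub_explicit(refcount, 1,
                                         memory_order_acq_rel) == 1;
}

/* a statistics bump, by contrast, wants no ordering at all */
static void stats_inc(atomic_long *counter)
{
        atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}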
>
> So I wonder if the "order properties" include that kind of "UP-only interrupt
> atomicity" ordering that isn't even SMP-safe but is potentially cheaper..
>
> Linus