By: anon (anon.delete@this.anon.com), July 12, 2015 3:42 am
Room: Moderated Discussions
Simon Farnsworth (simon.delete@this.farnz.org.uk) on July 10, 2015 1:07 pm wrote:
> dmcq (dmcq.delete@this.fano.co.uk) on July 9, 2015 4:24 pm wrote:
> On the other hand, the hardware doesn't actually do "memory barrier" operations directly; it passes
> messages around in the cache coherency protocol (MOESI or similar).
This is not true. Depending on the interconnect and cache coherency rules, a significant amount (or even most) of the reordering is done inside the CPU core itself.
In particular, loads pass other memory operations due to out-of-order execution and non-blocking loads (load/load reordering can be hidden from the ISA by speculation in complex cores, but apparently load/store reordering cannot). And stores like to pass other stores before they reach the cache coherency layer (e.g. when a store's cache line is not yet held exclusive), and to a lesser extent they pass blocked loads.
So all of that happens inside the core, and memory ordering instructions have to prevent these reorderings within the core.
> If you really want to make the
> hardware's life simple (so that it can really scream), you'd surely push back and make software decide
> exactly which cache coherency messages it wants to send and when it forces write back from cache
> to RAM - indeed, in some senses, a Cell SPU forced the developers to do exactly that.
>
> The fact that we prefer to hide that simple approach underneath memory barriers and hardware
> cache controls suggests that we'd prefer to constrain the hardware to make the programming
> model easier to grasp - look at how hard developers found it to exploit the Cell SPUs.