By: Simon Farnsworth (simon.delete@this.farnz.org.uk), July 12, 2015 1:02 pm
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on July 10, 2015 3:49 pm wrote:
> Simon Farnsworth (simon.delete@this.farnz.org.uk) on July 10, 2015 1:07 pm wrote:
> > dmcq (dmcq.delete@this.fano.co.uk) on July 9, 2015 4:24 pm wrote:
> > > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 9, 2015 12:14 pm wrote:
> > > > Maynard Handley (name99.delete@this.name99.org) on July 9, 2015 10:12 am wrote:
> > > > >
> > > > > In a talk by Leslie Lamport at Stanford a few months ago, he claimed that formal verification of
> > > > > an implementation of the POWER memory model had caught a bug that testing would not have caught.
> > > > > You can read this as "the POWER memory model sucks" or you can read this as "this
> > > > > sort of thing should be handled by formal verification rather than testing"...
> > > >
> > > > I think formal verification is absolutely the way to go when it comes
> > > > to memory ordering. Because it is too damn hard any other way.
> > > >
> > > > However, I happen to believe in formal verification as a solution for things like cache
> > > > protocol definitions (where it is already used) and hardware (where I think it's used
> > > > at some levels). Make the rules clear, understandable and unambiguous (and very much
> > > > not surprising), and show that the hardware actually follows the rules.
> > > >
> > > > But formal verification is not necessarily nearly as useful for software.
> > > >
> > > > I used to think that it was completely unrealistic for software, but it turns out that it's at
> > > > least somewhat possible to maybe not really "formally" verify these things, but at least automate
> > > > a lot of memory access verification, and there are groups doing some very impressive things.
> > > >
> > > > The software verification is often with adding annotations or simplifying the problem space other
> > > > ways, though - so the "verification" is not necessarily verifying the actual totality code, but
> > > > a simplified representation or sub-case. In most cases the verification is run over traces of code
> > > > (running on different CPU's) and verifying that those traces follow the rules. So it may not be
> > > > formally verifying the whole thing, but at least it's verifying something, preferably against
> > > > the actual hardware rules that have also been verified to be honored by the hardware.
> > > >
> > > > But that kind of software verification will not be run on most software. There are groups using
> > > > Linux for their work (and presumably other systems too, I just wouldn't know about it), and there
> > > > are groups (often the same group) looking at things like core system libraries. So hopefully there
> > > > will be coverage of the really core system code, and hopefully at least the core locking primitives
> > > > have been verified to work, and some really basic libraries are using them correctly.
> > > >
> > > > But verifying software (even with the caveats mentioned above) formally in general? Yeah, not
> > > > going to happen. People will forget to take locks, and people will create their own ad-hoc synchronization
> > > > models, and they will get them wrong. And they will "work". Most of the time.
> > > >
> > > > And the subtler and the less intuitive the hardware rules are, the more likely it will be that they will
> > > > work fine on one piece of hardware (inevitably the one
> > > > that the developer has, because that's what he tested
> > > > his fancy new idea on), but then result in very odd and occasional problems on other implementations.
> > > >
> > > > Tightening down the hardware rules really does help enormously
> > > > with problems like that. The code may be buggy
> > > > as hell, but at least what you tested on one machine tends
> > > > to be reasonably close to what you run three generations
> > > > later and your binary still works, rather than mysteriously crashing in ways you can't reproduce.
> > > >
> > > > And btw, this is not at all a new issue. The whole "one CPU does something subtly different from another
> > > > CPU because it wasn't well-defined in the architecture" is age-old, and it has been a problem before.
> > > >
> > > > Think "undefined flags after operations" kind of things (x86 had several cases of that), or unused
> > > > high bits in pointer uses (m68k), or incomplete instruction decoding (6502) or silent odd behavior
> > > > with unaligned addresses (ARM) or any number of those under-specified hardware issues that people
> > > > ended up using even though they often weren't documented (and then people had to document them).
> > > >
> > > > HW designers have largely learnt not to do that any more. Weak memory ordering isn't really all that
> > > > different. It's simply worth making the definition stricter (and using extra hardware to enforce
> > > > the stricter rules) to avoid problems down the line in the next microarchitecture when you changed
> > > > something fundamentally internally but don't want that change to break existing applications.
> > > >
> > > > Linus
> > >
> > > If every system supported sequential consistency and was clearly defined to it would exacerbate
> > > the problems not solve them. It would not solve most of the problems and it would encourage
> > > intelligent programmers to devise yet more tricks. And as to your contention that Linux supports
> > > just having a weak model with data dependency there are Linux documents clearly saying data
> > > dependency is needed in Linux without mentioning rcu or particular processors.
> > >
> > > There are going to be more problems in the future not less, it is not a question of fixing the bugs but
> > > stopping new ones. Instead of having all the processors going at the same speed you have different speeds
> > > as in ARM systems. Far more processors will be used and the relative distance between them considering
> > > their speed will become much greater. GPUs are acting as associated processors and network processing
> > > wants data passed around easily. The hardware needs a simple model and sequential consistency is not that.
> > > It is just a CS model and not something for humans to program with. Any sequential consistency should
> > > be in marked out blocks of code that users can think about and anything more is just trouble.
> > >
> > > Yes loads of unaligned data is the way everyone does it. That came mainly because of Linux too
> > > from the x86 implementation. It was also found to be useful in implementing string operations
> > > like memcpy. But we would have been better off if it had just faulted and people had to use
> > > operations that were explicitly for unaligned load. It would mean less bugs in the code.
> > >
> >
> > Why would we be better off? Surely having one load and one store instruction that always work
> > and happen to be fast if they're aligned makes for simpler, less buggy code, than having to explicitly
> > choose fast aligned or slow unaligned, and risk bugs if you choose the wrong one?
>
> Always work for what? I don't think you have a clear picture in your mind about what is being done.
>
I mean that instead of having separate operations for "load aligned" and "load unaligned", there should be only one "load" operation, which happens to run faster when you meet alignment requirements.
In other words, I fundamentally disagree with "But we would have been better off if it had just faulted and people had to use operations that were explicitly for unaligned load." I think that's a recipe for hard-to-find and hard-to-fix bugs, compared to a system where there is only one load operation, which works for both unaligned and aligned loads, and (if hardware design requires it) is simply faster on aligned loads.
> > It's like NUMA in that respect - there's nothing about a NUMA system that can't be implemented atop
> > uniform memory access nodes connected into a cluster via a message passing interface (because that's
> > what the hardware is underneath a NUMA memory subsystem), but the effect is that on the cluster,
> > software fails to produce the correct results if it doesn't get the message passing right (and is
> > prone to design faults like "always send and receive the message, even if it's going to loop back
> > to the same node"), whereas a NUMA system runs at full tilt if you're careful to stay on the same
> > node as far as possible, and simply slows down if you need cross-node consistency.
>
> I don't see the relevance of this.
>
NUMA is a case of "make the hardware more complicated, so that software has fewer things to get right." You're arguing for making the hardware simpler, at the expense of giving software more things to get right (consistency, alignment).
Broadly speaking, I'm working from the following two axioms:
- Easy-to-reproduce bugs are quicker to fix than hard-to-reproduce bugs.
- Performance bugs are less important than correctness bugs - you can live with a performance bug if it's rare enough (or your hardware budget allows). You cannot live with a correctness bug.
NUMA, loads that work even when unaligned, and strong consistency models are all ways for hardware to convert correctness bugs in software into performance bugs.
> > > Anyway sequential consistency or anything like it is not going to be implemented as it does put in
> > > too much of a constraint on the hardware for no particular advantage. You can think of it as not caring
> > > about bugs but sequential consistency in hardware for all memory operations is just a very bad idea
> > > for both hardware and software. Think of it if you like as the simplicity of RISC which meant we don't
> > > have interrupts on overflows or automatic bounds checks on array accesses - now there's an area where
> > > if we really cared about bugs rather than performance we would have done things differently.
> >
> > On the other hand, the hardware doesn't actually do "memory barrier" operations directly; it passes
> > messages around in the cache coherency protocol (MOESI or similar). If you really want to make the
> > hardware's life simple (so that it can really scream), you'd surely push back and make software decide
> > exactly which cache coherency messages it wants to send and when it forces write back from cache
> > to RAM - indeed, in some senses, a Cell SPU forced the developers to do exactly that.
>
> I'm well aware of the normal ways that coherency is achieved. No my main concern is that code should be
> bug free I think guaranteeing total order or sequential consistency or even data dependency just encourages
> bugs. Acquire release operations and some atomics can do practically everything well and avoid the bugs.
> I am not saying we should make life difficult for the software or do things in a complicated way.
>
But sequential consistency is easier for software than weaker consistency models, just as a single load operation that works as well as possible for every load is easier than a split pair. If you're talking about separate unaligned and aligned load operations, and weaker consistency models, you're making things harder for the software and making the software more complicated; I argue that doing so needs very strong justification.
In the case of sequential consistency, the argument is that the consistency traffic makes high performance impossible in a parallel system; therefore, you need to weaken it. TSO (the x86 model) is still easy for software to reason about, needing very few barrier operations (and those mostly in core libraries, not in user code) to get a memory model that most users can reason about.
> > The fact that we prefer to hide that simple approach underneath memory barriers and hardware
> > cache controls suggests that we'd prefer to constrain the hardware to make the programming
> > model easier to grasp - look at how hard developers found it to exploit the Cell SPUs.
>
> By simple you seem to mean some sort of direct hardware control judging from
> what you just said before but I'm not certain that's what you mean by simple
> here. I don't think the Cell processors are relevant to anything I have said.
>
The direct hardware control is as simple as possible for the hardware to implement - not only that, but it's an even weaker consistency model than Alpha (as it has no consistency at all, other than that imposed by software driving a consistency protocol). If you're really arguing that simple hardware is better, why should the hardware choose any consistency protocol, when it can expose the building blocks and let software handle it?