By: dmcq (dmcq.delete@this.fano.co.uk), July 14, 2015 2:54 pm
Room: Moderated Discussions
Simon Farnsworth (simon.delete@this.farnz.org.uk) on July 14, 2015 1:21 pm wrote:
> dmcq (dmcq.delete@this.fano.co.uk) on July 13, 2015 5:03 am wrote:
> > Simon Farnsworth (simon.delete@this.farnz.org.uk) on July 12, 2015 1:02 pm wrote:
> > > dmcq (dmcq.delete@this.fano.co.uk) on July 10, 2015 3:49 pm wrote:
> > > > Simon Farnsworth (simon.delete@this.farnz.org.uk) on July 10, 2015 1:07 pm wrote:
> > > > > dmcq (dmcq.delete@this.fano.co.uk) on July 9, 2015 4:24 pm wrote:
> > > > > > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 9, 2015 12:14 pm wrote:
> > > > > > > Maynard Handley (name99.delete@this.name99.org) on July 9, 2015 10:12 am wrote:
> > > > > > > >
> > > > > > > > In a talk by Leslie Lamport at Stanford a few months ago, he claimed that formal verification of
> > > > > > > > an implementation of the POWER memory model had caught a bug that testing would not have caught.
> > > > > > > > You can read this as "the POWER memory model sucks" or you can read this as "this
> > > > > > > > sort of thing should be handled by formal verification rather than testing"...
> > > > > > >
> > > > > > > I think formal verification is absolutely the way to go when it comes
> > > > > > > to memory ordering. Because it is too damn hard any other way.
> > > > > > >
> > > > > > > However, I happen to believe in formal verification as a solution for things like cache
> > > > > > > protocol definitions (where it is already used) and hardware (where I think it's used
> > > > > > > at some levels). Make the rules clear, understandable and unambiguous (and very much
> > > > > > > not surprising), and show that the hardware actually follows the rules.
> > > > > > >
> > > > > > > But formal verification is not necessarily nearly as useful for software.
> > > > > > >
> > > > > > > I used to think that it was completely unrealistic for software, but it turns out that it's at
> > > > > > > least somewhat possible to maybe not really "formally" verify these things, but at least automate
> > > > > > > a lot of memory access verification, and there are groups doing some very impressive things.
> > > > > > >
> > > > > > > The software verification is often done by adding annotations or simplifying the problem space in other
> > > > > > > ways, though - so the "verification" is not necessarily verifying the actual totality of the code, but
> > > > > > > a simplified representation or sub-case. In most cases the verification is run over traces of code
> > > > > > > (running on different CPU's) and verifying that those traces follow the rules. So it may not be
> > > > > > > formally verifying the whole thing, but at least it's verifying something, preferably against
> > > > > > > the actual hardware rules that have also been verified to be honored by the hardware.
> > > > > > >
> > > > > > > But that kind of software verification will not be run on most software. There are groups using
> > > > > > > Linux for their work (and presumably other systems too, I just wouldn't know about it), and there
> > > > > > > are groups (often the same group) looking at things like core system libraries. So hopefully there
> > > > > > > will be coverage of the really core system code, and hopefully at least the core locking primitives
> > > > > > > have been verified to work, and some really basic libraries are using them correctly.
> > > > > > >
> > > > > > > But verifying software (even with the caveats mentioned above) formally in general? Yeah, not
> > > > > > > going to happen. People will forget to take locks, and people will create their own ad-hoc synchronization
> > > > > > > models, and they will get them wrong. And they will "work". Most of the time.
> > > > > > >
> > > > > > > And the subtler and the less intuitive the hardware rules are, the more likely it will be that they will
> > > > > > > work fine on one piece of hardware (inevitably the one
> > > > > > > that the developer has, because that's what he tested
> > > > > > > his fancy new idea on), but then result in very odd and occasional problems on other implementations.
> > > > > > >
> > > > > > > Tightening down the hardware rules really does help enormously
> > > > > > > with problems like that. The code may be buggy
> > > > > > > as hell, but at least what you tested on one machine tends
> > > > > > > to be reasonably close to what you run three generations
> > > > > > > later and your binary still works, rather than mysteriously crashing in ways you can't reproduce.
> > > > > > >
> > > > > > > And btw, this is not at all a new issue. The whole "one CPU does something subtly different from another
> > > > > > > CPU because it wasn't well-defined in the architecture" is age-old, and it has been a problem before.
> > > > > > >
> > > > > > > Think "undefined flags after operations" kind of things (x86 had several cases of that), or unused
> > > > > > > high bits in pointer uses (m68k), or incomplete instruction decoding (6502) or silent odd behavior
> > > > > > > with unaligned addresses (ARM) or any number of those under-specified hardware issues that people
> > > > > > > ended up using even though they often weren't documented (and then people had to document them).
> > > > > > >
> > > > > > > HW designers have largely learnt not to do that any more. Weak memory ordering isn't really all that
> > > > > > > different. It's simply worth making the definition stricter (and using extra hardware to enforce
> > > > > > > the stricter rules) to avoid problems down the line in the next microarchitecture when you changed
> > > > > > > something fundamentally internally but don't want that change to break existing applications.
> > > > > > >
> > > > > > > Linus
> > > > > >
> > > > > > If every system supported sequential consistency and was clearly defined to do so, it would exacerbate
> > > > > > the problems, not solve them. It would not solve most of the problems and it would encourage
> > > > > > intelligent programmers to devise yet more tricks. And as to your contention that Linux supports
> > > > > > just having a weak model with data dependency, there are Linux documents clearly saying data
> > > > > > dependency is needed in Linux without mentioning RCU or particular processors.
> > > > > >
> > > > > > There are going to be more problems in the future, not less; it is not a question of fixing the bugs but
> > > > > > stopping new ones. Instead of having all the processors going at the same speed you have different speeds,
> > > > > > as in ARM systems. Far more processors will be used, and the relative distance between them, considering
> > > > > > their speed, will become much greater. GPUs are acting as associated processors and network processing
> > > > > > wants data passed around easily. The hardware needs a simple model and sequential consistency is not that.
> > > > > > It is just a CS model and not something for humans to program with. Any sequential consistency should
> > > > > > be in marked out blocks of code that users can think about and anything more is just trouble.
> > > > > >
> > > > > > Yes, loads of unaligned data are the way everyone does it. That came about mainly because of Linux too,
> > > > > > from the x86 implementation. It was also found to be useful in implementing string operations
> > > > > > like memcpy. But we would have been better off if it had just faulted and people had to use
> > > > > > operations that were explicitly for unaligned load. It would mean fewer bugs in the code.
> > > > > >
> > > > >
> > > > > Why would we be better off? Surely having one load and one store instruction that always work
> > > > > and happen to be fast if they're aligned makes for simpler, less buggy code, than having to explicitly
> > > > > choose fast aligned or slow unaligned, and risk bugs if you choose the wrong one?
> > > >
> > > > Always work for what? I don't think you have a clear picture in your mind about what is being done.
> > > >
> > >
> > > I mean that instead of having separate operations for "load aligned" and "load unaligned", there should
> > > only be one "load" operation, which happens to run faster if you meet alignment requirements.
> > >
> > > In other words, I fundamentally disagree with "But we would have been better off if it had just faulted and
> > > people had to use operations that were explicitly for unaligned load." I think that's a recipe for hard to
> > > find and hard to fix bugs, as compared to a system where there is only one load operation, which works for
> > > both unaligned and aligned loads, and (if hardware design requires it) is simply faster on aligned loads.
> >
> > On the basis of your argument, not having bounds checks was a wonderful way to get rid of bugs. I'm sure
> > lots of programs are running nowadays that wouldn't otherwise, because they would have bounds check errors.
> > The only way that loading and storing integers on odd word boundaries is a good idea is if the intention
> > of the user is to do that. No programming language does that by default, so it is a strange requirement.
> >
>
> You've misunderstood my argument, then. My argument is that a software author is faced with two choices:
>
> - Use the slow unaligned operations anywhere where it's possible that another component of the system
> has mistakenly given you unaligned arguments. Effectively, this means that every argument has to be
> copied on entry to a function, because it's possible for the final dereference of a pointer to be a
> long way from its origin; for example, a caller could go "oh, this is always the 4th payload byte in
> an IP packet, therefore it's 4 byte aligned", and not be aware that it's been called with a pointer
> to an IP packet embedded in an Ethernet frame, where the start of the Ethernet frame is aligned. You
> will notice that there is no correctness issue arising from using an unaligned read here.
>
> - Use the fast aligned operations, and check that your callers get it right - this means extra code
> whenever you're passed data from outside to confirm that they've considered alignment properly,
> and switch to a slow path if they haven't, so that when someone passes (ethernet_frame_start + 14)
> to the IPv4 packet handler as ip_packet_start, and it passes (ip_packet_start + IHL * 5) to the
> payload handler, you get an unaligned read, rather than a fast aligned read. Note that the read
> for IHL will be naturally aligned, even if the packet start itself is not, as IHL is a 4 bit field
> on a 4 bit boundary, and I've pushed the packet out by 2 bytes from perfect alignment.
>
> Given that option 1 is slow even when people get it right, you'd expect software engineers who care about performance
> at all to write option 2. However, option 2 is simply "is the input aligned? If so, use fast instruction for
> aligned read, else use slow instruction for unaligned read" - and I don't see why that burden can't be passed
> down to the hardware. Then, when the performance is an issue, you can run your profiler, discover the unaligned
> accesses and fix them then; if you only ever have the IP packet embedded in an Ethernet packet when CPU time
> is cheap, then you never need fix the alignment (possibly needing a change of Ethernet controller).
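[For illustration, here is a minimal C sketch of the two options above; the function names are made up for this example, and the memcpy idiom is simply the portable way to say "load from a possibly unaligned address".]

#include <stdint.h>
#include <string.h>

/* Option 1: a byte-safe load that works for any address; compilers
   typically lower the memcpy to the best load the ISA allows. */
static uint32_t load_u32_any(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Option 2: check alignment and pick the fast or slow path by hand -
   exactly the dispatch that a single hardware load instruction
   would otherwise do for you. */
static uint32_t load_u32_dispatch(const void *p)
{
    if (((uintptr_t)p & 3u) == 0)
        return *(const uint32_t *)p;   /* fast aligned load */
    return load_u32_any(p);            /* slower unaligned fallback */
}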
>
> Removing bounds checks is a correctness issue; if you remove the bounds check, the program now crashes when faced
> with bad input. That's a separate argument - and I would suggest that you're actually the one arguing that bounds
> checks are unnecessary, because programs that do bad things on bad data will have their bugs fixed.
>
> > > > > It's like NUMA in that respect - there's nothing about a NUMA system that can't be implemented atop
> > > > > uniform memory access nodes connected into a cluster via a message passing interface (because that's
> > > > > what the hardware is underneath a NUMA memory subsystem), but the effect is that on the cluster,
> > > > > software fails to produce the correct results if it doesn't get the message passing right (and is
> > > > > prone to design faults like "always send and receive the message, even if it's going to loop back
> > > > > to the same node"), whereas a NUMA system runs at full tilt if you're careful to stay on the same
> > > > > node as far as possible, and simply slows down if you need cross-node consistency.
> > > >
> > > > I don't see the relevance of this.
> > > >
> > >
> > > NUMA is a case of "make the hardware more complicated, so that software has fewer
> > > things to get right." You're arguing for making the hardware simpler, at the expense
> > > of giving software more things to get right (consistency, alignment).
> > >
> > > Broadly speaking, I'm working from the following two axioms:
> > >
> > > - Easy to reproduce bugs are quicker to fix than hard to reproduce bugs.
> > > - Performance bugs are less important than correctness bugs - you can live with a performance bug
> > > if it's rare enough (or your hardware budget allows). You cannot live with a correctness bug.
> > >
> > > NUMA, all loads work even if unaligned, and strong consistency models are all ways
> > > for hardware to convert correctness bugs in software into performance bugs.
> >
> > Exactly where has this requirement come from for loading 4 byte unaligned strings
> > into registers but not 3 byte or 5 byte ones? Don't you think it would be better
> > if bugs were actually caught or even better not written in the first place?
> >
>
> Well, for example, I'm loading a 4 byte quantity from 4 bytes into the payload of an
> IP packet (TCP sequence number). I don't realise that another component has aligned the
> Ethernet frame instead of the IP frame, and thus it's not safe for me to do that.
>
> And no, I don't see why it's better for the other component to misalign the Ethernet frame
> for the benefit of later IP payload processing - it doesn't necessarily know that the IP payload
> is going to be interesting later, and would not be being unreasonable if it believed that
> the Ethernet packet itself was more interesting than the payload of its payload.
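[To make that hazard concrete: with a 4-byte-aligned Ethernet frame, the 14-byte Ethernet header leaves the IP header, and the TCP sequence number 4 bytes into the IP payload, off a 4-byte boundary. A hedged C sketch, assuming the common no-VLAN, IHL == 5 layout; the offsets and the function name are illustrative only.]

#include <stdint.h>
#include <string.h>

/* Illustrative offsets for the common case only: 14-byte Ethernet header,
   20-byte IPv4 header (IHL == 5), sequence number at offset 4 of the TCP header. */
enum { ETH_HLEN = 14, IP_HLEN = 20, TCP_SEQ_OFF = 4 };

uint32_t tcp_seq_from_frame(const uint8_t *frame)   /* frame assumed 4-byte aligned */
{
    const uint8_t *p = frame + ETH_HLEN + IP_HLEN + TCP_SEQ_OFF;  /* offset 38: not 4-byte aligned */
    uint32_t seq_be;
    /* A direct cast to (const uint32_t *) here is the "obviously correct" code that
       faults (or silently misbehaves, on old ARM) on strict-alignment hardware;
       memcpy sidesteps the alignment question entirely. */
    memcpy(&seq_be, p, sizeof seq_be);
    return seq_be;   /* still in network byte order; byte-swap as needed */
}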
>
> > > > > > Anyway sequential consistency or anything like it is not going to be implemented as it does put in
> > > > > > too much of a constraint on the hardware for no particular advantage. You can think of it as not caring
> > > > > > about bugs but sequential consistency in hardware for all memory operations is just a very bad idea
> > > > > > for both hardware and software. Think of it if you like as the simplicity of RISC which meant we don't
> > > > > > have interrupts on overflows or automatic bounds checks on array accesses - now there's an area where
> > > > > > if we really cared about bugs rather than performance we would have done things differently.
> > > > >
> > > > > On the other hand, the hardware doesn't actually do "memory barrier" operations directly; it passes
> > > > > messages around in the cache coherency protocol (MOESI or similar). If you really want to make the
> > > > > hardware's life simple (so that it can really scream), you'd surely push back and make software decide
> > > > > exactly which cache coherency messages it wants to send and when it forces write back from cache
> > > > > to RAM - indeed, in some senses, a Cell SPU forced the developers to do exactly that.
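[Roughly what that software-driven model looks like, as a hedged sketch: dma_get/dma_put/dma_wait below are placeholders standing in for whatever primitives the platform actually provides (on Cell, the MFC get/put commands), not a real API.]

#include <stddef.h>
#include <stdint.h>

/* Hypothetical DMA primitives standing in for the platform's real ones. */
void dma_get(void *local, uint64_t remote_ea, size_t len, int tag);
void dma_put(uint64_t remote_ea, const void *local, size_t len, int tag);
void dma_wait(int tag);

#define CHUNK 4096
static uint8_t local_buf[CHUNK];   /* private local store; never coherent with RAM */

/* Software, not hardware, decides when data moves and when it becomes visible. */
void process(uint64_t src_ea, uint64_t dst_ea, size_t len)
{
    for (size_t off = 0; off < len; off += CHUNK) {
        size_t n = (len - off < CHUNK) ? len - off : CHUNK;
        dma_get(local_buf, src_ea + off, n, 0);
        dma_wait(0);                          /* the explicit "coherency message" */
        for (size_t i = 0; i < n; i++)
            local_buf[i] ^= 0xFF;             /* work only on local memory */
        dma_put(dst_ea + off, local_buf, n, 0);
        dma_wait(0);
    }
}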
> > > >
> > > > I'm well aware of the normal ways that coherency is achieved. No, my main concern is that code should be
> > > > bug free. I think guaranteeing total order or sequential consistency or even data dependency just encourages
> > > > bugs. Acquire/release operations and some atomics can do practically everything well and avoid the bugs.
> > > > I am not saying we should make life difficult for the software or do things in a complicated way.
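[A minimal sketch of that style using C11 atomics; the names are made up for the example. This is the classic message-passing idiom: the release store on the producer side pairs with the acquire load on the consumer side, with no reliance on any total order.]

#include <stdatomic.h>
#include <stdbool.h>

static int data;
static atomic_bool ready;   /* zero-initialized: not ready */

void producer(void)
{
    data = 42;                                              /* plain store */
    atomic_store_explicit(&ready, true, memory_order_release);
}

bool consumer(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = data;   /* guaranteed to see 42, on TSO and weakly ordered hardware alike */
        return true;
    }
    return false;
}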
> > > >
> > >
> > > But sequential consistency is easier for software than
> > > weaker consistency models, as is only having one load
> > > operation that works as well as possible for all types of
> > > load. If you're talking about having separate unaligned
> > > and aligned load operations, and weaker consistency models,
> > > you're making things harder for the software, and
> > > making the software more complicated; I argue that to do so needs very strong justification.
> > >
> > > In the case of sequential consistency, the argument is that the consistency traffic makes performance
> > > impossible in a parallel system; therefore, you need to weaken it. TSO (the x86 model) is still
> > > easy for software to reason about, needing very few barrier operations (and those mostly in core
> > > libraries, not in user code) to get a memory model that most users can reason about.
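[One concrete illustration of where those few barriers go: the main reordering TSO permits is a store appearing after a later load of a different location, so Dekker-style flag protocols need full (seq_cst) ordering, while ordinary producer/consumer code does not. A hedged C11 sketch with made-up names follows.]

#include <stdatomic.h>

static atomic_int flag0, flag1;   /* zero-initialized */

/* With seq_cst ordering, at most one of the two threads can observe the
   other's flag as 0; with plain release/relaxed ordering, both could,
   and both would enter the critical section. */
int thread0_try_enter(void)
{
    atomic_store_explicit(&flag0, 1, memory_order_seq_cst);
    return atomic_load_explicit(&flag1, memory_order_seq_cst) == 0;
}

int thread1_try_enter(void)
{
    atomic_store_explicit(&flag1, 1, memory_order_seq_cst);
    return atomic_load_explicit(&flag0, memory_order_seq_cst) == 0;
}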
> >
> > Sequential consistency can be used in computer science papers to construct elegant algorithms.
> > If we could ensure that the only use of it was by people who could actually do it properly
> > that would be fine. However that is not what happens in the real world. Computer studies
> > teach loads of people about this and they try to use it in practice. People without a clue
> > read up about it and pat themselves on the back for their cleverness in using it.
> >
> > You seem to be starting with the assumption that there is a special requirement to load 4 byte
> > or eight byte unaligned strings but not 5 byte ones. You are also starting with the assumption
> > that there is some requirement to use what people have learnt about sequential consistency
> > or total order to devise a clever algorithm rather than do things straightforwardly.
> >
> > The real question is whether there is something important that we can't do straightforwardly without
> > some elegant CS algorithm reasoning about total order. If there is, then can it be formulated in
> > a way that can be supported well in the hardware without encouraging stupid tricks?
> >
>
> No; I'm assuming that the site at which the mistake is made is likely to be distant from the
> site at which the bug manifests (see the Ethernet packet example). Given this, I expect
> debugging the issue to be challenging, as the code that's failing is "obviously" correct.
>
> I would prefer hardware to make debugging software simpler, not harder; it's better for my
> customers to have hardware that's 99% of peak possible performance, on a system where the hardware
> + software stack is bug-free, than it is to have hardware that's capable of 100% performance,
> but the software stack has gremlins that manifest as non-reproducible glitches.
>
> > > > > The fact that we prefer to hide that simple approach underneath memory barriers and hardware
> > > > > cache controls suggests that we'd prefer to constrain the hardware to make the programming
> > > > > model easier to grasp - look at how hard developers found it to exploit the Cell SPUs.
> > > >
> > > > By simple you seem to mean some sort of direct hardware control, judging from
> > > > what you just said before, but I'm not certain that's what you mean by simple
> > > > here. I don't think the Cell processors are relevant to anything I have said.
> > > >
> > > The direct hardware control is as simple as possible for
> > > the hardware to implement - not only that, but it's
> > > an even weaker consistency model than Alpha (as it has no
> > > consistency at all, other than that imposed by software
> > > driving a consistency protocol). If you're really arguing
> > > that simple hardware is better, why should the hardware
> > > choose any consistency protocol, when it can expose the building blocks and let software handle it?
> >
> > The advantage of the Alpha approach is it doesn't encourage people to do stupid things
> > thinking they are clever. It is a side benefit that it is easier for the hardware.
>
> I raise the Alpha specifically because I had to debug code where people did "stupid things thinking they
> are clever" on Alpha-like hardware (an embedded application specific processor), and "knowing" that it worked
> because it was "obviously correct" and "works correctly on the predecessor dual 8051 system".
>
> I also note that you've not discussed the Cell SPU at all; the programmer model for the Cell SPU
> is that you have no direct access to coherent RAM. Instead, you set up transfers via a coherent
> DMA agent to shift data in and out of coherent RAM, and work on your local memory only; this
> meets your goals, and yet was unpopular in the market - and mostly didn't perform better than
> a more conventional coherent cache core. My conclusion from this is that in large part, people
> won't make the effort to get performance up if they have to consider coherency too deeply.
By that reasoning, not having any bounds check is good, because somebody is sure to pass over the wrong size sometimes, and a program would work correctly without the bounds check but fail with the bounds check.