By: dmcq (dmcq.delete@this.fano.co.uk), July 15, 2015 10:09 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on July 15, 2015 8:29 am wrote:
> dmcq (dmcq.delete@this.fano.co.uk) on July 15, 2015 8:04 am wrote:
> > NoSpammer (no.delete@this.spam.com) on July 15, 2015 7:05 am wrote:
> > > anon (anon.delete@this.anon.com) on July 15, 2015 2:26 am wrote:
> > > > Michael S (already5chosen.delete@this.yahoo.com) on July 15, 2015 1:19 am wrote:
> > > > > anon (anon.delete@this.anon.com) on July 14, 2015 9:37 pm wrote:
> > > > > > NoSpammer (no.delete@this.spam.com) on July 14, 2015 2:02 pm wrote:
> > > > > > > Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on July 13, 2015 4:50 pm wrote:
> > > > > > > > NoSpammer (no.delete@this.spam.com) on July 13, 2015 12:34 pm wrote:
> > > > > > > > > With x86 a volatile variable is at least up to date monotonically with respect to what you've seen
> > > > > > > > > so far. That's actually good enough and it's excellent for producer-consumer type of stuff.
> > > > > > > >
> > > > > > > > That's not even true on x86. Loads may be lifted before stores
> > > > > > > > and so you may read an incorrect or "impossible" value.
> > > > > > >
> > > > > > > Let me rephrase it: suppose you have a volatile location
> > > > > > > you observe, it's possible that you see all changes
> > > > > > > prior to a volatile change, but not yet the volatile change, it's also possible that you see all changes
> > > > > > > and the volatile changed, but it's not possible to see a volatile changed but then in later code not see
> > > > > > > prior changes.
> > > > > >
> > > > > > At its most general, looking at system-wide state, this is untrue. Forwarding from the store
> > > > > > queue means that a write can be seen (from one CPU) and then not be seen (from another).
> > > > > >
> > > > >
> > > > > It seems that in the explanation above NoSpammer was talking about a situation in which all changes of interest
> > > > > are done by one (producer) CPU and observed by another (consumer) CPU. For such a situation x86 ordering
> > > > > guarantees are not just strong enough, but stronger than necessary.
> > > >
> > > > It sounded like he made a general statement about "the system". He said that the constraint is
> > > > good for producer-consumer as a particular usage, but made a more general statement first.
> > >
> > > I think it's correct even in the general case when you observe (read) from one thread what the
> > > other threads are writing. You will observe their writes in order, but of course some randomness
> > > in which one will get ahead when. I was not claiming that the value is up to date system-wise,
> > > only that it is at least up to date with respect to your previous reads (in the same thread).
> > >
> > > > Of course all this is making a number of assumptions to begin with, no non-temporal
> > > > moves, no string instructions, clflush; memory is aligned, etc.
> > >
> > > These are not really issues if you WANT to write non-interlocked code. Also you can get around by
> > > doing the right thing just before sync-point. You don't really want to use non-temporal instructions
> > > and fancy flushes in these cases. As for string instructions (rep stos/rep movs) I've relied on the
> > > observation that if you see the first bytes correctly and the last bytes correctly, you also probably
> > > see everything in between. Running CRC checks on such messages for about 3 years has shown no errors
> > > so I'm confident this works (why would the CPU push out string writes in random order).
> >
> > I doubt a CPU would want to push it out in random order, but
> > consider the problem in general with weak consistency.
> > If there are two memory controllers and the string occupies
> > three lines then the second memory controller might
> > be satisfying some requests by another CPU delaying the write of the middle section of the string.
> >
>
> As long as we are talking about ordering in WB/WT regions, order of arrival to external memory does
> not matter. What matters is the *observed* order, governed by cache line ownership. And the latter, in
> the case of x86 string instructions, can't deviate too far from continuous, because on x86 string
> instructions are interruptible/restartable and the only saved state is the direction, two pointers and
> a counter. With such minimal state saved, only a continuous operation can be correctly restarted.
There you go, making the current behavior a constraint instead of telling the hardware what is required. That sort of thing means they can't implement a fire-and-forget type of store: the cache line has to be kept around, instead of being reused even when nobody else is currently using it, until one is certain that the value has been stored away such that any read from another CPU will get the updated value in order relative to other writes from this CPU.
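To make "telling the hardware what is required" concrete, here is a minimal sketch in C++11 of the producer-consumer case discussed above, with the one ordering requirement stated explicitly rather than inferred from how current x86 parts behave. The names (Message, produce, consume, run_once) and the 192-byte payload are my own assumptions for illustration, not anything from the earlier posts, and this is only one way of expressing it.

```cpp
#include <atomic>
#include <cassert>
#include <cstring>
#include <thread>

// Hypothetical message slot: a payload plus a "ready" flag.
// 192 bytes covers at least three cache lines, assuming 64-byte lines.
struct Message {
    char payload[192];
    std::atomic<bool> ready{false};
};

// Producer: fill the payload, then publish it with a release store.
// The release is the stated requirement: every earlier write must be
// visible to any thread that observes ready == true with an acquire load.
void produce(Message& m, char fill) {
    std::memset(m.payload, fill, sizeof m.payload);
    m.ready.store(true, std::memory_order_release);
}

// Consumer: spin until the acquire load sees the flag, then check every
// byte of the payload instead of only the first and last ones.
bool consume(Message& m, char expected) {
    while (!m.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    for (char c : m.payload) {
        if (c != expected) return false;
    }
    return true;
}

// Runs one producer and one consumer thread over a fresh message and
// reports whether the consumer saw a fully written payload.
bool run_once(char fill) {
    Message m;
    bool ok = false;
    std::thread consumer([&] { ok = consume(m, fill); });
    std::thread producer([&] { produce(m, fill); });
    producer.join();
    consumer.join();
    return ok;
}
```

Calling run_once('x') should return true on any conforming implementation. On x86 the release store and acquire load normally compile to plain moves, so this costs nothing there, but the code no longer depends on how the payload writes happen to be pushed out, and weaker hardware stays free to do whatever it likes before the release.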