By: anon (anon.delete@this.anon.com), August 22, 2014 5:50 pm
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on August 22, 2014 8:18 am wrote:
> anon (anon.delete@this.anon.com) on August 22, 2014 7:33 am wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on August 22, 2014 3:16 am wrote:
> > > anon (anon.delete@this.anon.com) on August 21, 2014 11:17 pm wrote:
> > > > Michael S (already5chosen.delete@this.yahoo.com) on August 21, 2014 12:31 pm wrote:
> > > > > nksingh (none.delete@this.none.non) on August 21, 2014 11:34 am wrote:
> > > > > > > According to my understanding, the cheapest practical way to get effect of membar in
> > > > > > > x86 WB memory region would be reading from the address of last write. Or, if you want
> > > > > > > barrier after read (I never want, but I am not a lockless guy), writing to some dummy
> > > > > > > locations that you are likely to own and then reading that location back.
> > > > > >
> > > > > > From my interpretation of the x86 memory model and your statement above, I think you won't
> > > > > > get the ordering you desire. There's a squirrely exception in the x86 memory order model
> > > > > > for store-buffer forwarding. In the version of the Software Dev Manual I have on hand, this
> > > > > > behavior is spelled out in a section called "Intra-Processor Forwarding Is Allowed."
> > > > > >
> > > > >
> > > > > I will get ordering I desire, because I do not desire sequential consistency.
> > > > > All I want is to get as strong effect as architecturally guaranteed by MFENCE instruction.
> > > > >
> > > > > Pay attention that according to MEMORY ORDERING section of "Intel 64 and IA-32 Architectures
> > > > > Software Developer’s Manual" MFENCE does not help total order at all.
> > > > > Locked instructions appear to be the only documented way to achieve it.
> > > >
> > > > [X] := 1
> > > > r1 := [Y]
> > > >
> > > > vs
> > > >
> > > > [Y] := 1
> > > > r2 := [X]
> > > >
> > > > With a global sequential ordering, the condition (r1 != 0 or r2 != 0) holds. It does
> > > > not hold for x86, due to reordering loads before stores. x86 with barriers:
> > > >
> > > > [X] := 1
> > > > mfence
> > > > r1 := [Y]
> > > >
> > > > vs
> > > >
> > > > [Y] := 1
> > > > mfence
> > > > r2 := [X]
> > > >
> > > > Then the condition holds.
> > >
> > > I don't think so. According to my understanding of the rules, condition does not hold.
> > > IMHO, you incorrectly interpret the rule that says "Reads
> > > cannot pass earlier LFENCE and MFENCE instructions".
> >
> > This paragraph seems difficult to misinterpret:
> >
> > "The MFENCE instruction combines the functions of LFENCE and SFENCE by establishing a memory fence
> > for both loads and stores. It guarantees that all loads and stores specified before the fence
> > are globally observable prior to any loads or stores being carried out after the fence."
> >
>
> Yes, it seems you are right. The same said in paragraph that describes store buffer:
>
> In general, the existence of the store buffer is transparent
> to software, even in systems that use multiple processors.
> The processor ensures that write operations are always
> carried out in program order. It also insures that the
> contents of the store buffer are always drained to memory in the following situations:
> ...........
> • (Pentium 4 and more recent processor families only) When using an MFENCE instruction to order stores.
>
> So, MFENCE is indeed stronger than store followed by load to the same location.
I really don't think following a store with a load from the same location does as much as you think. I doubt it does *anything* that you can rely on, actually.

Say you have

[X] := 1
[Y] := 1
r1 := [Y]
r2 := [Z]

Now, again, you might think: stores are ordered wrt each other, loads are ordered wrt each other, and the load of Y cannot proceed before the store of Y, therefore the load of Z certainly cannot proceed before the store of X.

But as far as I can tell, that would still be a false assumption. Both stores may still be sitting in the store buffer, and in terms of observability, the manual says that a load from the same location is allowed to proceed before the store to that location is visible to other CPUs (the load is simply forwarded from the store buffer). Obviously they're not talking about local effects there (because that would break single-thread ordering). So in fact Z can be loaded before other CPUs can observe the store to X.
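
To make that concrete, here is a rough sketch that brute-forces every interleaving of the sequence above against a second CPU. It assumes a simplified TSO-style model, which is my reading and not Intel's formal definition: each CPU has a FIFO store buffer, a load checks its own buffer first (forwarding) and otherwise reads memory, and mfence cannot execute until the CPU's own buffer is empty. The observer thread and the names `outcomes`, `allows`, `READBACK`, `FENCED` and `OBSERVER` are made up for illustration.

```python
def outcomes(programs):
    """Return every final register assignment reachable in the simplified model."""
    n = len(programs)
    # State: (program counters, store buffers, memory, registers), all hashable.
    start = ((0,) * n, ((),) * n, (), ())
    seen, stack, results = set(), [start], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, regs = state
        if all(pcs[i] == len(programs[i]) for i in range(n)):
            results.add(regs)
        for i in range(n):
            # Option 1: CPU i drains its oldest buffered store to memory.
            if bufs[i]:
                addr, val = bufs[i][0]
                new_mem = dict(mem)
                new_mem[addr] = val
                new_bufs = bufs[:i] + (bufs[i][1:],) + bufs[i + 1:]
                stack.append((pcs, new_bufs, tuple(sorted(new_mem.items())), regs))
            # Option 2: CPU i executes its next instruction, if it has one.
            if pcs[i] == len(programs[i]):
                continue
            op = programs[i][pcs[i]]
            new_pcs = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
            if op[0] == "store":
                # Stores go into the local buffer, invisible to other CPUs.
                new_bufs = bufs[:i] + (bufs[i] + ((op[1], op[2]),),) + bufs[i + 1:]
                stack.append((new_pcs, new_bufs, mem, regs))
            elif op[0] == "load":
                val = dict(mem).get(op[1], 0)  # memory starts out as all zeroes
                for addr, buffered in bufs[i]:
                    if addr == op[1]:
                        val = buffered  # newest matching buffered store wins
                new_regs = dict(regs)
                new_regs[op[2]] = val
                stack.append((new_pcs, bufs, mem, tuple(sorted(new_regs.items()))))
            elif op[0] == "mfence" and not bufs[i]:
                # The fence only completes once the local buffer is empty.
                stack.append((new_pcs, bufs, mem, regs))
    return results


def allows(programs, **want):
    """True if some execution ends with all the given register values."""
    return any(all(dict(r)[k] == v for k, v in want.items())
               for r in outcomes(programs))


# The sequence from the post: store X, store Y, read Y back, then load Z.
READBACK = [("store", "X", 1), ("store", "Y", 1),
            ("load", "Y", "r1"), ("load", "Z", "r2")]
# Same thing with a real fence in place of the read-back.
FENCED = [("store", "X", 1), ("store", "Y", 1),
          ("mfence",), ("load", "Z", "r2")]
# Hypothetical second CPU: publish Z, fence, then look at X.
OBSERVER = [("store", "Z", 1), ("mfence",), ("load", "X", "r3")]

print(allows([READBACK, OBSERVER], r2=0, r3=0))  # True
print(allows([FENCED, OBSERVER], r2=0, r3=0))    # False
```

In this model, r2 == 0 together with r3 == 0 means the first CPU loaded Z before the observer's store to Z was visible, and the observer still could not see the store to X afterwards. The read-back version allows that outcome and the mfence version does not, which is the difference I was describing. It is only a toy model, so take it as an illustration of the argument and not as proof of what any particular chip does.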