By: Michael S (already5chosen.delete@this.yahoo.com), August 21, 2014 6:40 pm
Room: Moderated Discussions
nksingh (none.delete@this.none.non) on August 21, 2014 2:54 pm wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on August 21, 2014 12:31 pm wrote:
> > nksingh (none.delete@this.none.non) on August 21, 2014 11:34 am wrote:
> > > > According to my understanding, the cheapest practical way to get effect of membar in
> > > > x86 WB memory region would be reading from the address of last write. Or, if you want
> > > > barrier after read (I never want, but I am not a lockless guy), writing to some dummy
> > > > locations that you are likely to own and then reading that location back.
> > >
> > > From my interpretation of the x86 memory model and your statement above, I think you won't
> > > get the ordering you desire. There's a squirrely exception in the x86 memory order model
> > > for store-buffer forwarding. In the version of the Software Dev Manual I have on hand, this
> > > behavior is spelled out in a section called "Intra-Processor Forwarding Is Allowed."
> > >
> >
> > I will get ordering I desire, because I do not desire sequential consistency.
> > All I want is to get as strong effect as architecturally guaranteed by MFENCE instruction.
> >
> > Pay attention that according to MEMORY ORDERING section of "Intel 64 and IA-32 Architectures
> > Software Developer’s Manual" MFENCE does not help total order at all.
> > Locked instructions appear to be the only documented way to achieve it.
>
> The memory ordering section is where I was looking. Take a close look at the litmus test in section 8.2.3.5
> of volume 3 in the latest manual (it's within the memory ordering section). The example shows that you
> can observe store-load reordering for later reads. To reproduce part of the example here:
> The program written as:
> mov [_x], 1
> mov r1, [_x]
> mov r2, [_y]
>
> Can be viewed from another processor as:
> mov r2, [_y]
> mov [_x], 1
> mov r1, [_x]
>
> This is the exact same reordering you would observe if you didn't have the intervening read of [_x].
> There's no free lunch. To get an mbar, you need to use a LOCK instruction or an appropriate *FENCE.
I think you are wrong about the latter. Architecturally, only a LOCKed instruction can guarantee that "r2 = 0 and r4 = 0" (the outcome from the manual's full two-processor version of this example) would not happen. Fences between instructions are not going to guarantee anything at all.
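To make the litmus test concrete, here is a toy sketch in Python that enumerates every interleaving of the manual's two-processor example. It is not taken from the manual. It assumes a simple model with one FIFO store buffer per processor and forwarding from that buffer to the processor's own loads. The names explore, P0, P1 and r2_r4 are made up for illustration, and locked=True is only a stand-in for replacing the plain store with a locked instruction that writes memory directly. It says nothing about what any fence instruction does.

```python
def explore(programs, locked=False):
    """Return every final register assignment reachable in a toy
    store-buffer model (one FIFO buffer per CPU, loads forward from it)."""
    def put(tup, i, value):
        return tup[:i] + (value,) + tup[i + 1:]

    n = len(programs)
    init = ((0,) * n, ((),) * n, (), ())  # pcs, buffers, memory, registers
    seen, stack, finals = {init}, [init], set()
    while stack:
        pcs, bufs, mem_items, reg_items = stack.pop()
        mem, regs = dict(mem_items), dict(reg_items)
        nxt = []
        for cpu, prog in enumerate(programs):
            buf = bufs[cpu]
            if buf:
                # Drain this CPU's oldest buffered store to memory.
                addr, val = buf[0]
                nxt.append((pcs, put(bufs, cpu, buf[1:]), {**mem, addr: val}, regs))
            if pcs[cpu] < len(prog):
                op, a, b = prog[pcs[cpu]]
                pcs2 = put(pcs, cpu, pcs[cpu] + 1)
                if op == "store" and not locked:
                    # Plain store: goes into the private store buffer first.
                    nxt.append((pcs2, put(bufs, cpu, buf + ((a, b),)), mem, regs))
                elif op == "store":
                    # Assumed locked store: waits for an empty buffer,
                    # then updates memory directly.
                    if not buf:
                        nxt.append((pcs2, bufs, {**mem, a: b}, regs))
                else:
                    # Load: newest matching entry in own buffer wins,
                    # otherwise read memory (initially 0).
                    hits = [v for addr, v in buf if addr == b]
                    val = hits[-1] if hits else mem.get(b, 0)
                    nxt.append((pcs2, bufs, mem, {**regs, a: val}))
        if not nxt:
            # Both programs finished and both buffers are empty.
            finals.add(reg_items)
        for p, bfs, m, r in nxt:
            state = (p, bfs, tuple(sorted(m.items())), tuple(sorted(r.items())))
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return [dict(f) for f in finals]


# The two-processor example: each CPU stores, reads its own store back,
# then reads the other CPU's location.
P0 = [("store", "x", 1), ("load", "r1", "x"), ("load", "r2", "y")]
P1 = [("store", "y", 1), ("load", "r3", "y"), ("load", "r4", "x")]


def r2_r4(locked):
    return {(r["r2"], r["r4"]) for r in explore([P0, P1], locked)}


print(sorted(r2_r4(locked=False)))
print(sorted(r2_r4(locked=True)))
```

Under these assumptions the pair (0, 0) for (r2, r4) shows up only in the run with plain stores, which is the point of the example: reading back your own store is satisfied from the store buffer and orders nothing for the other processor.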
>
> In Windows, we use "lock or [esp], 0," for a generic fence since that was the fastest
> method on the target CPUs when that compiler intrinsic was introduced.
>
Depends on what you want to achieve. If the purpose is a total order in a WB region, then lock is not just the fastest method on a particular processor, but the only method that is guaranteed to work. Even if there exists a combination of non-atomic fences that happens to achieve the same result on all past and current Intel and AMD processors, it is still likely to be broken in the future.
BTW, in what situation would Windows want a "generic fence"? Right now I can't see where it is potentially useful for anything not absolutely crazy.