By: nksingh (none.delete@this.none.non), August 21, 2014 2:54 pm
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on August 21, 2014 12:31 pm wrote:
> nksingh (none.delete@this.none.non) on August 21, 2014 11:34 am wrote:
> > > According to my understanding, the cheapest practical way to get effect of membar in
> > > x86 WB memory region would be reading from the address of last write. Or, if you want
> > > barrier after read (I never want, but I am not a lockless guy), writing to some dummy
> > > locations that you are likely to own and then reading that location back.
> >
> > From my interpretation of the x86 memory model and your statement above, I think you won't
> > get the ordering you desire. There's a squirrely exception in the x86 memory order model
> > for store-buffer forwarding. In the version of the Software Dev Manual I have on hand, this
> > behavior is spelled out in a section called "Intra-Processor Forwarding Is Allowed."
> >
>
> I will get ordering I desire, because I do not desire sequential consistency.
> All I want is to get as strong effect as architecturally guaranteed by MFENCE instruction.
>
> Pay attention that according to MEMORY ORDERING section of "Intel 64 and IA-32 Architectures
> Software Developer’s Manual" MFENCE does not help total order at all.
> Locked instructions appear to be the only documented way to achieve it.
The memory ordering section is where I was looking. Take a close look at the litmus test in section 8.2.3.5 of volume 3 in the latest manual (it's within the memory ordering section). The example shows that a store can still be observed as reordered after a later load from a different address, even when a read of the stored address sits in between. To reproduce part of the example here:
The program written as:
mov [_x], 1
mov r1, [_x]
mov r2, [_y]
Can be viewed from another processor as:
mov r2, [_y]
mov [_x], 1
mov r1, [_x]
This is the exact same reordering you would observe if you didn't have the intervening read of [_x]: the read is satisfied from the store buffer, so it does nothing to make the store globally visible before the load of [_y]. There's no free lunch. To get an mbar, you need to use a LOCK instruction or an appropriate *FENCE.
In Windows, we use "lock or [esp], 0" as a generic fence, since that was the fastest method on the target CPUs when that compiler intrinsic was introduced.
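For anyone who wants to poke at this themselves, here is a rough C++ sketch of the same litmus test. It is my own illustration, not code from the manual or from Windows: the function name count_both_zero and the thread-per-iteration harness are made up for the example, and I'm assuming that std::atomic_thread_fence(std::memory_order_seq_cst) is lowered to MFENCE or a locked instruction on x86, which is what mainstream compilers typically do. Each thread stores to its own variable, reads it back, then loads the other variable. Without the fence, both threads may read 0 from the other's variable; with the fence, the C++ memory model forbids that outcome.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store-buffering litmus test `iterations`
// times and returns how often BOTH threads read 0 from the other thread's
// variable (the outcome a full barrier is supposed to rule out).
int count_both_zero(bool use_fence, int iterations) {
    assert(iterations > 0);
    std::atomic<int> x{0}, y{0};
    int both_zero = 0;

    for (int i = 0; i < iterations; ++i) {
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);
        int r1 = -1, r2 = -1, r3 = -1, r4 = -1;

        std::thread t0([&] {
            x.store(1, std::memory_order_relaxed);   // mov [_x], 1
            r1 = x.load(std::memory_order_relaxed);  // mov r1, [_x] (read-back)
            if (use_fence)                           // full barrier, if requested
                std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = y.load(std::memory_order_relaxed);  // mov r2, [_y]
        });
        std::thread t1([&] {
            y.store(1, std::memory_order_relaxed);   // mov [_y], 1
            r3 = y.load(std::memory_order_relaxed);  // mov r3, [_y] (read-back)
            if (use_fence)
                std::atomic_thread_fence(std::memory_order_seq_cst);
            r4 = x.load(std::memory_order_relaxed);  // mov r4, [_x]
        });
        t0.join();
        t1.join();

        // Each thread always sees its own store, fence or no fence.
        assert(r1 == 1 && r3 == 1);
        if (r2 == 0 && r4 == 0)
            ++both_zero;
    }
    return both_zero;
}
```

Calling count_both_zero(true, n) should always return 0. Calling count_both_zero(false, n) may return a nonzero count, though spawning a thread per iteration makes the window narrow, so seeing 0 there proves nothing. Also keep in mind that with relaxed atomics the compiler is itself allowed to reorder these accesses, so this is a sketch of the hardware behavior rather than an exact transcription of the assembly.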