By: Michael S (already5chosen.delete@this.yahoo.com), August 22, 2014 8:18 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on August 22, 2014 7:33 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on August 22, 2014 3:16 am wrote:
> > anon (anon.delete@this.anon.com) on August 21, 2014 11:17 pm wrote:
> > > Michael S (already5chosen.delete@this.yahoo.com) on August 21, 2014 12:31 pm wrote:
> > > > nksingh (none.delete@this.none.non) on August 21, 2014 11:34 am wrote:
> > > > > > According to my understanding, the cheapest practical way to get effect of membar in
> > > > > > x86 WB memory region would be reading from the address of last write. Or, if you want
> > > > > > barrier after read (I never want, but I am not a lockless guy), writing to some dummy
> > > > > > locations that you are likely to own and then reading that location back.
> > > > >
> > > > > From my interpretation of the x86 memory model and your statement above, I think you won't
> > > > > get the ordering you desire. There's a squirrely exception in the x86 memory order model
> > > > > for store-buffer forwarding. In the version of the Software Dev Manual I have on hand, this
> > > > > behavior is spelled out in a section called "Intra-Processor Forwarding Is Allowed."
> > > > >
> > > >
> > > > I will get ordering I desire, because I do not desire sequential consistency.
> > > > All I want is to get as strong effect as architecturally guaranteed by MFENCE instruction.
> > > >
> > > > Pay attention that according to MEMORY ORDERING section of "Intel 64 and IA-32 Architectures
> > > > Software Developer’s Manual" MFENCE does not help total order at all.
> > > > Locked instructions appear to be the only documented way to achieve it.
> > >
> > > [X] := 1
> > > r1 := [Y]
> > >
> > > vs
> > >
> > > [Y] := 1
> > > r2 := [X]
> > >
> > > With a global sequential ordering, the condition (r1 != 0 or r2 != 0) holds. It does
> > > not hold for x86, due to reordering loads before stores. x86 with barriers:
> > >
> > > [X] := 1
> > > mfence
> > > r1 := [Y]
> > >
> > > vs
> > >
> > > [Y] := 1
> > > mfence
> > > r2 := [X]
> > >
> > > Then the condition holds.
> >
> > I don't think so. According to my understanding of the rules, condition does not hold.
> > IMHO, you incorrectly interpret the rule that says "Reads
> > cannot pass earlier LFENCE and MFENCE instructions".
>
> This paragraph seems difficult to misinterpret:
>
> "The MFENCE instruction combines the functions of LFENCE and SFENCE by establishing a memory fence
> for both loads and stores. It guarantees that all loads and stores specified before the fence
> are globally observable prior to any loads or stores being carried out after the fence."
>
Yes, it seems you are right. The same is said in the paragraph that describes the store buffer:
"In general, the existence of the store buffer is transparent to software, even in systems that use multiple processors.
The processor ensures that write operations are always carried out in program order. It also insures that the
contents of the store buffer are always drained to memory in the following situations:
...........
• (Pentium 4 and more recent processor families only) When using an MFENCE instruction to order stores."
So, MFENCE is indeed stronger than a store followed by a load from the same location.
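For anyone who wants to try the fenced version of the [X]/[Y] test from the quoted posts, here is a minimal C++ sketch. The function name count_sb_violations and the iteration count are my own hypothetical choices, not something from the manual. It assumes a C++11 compiler; on x86 the sequentially consistent fence is typically compiled to MFENCE or to a locked instruction, depending on the compiler. The C++ memory model guarantees that with a seq_cst fence between the store and the load in both threads, r1 == 0 and r2 == 0 cannot both be observed, so the returned count should be zero.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper (not from the thread): runs the store-buffer litmus
// test `iterations` times and returns how often both threads read 0.
int count_sb_violations(int iterations) {
    assert(iterations >= 0);
    int violations = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> X{0}, Y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            X.store(1, std::memory_order_relaxed);               // [X] := 1
            std::atomic_thread_fence(std::memory_order_seq_cst); // full fence
            r1 = Y.load(std::memory_order_relaxed);              // r1 := [Y]
        });
        std::thread t2([&] {
            Y.store(1, std::memory_order_relaxed);               // [Y] := 1
            std::atomic_thread_fence(std::memory_order_seq_cst); // full fence
            r2 = X.load(std::memory_order_relaxed);              // r2 := [X]
        });
        t1.join();
        t2.join();

        // The condition (r1 != 0 or r2 != 0) fails only if both loads saw 0.
        if (r1 == 0 && r2 == 0) ++violations;
    }
    return violations;
}
```

Calling count_sb_violations(500) from main and checking for a zero result exercises the guarantee. Removing the two fences turns this into the unfenced version, where a nonzero count is permitted (though not certain to show up on any given run).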
> > According to my understanding of this rule, it is strictly local and has no global effects.
>
> "Strictly local" ordering between loads and stores makes no sense. If you don't know when
> the store is going to become visible, it hardly matters when the load is carried out.
>
> > Looking at it from the perspective of what is happening in hardware, I claim that mfence is allowed to drain
> > the local store queue, but is not obliged to drain it. So, despite the fences, writes to [X] and [Y] can still
> > be in their respective store queues while reads are served from their respective local caches.
> >
> > > However, if I read your idea correctly:
> > >
> > >
> > > [X] := 1
> > > r8 := [X]
> > > r1 := [Y]
> > >
> > > vs
> > >
> > > [Y] := 1
> > > r9 := [Y]
> > > r2 := [X]
> > >
> > > I'm fairly sure this does NOT make the condition hold, exactly due to the store forwarding exception.
> > >
> > > If you are reading it as, "loads have to be in-order, therefore
> > > the 2nd load must be executed after the first,
> > > therefore the 2nd load must be executed after the store," then I can understand where you get the idea.
> >
> > Yes, that's the source.
> >
> > > However, the
> > > store forwarding exception says that loads can be satisfied
> > > from a location before stores to that location
> > > become visible to other CPUs (it's actually more an exception
> > > to cache coherency than to memory consistency).
> > > The first load can be executed before the store becomes visible to other CPUs -- I cannot see any rule that
> > > says the second load can not also be executed before that store becomes visible.
> > >
> >
> > But mfence is no better. Only LOCK helps with total ordering over WB region.
>
> I think you read too much into this "total ordering" of locked instructions. You will actually notice that
> 8.2.3.7 example of single global store ordering for normal stores is exactly the same as 8.2.3.8 guarantee
> for single global store ordering for locked stores. The only difference that I can see, and presumably the
> reason they call it "total ordering", is that it does not allow the store buffer relaxation.
>
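The locked-instruction variant discussed above can be sketched the same way. This is only an illustration under assumptions: the function name count_violations_with_xchg is hypothetical, and the code relies on the C++ seq_cst guarantees rather than on any particular instruction. On x86 a seq_cst exchange is typically emitted as XCHG, which is implicitly locked. Since every seq_cst operation takes part in one total order, at least one of the two loads must come after the other thread's exchange, so both threads reading 0 should never happen.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper (not from the thread): the same litmus test, but the
// store is an atomic exchange instead of a plain store followed by a fence.
int count_violations_with_xchg(int iterations) {
    assert(iterations >= 0);
    int violations = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> X{0}, Y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            X.exchange(1, std::memory_order_seq_cst);   // locked [X] := 1
            r1 = Y.load(std::memory_order_seq_cst);     // r1 := [Y]
        });
        std::thread t2([&] {
            Y.exchange(1, std::memory_order_seq_cst);   // locked [Y] := 1
            r2 = X.load(std::memory_order_seq_cst);     // r2 := [X]
        });
        t1.join();
        t2.join();

        // Count the outcome that sequential consistency forbids.
        if (r1 == 0 && r2 == 0) ++violations;
    }
    return violations;
}
```

As with the fenced version, a call such as count_violations_with_xchg(500) is expected to return zero.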