By: Michael S (already5chosen.delete@this.yahoo.com), August 17, 2014 5:56 pm
Room: Moderated Discussions
Ricardo B (ricardo.b.delete@this.xxxxx.xx) on August 17, 2014 3:14 pm wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on August 17, 2014 2:22 pm wrote:
> > Wouldn't it introduce subtle bugs in complex lockless scenarios?
> > After all, x86 *does* promote later loads over
> > earlier unrelated stores in a software-visible manner. I don't
> > expect anything like that to happen in the Linux
> > kernel, because it just does not do crazy lockless stuff outside of one or two well-defined modules.
> > But if the same strategy is used in other big portable programs it can cause trouble.
>
>
> Yes, you need a more encompassing strategy than just mapping the barriers to NOPs.
>
> In x86, the atomic operations (e.g., LOCK ADD) serve as a barrier for the Store over Load reordering case.
>
> So, one strategy is to make all your barriers NOPs but ensure that you always have an atomic operation.
> This is, I think, the strategy on Linux: all the lockless stuff is made using a
> series of atomic_* functions, which in Linux map to atomic x86 instructions.
>
> Another is to map barriers to otherwise unused atomic x86 instructions.
I am not sure that the atomic cure is better than the membar disease.
The big theoretical problem is that global ordering is part of the architectural semantics of x86 atomics. That can be pretty expensive when the same cache line is updated almost simultaneously by several writers on a big cc-NUMA machine, potentially more expensive than a mere membar.
The big practical problem is that on nearly all existing implementations atomics are much slower than non-atomics, even when the issuing core already has the cache line in Exclusive or Modified state.
As far as I understand, the cheapest practical way to get the effect of a membar in an x86 WB memory region would be to read from the address of the last write. Or, if you want a barrier after a read (I never do, but I am not a lockless guy), to write to some dummy location that you are likely to own and then read that location back.
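To make the two options being compared concrete, here is a minimal C++ sketch of the store-then-load case (the one x86 visibly reorders), written once with an explicit fence and once with a locked instruction doing the store. All names (Barrier, store_then_load, count_both_zero) are made up for illustration, and the instruction mapping in the comments is what compilers typically emit on x86, not a guarantee. It does not try to show the read-back trick from the last paragraph.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names for illustration; none of this is from the thread.
enum class Barrier { Fence, LockedStore };

// One side of the classic store-then-load (Dekker-style) litmus test:
// publish our own flag, then read the other side's flag.
template <Barrier B>
int store_then_load(std::atomic<int>& mine, std::atomic<int>& other) {
    if (B == Barrier::Fence) {
        mine.store(1, std::memory_order_relaxed);             // plain MOV on x86
        std::atomic_thread_fence(std::memory_order_seq_cst);  // the "membar" (typically MFENCE or a locked op)
        return other.load(std::memory_order_relaxed);         // plain MOV on x86
    }
    mine.exchange(1, std::memory_order_seq_cst);               // the "atomic" (typically XCHG, implicitly locked)
    return other.load(std::memory_order_seq_cst);              // plain MOV on x86
}

// Runs the litmus test `iterations` times and returns how often both
// threads read 0, i.e. how often a load was visibly hoisted above the
// earlier store.
template <Barrier B>
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] { r1 = store_then_load<B>(x, y); });
        std::thread t2([&] { r2 = store_then_load<B>(y, x); });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With either variant the both-zero outcome is forbidden, so count_both_zero should return 0. If both the fence and the exchange are replaced by nothing but the relaxed store and load, x86 is allowed to produce it. Which of the two variants is cheaper is exactly the question above, and the sketch does not measure that.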