By: Michael S (already5chosen.delete@this.yahoo.com), August 17, 2014 2:22 pm
Room: Moderated Discussions
Ricardo B (ricardo.b.delete@this.xxxxx.xx) on August 17, 2014 12:06 pm wrote:
> Ronald Maas (rmaas.delete@this.wiwo.nl) on August 17, 2014 9:57 am wrote:
> > In reality x86 will likely end up with the same number of barriers as
> > other architectures, because of the following scenario:
> > 1) Someone works on a project X that requires 5 memory barriers
> > to function. On x86 everything works flawlessly.
> > 2) He gets some critical bug reports because his software fails on other architectures, e.g. ARM.
> > 3) He spends time tracking down the root cause of the problem and, after adding 10 extra memory
> > barriers in various places, the software finally starts working properly on cores with weak
> > memory ordering. Actually only 1 or 2 extra barriers were needed, but the developer did not
> > remove the unneeded ones because it is late already and he needs to catch some sleep.
> > 4) Next day the developer gets paranoid: he/she does not want
> > to spend more nights / weekends burning the midnight
> > oil, so he/she reviews the code and for good measure puts in another 10 barriers in different sections of the
> > code where he/she suspects they may be needed, e.g. for PowerPC. Does not hurt to be careful, right?
> > 5) Now, at the end of the day, all architectures see 25 memory barriers regardless.
> >
> > It is not that developers are generally incompetent, but there are always corner cases that
> > are not trivial to understand. Software developers usually work under a lot of pressure
> > to deliver, so they do not have the luxury of reading through a whole bunch of PDFs
> > in order to get a proper level of understanding. Especially when their manager is looking
> > over their shoulder to make sure they are actually working on solving the problem.
>
>
> Yes and no.
>
> Again, taking the Linux kernel as an example, it has two families of barriers.
> rmb/wmb/mb are to be used to order accesses to devices, even in single-processor systems.
> smp_rmb/smp_wmb/smp_mb are to be used to order accesses across processors, and
> they become NOPs if the kernel is compiled for single-processor systems.
>
> Each of those barriers is then implemented in a target-specific manner.
>
> On x86, the first barrier family is implemented with the rather expensive lfence/sfence/mfence instructions.
> But the second family (smp_*) doesn't generate any actual instructions;
> it just prevents the compiler from optimizing across the barrier.
>
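If I read you right, on x86 that boils down to roughly the following. This is only a rough sketch to illustrate the idea, not the actual kernel code; the definitions are mine, only the macro names come from the kernel.

/* Device-ordering barriers: real fence instructions. */
#define mb()    asm volatile("mfence" ::: "memory")
#define rmb()   asm volatile("lfence" ::: "memory")
#define wmb()   asm volatile("sfence" ::: "memory")

/* SMP barriers on x86, per your description: no instruction at all,
 * just a compiler barrier that forbids reordering accesses across it. */
#define smp_mb()   asm volatile("" ::: "memory")
#define smp_rmb()  asm volatile("" ::: "memory")
#define smp_wmb()  asm volatile("" ::: "memory")
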
Wouldn't that introduce subtle bugs in complex lockless scenarios? After all, x86 *does* reorder later loads ahead of earlier unrelated stores in a software-visible manner. I don't expect anything like that to happen in the Linux kernel, because it just does not do crazy lockless stuff outside of one or two well-defined modules.
But if the same strategy is used in other big portable programs, it can cause trouble.
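
Here is a toy store-buffering test (my own example, nothing to do with kernel code) that shows what I mean: with only a compiler barrier between the store and the load, an x86 CPU is still free to let each thread's load pass its own earlier store, so both r1 and r2 can occasionally come out 0 in the same run. Replacing the compiler barrier with an mfence closes that window.

/* Build: gcc -O2 -pthread sb.c */
#include <pthread.h>
#include <stdio.h>

/* Compiler-only barrier, as the smp_* family would be under the scheme above. */
#define compiler_barrier() asm volatile("" ::: "memory")
/* #define full_barrier()  asm volatile("mfence" ::: "memory")  -- using this instead fixes it */

volatile int x, y;
volatile int r1, r2;

static void *t1(void *arg)
{
    (void)arg;
    x = 1;
    compiler_barrier();   /* stops the compiler, not the CPU */
    r1 = y;
    return NULL;
}

static void *t2(void *arg)
{
    (void)arg;
    y = 1;
    compiler_barrier();
    r2 = x;
    return NULL;
}

int main(void)
{
    for (int i = 0; i < 1000000; i++) {
        pthread_t a, b;
        x = y = 0;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            printf("store->load reordering seen at iteration %d\n", i);
    }
    return 0;
}

It may take a lot of iterations and some luck with scheduling to actually catch it, but the architecture permits it, which is all that matters for correctness.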