By: Linus Torvalds (torvalds.delete@this.osdl.org), October 27, 2006 6:28 pm
Room: Moderated Discussions
Tzvetan Mikov (tzvetanmi@yahoo.com) on 10/27/06 wrote:
>
>I am blatantly attempting to hijack the subject:
>
>Are there any technical reasons, other than ifetch
>bandwidth, which make explicit barriers more expensive than
>implicit ones, if the goal is to make them fast ?
Well, you could always make all barriers implicit, and then
the explicit barriers are no-ops, so in that sense, they'd
be free (apart from I$). But they'd be free only if they are
pointless, so that's not a very interesting case, and not
one that argues for explicit barriers..
I'd argue that there are two costs to real barriers:
- the software cost. The reason x86 doesn't have them
is that a lot of software doesn't have them, and a
lot of programmers don't understand them. I think the
example of the alpha memory ordering issue should have
shown that even clueful programmers simply don't
know or even realize that they don't know the
subtle pitfalls in this area.
So not having to have barriers is actually a real
advantage. Like it or not, it's simply easier to
generate code that actually works on x86 than it is on
alpha, and it has nothing to do with performance. Trust
me, even among kernel programmers, most don't really
realize how subtle these things can be. And we're the
best of the best (*), and I'd claim that we work more
with true SMP threads than anybody else does.
(*) And so very modest, too.
So don't ignore this cost. It's probably the biggest one,
and it makes any performance issues secondary.
- I don't think many people realize this, but a
"memory barrier" is not actually all that well-defined.
You actually have a lot of cases, and it's not
like there is only one type. A true memory barrier is
actually more like a half-permeable barrier that can
recognize the type of the memory operation: you can have
barriers that do not allow later reads to pass
the barrier one way, but allow earlier reads to
pass it, or any writes to be buffered past it.
In other words, pretty much every memory barrier
model out there actually simplifies the barrier model
to a few common cases, but in doing so, it may force the
programmer to use a sledgehammer where a more subtle
ordering would make more sense (see the sketch just after
this list). And yes, that can mean that implicit barriers
can actually do better.
More importantly, a lot of barriers are totally and
utterly unnecessary 99.999% of the time. A really
high-performance implementation may re-order things
very aggressively indeed, and then just check the end
result - for example, if all the memory locations
involved are in the L1 cache, and none of them take a
cache miss, then by definition the ordering didn't
matter, because nobody else could have seen it.
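To make the "several kinds of ordering" point concrete, here is a
minimal userspace sketch of the classic publish/consume handoff
(illustrative only - the names and the C <stdatomic.h> spelling are
just one way to express it, not anything from a real codebase). The
release store and the acquire load are the "subtle" orderings: on x86
they typically compile down to plain MOVs, because the implicit
hardware ordering already covers them, while a sequentially-consistent
barrier on every access would be the sledgehammer. The alpha pitfall
mentioned above is the case where even a data-dependent read on the
consumer side needed a real barrier.

    /* Publish/consume with two different ordering strengths instead of
     * one big hammer.  The flag is the only atomic; the payload is plain
     * data that the release/acquire pair orders around it. */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static int payload;            /* ordinary data, published via the flag */
    static atomic_int ready;       /* publication flag */

    static void *producer(void *arg)
    {
        (void)arg;
        payload = 42;                                /* plain store */
        atomic_store_explicit(&ready, 1,
                              memory_order_release); /* "write barrier" + store */
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        while (!atomic_load_explicit(&ready,
                                     memory_order_acquire)) /* "read barrier" + load */
            ;                                        /* spin until published */
        printf("payload = %d\n", payload);           /* guaranteed to print 42 */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

Build it with something like "gcc -O2 -pthread"; on x86 it typically
needs no explicit fence instruction at all.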
This last point is also important: you don't actually have
to care about ordering, unless there was traffic from
another core that causes you to have to care. That's somewhat
simplified, of course: one of the reasons that the
memory pipeline is the most complex part of a modern CPU
is not even SMP effects, but making sure that the memory
accesses appear "in program order" even for just the
local thread.
So notice how a CPU actually does a lot of this careful
work to make execution look serialized already. In
many ways, SMP memory ordering is not really any different.
It just means that you may have to check that the cache
lines you accessed are still under your own ownership
until you can really commit the operation, but the point
here is that you could actually do an out-of-order CPU that
treated the (very rare) ordering mistake as a micro-fault
and just replayed it.
And what is the result of that kind of advanced CPU core?
It just means that all explicit ordering is pointless.
The core already tracks the ordering requirements for
purely local reasons (ie it may have executed a store
earlier, but it can't really retire it and "commit" it
until all previous work has finished).
In many ways, memory ordering instructions can just be
seen as a sign that the CPU isn't tracking everything it
could track, so the software needs to tell it to flush
some queue explicitly. That's not really software's job,
and especially since a lot of this is dynamic, the hw could
do better if it just tracked things itself instead of flushing
things unnecessarily!
But see above on the reason why you shouldn't expect sw
to do a great job. And besides - judging by past behaviour,
I suspect we'll continue to see smarter and smarter CPUs,
with less and less upside from explicit barriers. And at
some point, the upside just doesn't really exist at all,
and explicit memory barriers have only problems.
Will we get there? Maybe not. But I seriously believe that
the x86 is a better model for a reasonable future than the
alpha was (or powerpc, or mips, or ..). It has some
barriers, but they are pretty minimal (and I'm actually
told that even the read barrier may not actually be
needed by any x86 core that has ever shipped).
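As a rough illustration of how minimal that is, here is what a
portable barrier abstraction can collapse to on the two extremes (a
sketch with made-up my_smp_* names, not anybody's real kernel
definitions):

    /* Sketch only.  On x86, ordinary cacheable loads are not reordered
     * against other loads and stores are not reordered against other
     * stores, so the read and write barriers only have to stop the
     * compiler; only the full barrier needs a real instruction.  On
     * alpha, every one of them has to be a real hardware barrier. */
    #if defined(__x86_64__)
    #define my_smp_mb()   __asm__ __volatile__("mfence" ::: "memory")
    #define my_smp_rmb()  __asm__ __volatile__(""       ::: "memory") /* compiler-only */
    #define my_smp_wmb()  __asm__ __volatile__(""       ::: "memory") /* compiler-only */
    #elif defined(__alpha__)
    #define my_smp_mb()   __asm__ __volatile__("mb"     ::: "memory")
    #define my_smp_rmb()  __asm__ __volatile__("mb"     ::: "memory")
    #define my_smp_wmb()  __asm__ __volatile__("wmb"    ::: "memory")
    #endif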
Side note: the x86 actually has "lfence" and "sfence"
barriers, and they are all about uncached but buffered
IO accesses! Once you realize that, go back and read
my point above about replay etc - that obviously will
not work in the presence of actual IO loads that have
side effects. So it turns out that memory barriers are
still needed, but they are relegated entirely to being
about IO, not memory!
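The closest thing on the normal-memory side is non-temporal
("streaming") stores, which bypass the cache and sit in
write-combining buffers much like those buffered IO writes do - which
is why they are about the only place you'll see sfence in ordinary
x86 code. A rough sketch (function and buffer names made up):

    /* Fill a buffer with non-temporal stores, then publish a "done" flag.
     * The streaming stores are weakly ordered: without the sfence they
     * could still be sitting in the write-combining buffers when the flag
     * becomes visible to another observer. */
    #include <emmintrin.h>   /* _mm_stream_si32 (SSE2) */
    #include <xmmintrin.h>   /* _mm_sfence (SSE) */

    #define BUF_WORDS 1024

    void publish_buffer(int *buf, volatile int *done_flag)
    {
        for (int i = 0; i < BUF_WORDS; i++)
            _mm_stream_si32(&buf[i], i);   /* bypasses the cache */

        _mm_sfence();                      /* drain the WC buffers first */
        *done_flag = 1;                    /* ordinary, ordered store */
    }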
Linus