By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), April 11, 2013 11:56 am
Room: Moderated Discussions
Eric Bron (eric.bron.delete@this.zvisuel.privatefortest.com) on April 11, 2013 11:10 am wrote:
>
> with modern cores like Ivy Bridge it's generally very frustrating to toy
> with explicit prefetch since there is simply 0% speedup (no slowdown either)
We've seen slowdowns in the kernel.
Sometimes serious slowdowns.
For example, some microarchitectures do TLB fills on prefetch (which you'd think makes sense since you often do page-crossing prefetches of pointers). But then they actually seem to have trouble with the NULL pointer and slow down because the TLB fill fails and it's not zero-cost at all due to some uarch stupidity, so you have to do a conditional jump to not prefetch the end of a list. And now you slow down because the branch predicts horribly badly for the common case of short lists, and your nice cache behavior where the prefetch didn't do anything actually suffers.
And it's very annoying, because the prefetches probably made sense when they were added, and may still work fine on some machines. And on others, they are actively detrimental.
Sure, you can play games with these things - like dynamically turn them into no-ops by having instruction rewriting (so that you don't have to have runtime conditionals etc). The people who advocate sw prefetching always have a ".. but but but you could.." excuse. They never seem to get the "and what are you giving me in return for all this wasted effort" argument, when most of the time it's zero upside down the line.
In the end, I think almost every single time we added a prefetch instruction, it came back to bite us five years later and it got removed again. And most of the prefetches we still have are probably of negative actual worth, but they just haven't gotten removed, because nobody has bothered to do the performance analysis.
And yes, as you say, some of them remain because they just don't hurt (the nice array based ones with tight loops have neither I$ issues nor the above kind of TLB load issues, but they also don't tend to have any wins, since hardware does it better anyway these days).
Some of the prefetches are for things like "we know we are going to write to this, but the first access is a read, and if we do a write-prefetch we can avoid the shared state transition". So it's not actually for prefetching data per se, it's a hint to the cache state machine. And that may actually make sense (unlike actual *prefetching* it is not timing-sensitive), although I'd much rather see the OoO engine notice it on its own.
Linus