By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), April 9, 2013 12:03 pm
Room: Moderated Discussions
Symmetry (someone.delete@this.somewhere.com) on April 9, 2013 10:41 am wrote:
> I'm also curious if letting the compiler take care of prefetching is something
> that might already exist in, say, Intel's C compiler even if it doesn't in gcc.
Software prefetching is moronic.
It's a great way to generate almost-optimal behavior on one particular CPU with one particular cache setup and memory subsystem (and one particular load), but then it falls flat on its face whenever there is some other micro-architecture or cache layout, or when you have other things going on on that same machine.
Don't do it. It's useful only for benchmarking or embedded environments where you really do control the environment well enough that it's ok to do better under only that one particular set of circumstances.
HPC falls under that "embedded" case, btw, and is likely the only context where software prefetching is really valid.
Don't trust numbers generated with software prefetching. The very paper you are pointing to should convince you of how fragile it is. Look at the constant used for prefetching. It is a magic constant of 1600 bytes ahead of the stream. Do you think that 1600 is somehow fundamental? Or do you think it might be a magic value that depends on the particular memory subsystem and micro-architecture used for testing?
My guess is that the 1600 byte offset is a magic "oh, look, the hardware prefetchers work really well on this machine, but they don't prefetch across page boundaries, and 1600 bytes is the magic moment where we can help tweak things on this particular uarch".
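For illustration, a software-prefetch loop of this kind typically looks something like the following sketch in C with GCC's __builtin_prefetch. This is not the paper's code; the function name is made up, and the hard-coded 1600-byte lookahead just stands in for the magic, machine-specific constant being discussed:

/* Illustrative sketch only: a streaming loop with a hard-coded
 * prefetch distance tuned for one particular cache/memory subsystem. */
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_DISTANCE 1600  /* magic value, only "right" on one machine */

int64_t sum_array(const int64_t *data, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Ask the hardware to start loading the cache line we expect to
         * touch PREFETCH_DISTANCE bytes from now (read access, high
         * temporal locality). Compiles to a prefetch instruction where
         * the target supports one, otherwise it's a no-op. */
        __builtin_prefetch((const char *)&data[i] + PREFETCH_DISTANCE, 0, 3);
        sum += data[i];
    }
    return sum;
}

The extra prefetch instructions are issued on every micro-architecture and every load, whether or not they help; only the tuning that picked 1600 was specific to the tested machine.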
What happens when there is a next-page prefetcher in hardware, like in IvyBridge? I wouldn't be surprised if all the wins from software prefetching go away, while the downsides remain. Which is the classic situation with software prefetching. It only wins when you get the exact pattern you tested, and then it loses on other patterns.
Btw, don't get me wrong. I think this is exactly the kind of thing that you can use a functional language to express, and generating optimized libraries from functional descriptions is cool. So I think the Haskell part is interesting. The prefetching part? Not so much.
Linus