By: Brendan (btrotter.delete@this.gmail.com), April 11, 2013 7:06 am
Room: Moderated Discussions
Hi,
Eric Bron (eric.bron.delete@this.zvisuel.privatefortest.com) on April 11, 2013 3:58 am wrote:
> IMO there is some fundamental reasons against explicit software prefetch (besides the fact
> that hw prefetchers render it redundant in most situations), from the top of my head:
>
> 1) It is basically a single thread thing
A logical CPU can only run one software thread at a time. Switching between threads will upset both software and hardware prefetching, but that's only a symptom of a larger problem (the working set changes on every switch) and not a problem with software or hardware prefetching itself.
> 2) the IPC * clock frequency vs system memory latency wall, IPC * clock progress much faster than
> memory latency so you have to use higher and higher prefetch scheduling distances, you can use a
> variable offset (based on a training phase at startup) instead of a constant to make it future proof,
> but the hard fact is that the distance grow so much that you end up missing prefetch for the 1st
> iterations of your loops and wasting memory bandwidth for elements beyond your arrays
This is a generic problem with optimisation - if you don't know much about the target, then you can't do compile-time optimisations to suit that specific target. It doesn't matter if the optimisation is prefetch scheduling distance, or using "SSE version X", or optimising for whatever decoding rules the target uses, or anything else.
This is not a problem with software prefetching; it's a problem caused by distributing software as pre-compiled native binaries. A better idea is "compile before run" - e.g. shipping software as some sort of byte-code (such as LLVM bitcode or .NET CIL) and compiling it for the specific system when it's installed on that system.
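To make the prefetch scheduling distance point concrete, here is a rough sketch (not anyone's production code) of a loop where the distance is a run-time value instead of a constant baked into the binary. It assumes GCC or Clang, because `__builtin_prefetch` is a compiler builtin and not standard C. The function name `sum_with_prefetch` and the `distance` parameter are made up for illustration; where the value comes from (install-time compilation, a start-up training phase, etc.) is left open.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum an array while prefetching `distance` elements ahead.
 * `distance` is a hypothetical run-time parameter, chosen for the
 * specific machine rather than fixed when the binary was built.
 * Assumes GCC or Clang (__builtin_prefetch is not standard C). */
static int64_t sum_with_prefetch(const int32_t *data, size_t n, size_t distance)
{
    int64_t total = 0;

    for (size_t i = 0; i < n; i++) {
        /* Same test as (i + distance < n), written so it cannot
         * overflow. Skipping the prefetch here avoids spending
         * memory bandwidth on elements beyond the array. */
        if (distance < n - i) {
            /* 0 = prefetch for reading, 3 = keep in all cache levels. */
            __builtin_prefetch(&data[i + distance], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

A call looks like `sum_with_prefetch(values, count, 64)`, where 64 is only a placeholder. The bounds check deals with the wasted bandwidth past the end of the array, but the first `distance` iterations still run without the benefit of a prefetch. A real version would also issue one prefetch per cache line instead of one per element.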
- Brendan