By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), April 11, 2013 8:54 am
Room: Moderated Discussions
Eric Bron (eric.bron.delete@this.zvisuel.privatefortest.com) on April 11, 2013 3:58 am wrote:
[snip]
> IMO there is some fundamental reasons against explicit software prefetch (besides the fact
> that hw prefetchers render it redundant in most situations), from the top of my head:
>
> 1) It is basically a single thread thing
Actually, it is easy to conceive of cache hints (or even directives) that help under a multithreaded workload. E.g., annotating a store as the last communicating update could facilitate hardware pushing the data toward its consumer.
> 2) the IPC * clock frequency vs system memory latency wall, IPC * clock progress much faster than
> memory latency so you have to use higher and higher prefetch scheduling distances, you can use a
> variable offset (based on a training phase at startup) instead of a constant to make it future proof,
> but the hard fact is that the distance grow so much that you end up missing prefetch for the 1st
> iterations of your loops and wasting memory bandwidth for elements beyond your arrays
Some forms of prefetching are not as tightly bound to timing factors. E.g., declaring the start, end, and stride of a stream to be prefetched would seem to work fairly well with hardware. (However, the easy cases for software also tend to be the easy cases for hardware.)
In some cases system-specific optimizations can be applied, which would make timing variation less of a problem. This is not limited to embedded and HPC (and some cloud services might have HPC-like characteristics): if the software distribution format is at a higher level than machine language, install-time or load-time optimization may be practical (even some dynamic optimization in some cases), and for some code providing selectable code variants may be practical.
Prefetching can also be useful even if it does not cover 100% of the access latency, so some timing variation can be tolerated.
It seems that the greatest difficulty is developing an architecture that facilitates software cooperating with hardware. Working within a virtualized framework (where the resources available to a task are not predictable) makes such software optimization more difficult, and existing systems do not seem to have good interfaces for cooperatively managing resources. (This lack of cooperation applies to energy use as well as performance.)
There is presumably a prefetching analogue to "cache-oblivious" algorithms and data structures (which are really aware of generic cache structure rather than of a specific cache's parameters).
Fully system-aware optimization has greater potential, gaining performance at the cost of complexity and reduced flexibility. I think such optimization could (in a practical sense) be more broadly applied, but, like profile-guided and whole-program optimization, I suspect it is not likely to be done even when the effort would be justified.