By: Michael S (already5chosen.delete@this.yahoo.com), April 19, 2017 1:49 am
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on April 18, 2017 10:04 pm wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on April 18, 2017 8:13 pm wrote:
>
> We don't have to speculate as to the value of software directed prefetch, we have some data:
> https://www.cl.cam.ac.uk/~sa614/papers/Software-Prefetching-CGO2017.pdf
>
> "novel [LLVM] compiler pass to automatically generate software prefetches for indirect memory accesses,
> a special class of irregular memory accesses often seen in high-performance workloads...
>
They are attacking a special class of irregular memory accesses.
They call it "indirect"; I'd rather call it "single-indirect". Not an uncommon class, yes, but less common than either "direct" or "pointer-chasing indirect".
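To make the distinction concrete, here is a minimal C sketch of the three classes (all names are illustrative, not from the paper):

#include <stddef.h>

typedef struct node { struct node *next; long value; } node_t;

/* "Direct": the address is an affine function of the loop counter;
   hardware stride prefetchers already handle this well. */
long sum_direct(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* "Single-indirect": one extra lookup through an index array -- the
   paper's target class. a[b[i]] is data-dependent, but b[i] itself is
   a streamable direct access, so it can be fetched ahead of time. */
long sum_single_indirect(const long *a, const int *b, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[b[i]];
    return sum;
}

/* "Pointer-chasing indirect": each address depends on the previous
   load; nothing independent is available inside the loop to prefetch. */
long sum_chase(const node_t *p) {
    long sum = 0;
    for (; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}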
> Across a set of memory-bound benchmarks, our automated pass achieves average speedups of 1.3×
> and 1.1× for an Intel Haswell processor and an ARM Cortex-A57, both out-of-order cores, and performance
> improvements of 2.1× and 2.7× for the in-order ARM Cortex-A53 and Intel Xeon Phi."
>
The critical point here is that those are single-threaded gains.
They did multi-threaded experiments, but only on Haswell and only up to 4 cores/1 thread per core. At 4 cores the gain looks like 5-6% rather than 30%. It's very probable that at 4 cores/2 threads per core the gain would turn into a degradation.
Maybe they did multi-threaded experiments on the other platforms too, but decided that presenting the results wouldn't improve the publishability of the paper? ;)
Also, the good gains on the OoO cores came mostly from the HJ-2 and HJ-8 tests, both of which are rather artificial. It is hard to judge whether they represent anything real or not.
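For reference, what the pass does to a single-indirect loop looks roughly like the hand-written sketch below, using the GCC/Clang __builtin_prefetch intrinsic. DIST is a hypothetical look-ahead distance (the paper derives it per target), and the two-stage scheme -- index array fetched further ahead than the indirect target -- follows their description:

#include <stddef.h>

/* DIST is a hypothetical look-ahead distance; the real pass tunes it
   per microarchitecture. b[] is prefetched two stages ahead, the
   indirect target a[b[...]] one stage ahead. The bounds checks keep
   the genuine load of b[i + DIST] from running past the array. */
#define DIST 32

long sum_with_prefetch(const long *a, const int *b, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 2 * DIST < n)
            __builtin_prefetch(&b[i + 2 * DIST], 0, 0);
        if (i + DIST < n)
            __builtin_prefetch(&a[b[i + DIST]], 0, 0);
        sum += a[b[i]];
    }
    return sum;
}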
> I'd say that qualifies as:
> - we *probably* need to provide SW-prefetch facilities that work for low-power (ie in-order) cores
> (though the paper does not discuss energy. Especially if the point of using the A53 or Zephyr is
> to save power more than to run fast, an excess of prefetches that burn more power than they save
> [eg because the CPU can halt very efficiently waiting on RAM] may not be interesting...)
> Certainly for throughput engines, the issue is unequivocal.
>
> - the win on Haswell probably tells us something about the limit of what's available beyond
> reasonable hardware prefetching. 10% is obviously a "hell yes" for something that's easily
> implemented and has all the hard work done in the compiler. It also suggests
>
> + bounds to what could further be done on the pure HW side. (Though I remain uncertain as to how good
> a job existing CPUs do of co-ordinating all the information that is cheaply available so as to ensure
> that the L1, L2 and L3 prefetchers are all on the same page. I suspect there is scope there for more
> efficient prefetching --- more timely, less overhead --- albeit not for more coverage).
>
> + perhaps you can get most of the benefit of prefetching with less hardware (and less power?)
> if you're willing to simply toss overboard anyone who isn't willing to use the compiler
> properly? This has obvious interesting implications most especially for Apple.
> Certainly it suggests that rather than smart double-indirect hardware prefetchers, as has been suggested, what
> we really want is something like a PREFETCH_INDIRECT RA+RB instruction which is immediately cracked to
> - LOAD_AND_IGNORE_PROTECTION_ETC RA+RB into non-user-visible-register RC
> - PREFETCH_IF_NOT_NULL RC
> and then have the compiler handle the problem. Obviously this sort of solution works
> a whole lot better when you have control of the compiler and, even better, a mechanism
> for just-in-time delivery of a per-micro-architecture optimized binary, something that
> is in fact true'ish and getting truer every day on the Apple/ARM side.
IMHO, the technique is too brittle for any form of static/AOT compilation. Maybe something of this sort could work for a JIT.
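To illustrate the brittleness: the profitable look-ahead distance depends on the target's memory latency and its capacity to overlap misses, so an AOT binary has to bake in one compromise constant for A53, A57 and Haswell alike, while a JIT could pick it for the core it actually lands on. A sketch, with best_prefetch_distance() being a hypothetical run-time query:

#include <stddef.h>

/* best_prefetch_distance() is hypothetical: a JIT (or a startup
   calibration loop) could resolve it per core; an AOT binary cannot. */
extern size_t best_prefetch_distance(void);

long sum_tuned(const long *a, const int *b, size_t n) {
    const size_t dist = best_prefetch_distance();
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[b[i + dist]], 0, 0);
        sum += a[b[i]];
    }
    return sum;
}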
>
> I'm not sure that it's fair to bemoan a lack of software-hardware co-optimization. I think
> that co-optimization is very present on the Apple side, and somewhat present on the ARM/Android
> side (through the combination of LLVM and ART specializing to the target CPU). It's more
> something to bemoan on the x86+Windows side than a generic feature of the world.