By: Patrick Chase (patrickjchase.delete@this.gmail.com), July 2, 2013 10:35 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 1, 2013 1:40 pm wrote:
> Actual traditional array-based high-intensity FP is often fairly easy to schedule by
> the compiler (and doing cacheline blocking etc is more important than the FPU
> scheduling),
This is very true, and I'll add another prerequisite: Alias disambiguation.
Modern compilers can indeed schedule regular code (such as you posit above) very efficiently for an in-order core, but only if they can hoist loads above stores to "create" ILP in the load shadows. Properly using restrict (or __restrict__ for pre-C99 gcc) can make a huge difference, often a factor of 5 or more, provided the cache-blocking is done right to begin with (if the access pattern isn't cache-friendly then aliasing is the least of your problems).
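To make that concrete, here's a minimal sketch (the function and names are mine, not anything from the thread): without the restrict qualifiers the compiler must assume dst may alias a or b, and so can't move the loads above the store.

#include <stddef.h>

/* Without restrict, the store to dst[i] could alias a or b, so the
 * compiler can't hoist later loads above it. With restrict, the loads
 * from a and b can be scheduled well ahead of the stores, hiding load
 * latency on an in-order FPU. */
void axpy(size_t n, double alpha,
          const double *restrict a,
          const double *restrict b,
          double *restrict dst)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = alpha * a[i] + b[i];
}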
Silvermont's load/store pipeline is OoO even though the FP side is not, which helps in cases where the store addresses can be computed far enough ahead of time. Even so, I suspect that FP-intensive code will need compile-time load hoisting for peak performance.
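For an in-order FP pipe the classic transformation is software pipelining: start the load for iteration i+1 before consuming iteration i's value. A hand-rolled sketch of the idea (illustrative only; a good compiler will do this itself given restrict):

#include <stddef.h>

/* Fetch the next iteration's operand before using the current one,
 * so the load's latency overlaps the FP multiply. */
void scale_pipelined(size_t n, double k,
                     const double *restrict src,
                     double *restrict dst)
{
    if (n == 0) return;
    double cur = src[0];
    for (size_t i = 0; i + 1 < n; i++) {
        double next = src[i + 1];  /* load hoisted above the store */
        dst[i] = k * cur;
        cur = next;
    }
    dst[n - 1] = k * cur;
}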
> and the arguably more common kind of real FPU use (which follows pointers and has
> fairly sparse arrays rather than being some unrealistic pure linpack load) is
> generally better off with the effort spent on integer and memory units.
Examples? I know of many such loads (they arise all over the place in HPC and some
areas of imaging) but none that I'd describe as "common" for desktop/mobile use.
> Integer vector units are often more useful, although the bulk of their use seems
> to be for things like crypto and memory copies, which are really just specialized
> engines that need some register space.
Also imaging, where 16- or 32-bit integer math is often sufficient.
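For instance (my own sketch, not from the thread), a saturating brighten over 16-bit pixels maps directly onto SSE2 integer ops, which Silvermont-class cores have:

#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Add a constant bias to 16-bit pixels with unsigned saturation,
 * eight pixels per instruction. Assumes n is a multiple of 8 for
 * brevity. */
void brighten_u16(uint16_t *px, size_t n, uint16_t bias)
{
    const __m128i vbias = _mm_set1_epi16((short)bias);
    for (size_t i = 0; i < n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(px + i));
        v = _mm_adds_epu16(v, vbias);  /* PADDUSW: saturating add */
        _mm_storeu_si128((__m128i *)(px + i), v);
    }
}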
> Most
> of the things that used to use vector units for actual vectors seem to be happier
> using the GPU (ie video decoding and encoding or things like photoshop effects may
> well use a vector unit, but if you can, you're generally even better off just using
> the GPU entirely and skip the vector unit).
This depends on the level of vector parallelism in the workload. For something like AVX you need tens of independent operations in flight to keep the CPU's vector pipes busy; for a GPU you need tens of thousands of parallel work-items to hide its latencies and amortize the offload overhead.
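To give a feel for the CPU end of that scale (a sketch of mine, assuming AVX): two unrolled 8-wide operations already put a few tens of scalar FLOPs in flight per loop trip, which is roughly all the parallelism the vector unit can absorb.

#include <immintrin.h>  /* AVX */
#include <stddef.h>

/* 8-wide SAXPY, unrolled 2x: 2 iterations * 8 lanes * 2 ops (mul+add)
 * = 32 scalar FP operations in flight per loop trip. Assumes n is a
 * multiple of 16 for brevity. */
void saxpy_avx(size_t n, float a,
               const float *restrict x, float *restrict y)
{
    const __m256 va = _mm256_set1_ps(a);
    for (size_t i = 0; i < n; i += 16) {
        __m256 x0 = _mm256_loadu_ps(x + i);
        __m256 x1 = _mm256_loadu_ps(x + i + 8);
        __m256 y0 = _mm256_loadu_ps(y + i);
        __m256 y1 = _mm256_loadu_ps(y + i + 8);
        _mm256_storeu_ps(y + i,     _mm256_add_ps(y0, _mm256_mul_ps(va, x0)));
        _mm256_storeu_ps(y + i + 8, _mm256_add_ps(y1, _mm256_mul_ps(va, x1)));
    }
}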