By: RichardC (tich.delete@this.pobox.com), November 18, 2020 4:42 pm
Room: Moderated Discussions
Dummond D. Slow (mental.delete@this.protozoa.us) on November 18, 2020 11:06 am wrote:
> I already said that elsewhere: autovectorization mostly fails on the kind of integer SIMD routines the
> encoders use. It is generally considered not remotely usable for encoders. One reason is that to get
> the performance, you usually can't just SIMDify naively, it needs some restructuring and transformation
> of the computation to get the kinds of speedups assembly does. It's probably not the only factor.
In general, to get close to the highest possible performance, you may need to change
the data layout to match the access patterns needed for the particular ISA to work
well in the critical inner loops.
But the C and C++ language standards specify the data layout of classes/structs and
arrays, so a compiler simply isn't allowed to do the kinds of layout optimizations that are needed.
The standards also constrain the sequence of operations in awkward ways, though there are
often kludges to get round that.
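To make the layout point concrete, here is a minimal sketch in C. All the names in it (struct pixel, add_red_aos, add_plane_soa, layouts_agree) are made up for illustration and not taken from any real encoder. The first loop works on an array of structs, where the member order is fixed by the language, so the red bytes are strided. The second works on a planar layout that the programmer had to create by hand, and uses restrict as one of the usual kludges for telling the compiler that the pointers don't alias. Whether a given compiler actually vectorizes either loop depends on the compiler, flags and target, so treat this as an illustration of the constraint rather than a benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating 8-bit add, a typical integer op in pixel-processing kernels. */
static inline uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + b;
    return (uint8_t)(s > 255u ? 255u : s);
}

/* Array-of-structs: members stay in declaration order, so the r bytes
   of consecutive pixels are typically 4 bytes apart in memory. */
struct pixel { uint8_t r, g, b, a; };

void add_red_aos(struct pixel *p, size_t n, uint8_t delta) {
    for (size_t i = 0; i < n; i++)
        p[i].r = sat_add_u8(p[i].r, delta);   /* strided access */
}

/* Struct-of-arrays, rearranged by hand: one contiguous plane per channel.
   restrict is the programmer's promise that dst and src do not overlap. */
void add_plane_soa(uint8_t *restrict dst, const uint8_t *restrict src,
                   size_t n, uint8_t delta) {
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = sat_add_u8(src[i], delta);   /* unit-stride access */
}

/* Returns 1 if both layouts produce the same red values on a small input. */
int layouts_agree(void) {
    struct pixel px[4] = {{250, 1, 2, 3}, {0, 4, 5, 6},
                          {100, 7, 8, 9}, {245, 0, 0, 0}};
    uint8_t red_in[4], red_out[4];
    for (size_t i = 0; i < 4; i++)
        red_in[i] = px[i].r;                  /* manual AoS -> SoA copy */
    add_red_aos(px, 4, 10);
    add_plane_soa(red_out, red_in, 4, 10);
    for (size_t i = 0; i < 4; i++)
        if (px[i].r != red_out[i]) return 0;
    return 1;
}
```

The two functions compute the same thing, but only the programmer was in a position to change struct pixel into separate planes. The compiler has to honour the declared layout wherever it is observable.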
Autovectorization is inescapably half-assed. To do a really good job of optimization,
you would need to start from a specification in a language with a higher level of
abstraction, one which doesn't over-constrain the data layout.
This is precisely the kind of thing that happens in generating optimized code for
database query evaluation or ML inference, where you're starting from a much more
abstract specification of the computation.