By: Michael S (already5chosen.delete@this.yahoo.com), May 21, 2022 10:57 am
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 20, 2022 10:57 pm wrote:
> Apologies, 'pre' tag in quoted text messed up formatting. Re-posting:
>
> > Michael S (already5chosen.delete@this.yahoo.com) on May 20, 2022 5:51 am wrote:
> > > -.- (blarg.delete@this.mailinator.com) on May 20, 2022 3:55 am wrote:
> > > > Why not something like:
> > > > #ifdef __AVX512F__
> > > > # define _mm(f) _mm512_##f
> > > > # define __mfloat __m512
> > > > # include "your-code-file.c"
> Oh interesting, that would work in C as well. We do something similar (re-including
> the user code) but rely on C++ function overloading. More information in case you're
> interested: https://github.com/google/highway/blob/master/g3doc/impl_details.md
>
> > Retains 100% ISA functionality unlike other SIMD abstraction layers.
> This is a bit optimistic :) For example, anything involving masks on AVX-512 is different.
>
> > Now, after reading the rest of Jan's posts, I am starting
> > to believe that in his case it is indeed that simple,
> > but only because he and his co-workers turned potentially
> > compute-bounded problem into LS bounded, losing in
> > the process factor of 2 of potential performance (2 at best,
> > if inner-loop's data set still fits in L1D, otherwise
> > the factor is bigger than 2) for sake of portability and of simplification their own work.
> Yes, it's only that simple because we have invested in the infrastructure to make it so :)
> Agreed, engineering time is usually a major constraint, and portability
> is a requirement. I'm not sure why you see a >= 2x slowdown, though:
>
> 1) compared with not having SIMD (on platforms where we couldn't justify
> hand-written arch-specific code), any kind of SIMD is a big win.
> 2) Porting existing x86 intrinsics to Highway has been at worst perf-neutral, and often better
> (when we can transparently use wider vectors, such as in the equivalent of strchr).
My point of comparison was not "x86 intrinsics applied to the existing 2048-bit data layout", but "x86 intrinsics applied to a data layout designed specifically for AVX2 or a similar 256-bit SIMD ISA, and for the L1D and L2 cache sizes typical of modern x86 client CPUs".
I wanted to see what exactly JPEG XL is and which parts of the codec consume the majority of the time, but problems with the site's certificate stopped me.
Without knowing the details, I would have guessed that the majority of the time on the compress side is consumed by entropy coding, which is not SIMD-friendly, so the performance of the SIMD-friendly parts (DCT?) is relatively unimportant. That means you are right not to spend many engineering hours on that part, but it also means that even if that part were not vectorized at all, it wouldn't slow down the codec as a whole all that much. But guesses made without knowing the details are of little value :(