By: --- (---.delete@this.redheron.com), May 21, 2022 6:59 pm
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on May 21, 2022 10:57 am wrote:
> Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 20, 2022 10:57 pm wrote:
> > Apologies, 'pre' tag in quoted text messed up formatting. Re-posting:
> >
> > > Michael S (already5chosen.delete@this.yahoo.com) on May 20, 2022 5:51 am wrote:
> > > > -.- (blarg.delete@this.mailinator.com) on May 20, 2022 3:55 am wrote:
> > > > > Why not something like:
> > > > > > #ifdef __AVX512F__
> > > > > # define _mm(f) _mm512_##f
> > > > > # define __mfloat __m512
> > > > > # include "your-code-file.c"
> > Oh interesting, that would work in C as well. We do something similar (re-including
> > the user code) but rely on C++ function overloading. More information in case you're
> > interested: https://github.com/google/highway/blob/master/g3doc/impl_details.md
> >
> > > Retains 100% ISA functionality unlike other SIMD abstraction layers.
> > This is a bit optimistic :) For example, anything involving masks on AVX-512 is different.
> >
> > > Now, after reading the rest of Jan's posts, I am starting
> > > to believe that in his case it is indeed that simple,
> > > but only because he and his co-workers turned a potentially
> > > compute-bound problem into an LS-bound one, losing in
> > > the process a factor of 2 of potential performance (2 at best,
> > > if the inner loop's data set still fits in L1D; otherwise
> > > the factor is bigger than 2) for the sake of portability and of simplifying their own work.
> > Yes, it's only that simple because we have invested in the infrastructure to make it so :)
> > Agreed, engineering time is usually a major constraint, and portability
> > is a requirement. I'm not sure why you see a >= 2x slowdown, though:
> >
> > 1) compared with not having SIMD (on platforms where we couldn't justify
> > hand-written arch-specific code), any kind of SIMD is a big win.
> > 2) Porting existing x86 intrinsics to Highway has been at worst perf-neutral, and often better
> > (when we can transparently use wider vectors, such as in the equivalent of strchr).
>
> My point of comparison was not "x86 intrinsics applied to existing 2048-bit data layout", but "x86
> intrinsics applied to data layout designed specifically for AVX2 or for similar 256-bit SIMD ISA
> and for size of L1D and L2 caches typical on modern x86 client CPUs".
>
> I wanted to see what exactly is JPEG XL and which parts of the codec consume majority of time, but
> problems with site certificate stopped me.
>
> Without knowing the details, I would have guessed that the majority of time on the compression side is consumed
> by entropy coding, which is not SIMD-friendly. So the performance of the SIMD-friendly parts (DCT?) is relatively
> unimportant. That means you are right not to spend many engineering hours on that part,
> but it also means that even if this part were not vectorized at all, it wouldn't slow down the codec as a
> whole all that much. But guesses made without knowing the details are of little value :(
Maybe yes, maybe no; the devil is in the details.
JPEG XL has an option to use ANS for entropy coding, and Apple's version of NEON has ANS accelerator instructions...
https://patents.google.com/patent/US20210072994A1
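
To make the "not SIMD-friendly" point concrete, here is a rough sketch of a generic byte-wise range ANS (rANS) coder in C. This is not JPEG XL's actual coder and not what the patent describes; the three-symbol model (FREQ, START), the constants, and the function names are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCALE_BITS 4u          /* hypothetical model: frequencies sum to 1 << 4 = 16 */
#define RANS_L (1u << 23)      /* lower bound of the normalized state interval */

/* Hypothetical 3-symbol model; START is the running sum of FREQ. */
static const uint32_t FREQ[3]  = { 8, 5, 3 };
static const uint32_t START[3] = { 0, 8, 13 };

/* Encode syms[0..n). rANS works like a stack: symbols are encoded in reverse
   and bytes are written backwards from buf_end. Returns the first stream byte. */
static uint8_t *rans_encode(const uint8_t *syms, size_t n, uint8_t *buf_end) {
    uint32_t x = RANS_L;       /* the single coder state */
    uint8_t *p = buf_end;
    assert(syms != NULL && buf_end != NULL);
    for (size_t i = n; i-- > 0; ) {
        uint32_t f = FREQ[syms[i]];
        uint32_t s = START[syms[i]];
        /* Renormalize: shift bytes out until the update below cannot overflow. */
        uint32_t x_max = ((RANS_L >> SCALE_BITS) << 8) * f;
        while (x >= x_max) { *--p = (uint8_t)(x & 0xffu); x >>= 8; }
        /* Core ANS step: the new state depends on the previous state. */
        x = ((x / f) << SCALE_BITS) + (x % f) + s;
    }
    /* Flush the final state, little-endian. */
    p -= 4;
    p[0] = (uint8_t)x;         p[1] = (uint8_t)(x >> 8);
    p[2] = (uint8_t)(x >> 16); p[3] = (uint8_t)(x >> 24);
    return p;
}

/* Decode n symbols from the stream starting at p. */
static void rans_decode(const uint8_t *p, uint8_t *out, size_t n) {
    uint32_t x = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    p += 4;
    for (size_t i = 0; i < n; ++i) {
        uint32_t slot = x & ((1u << SCALE_BITS) - 1u);
        uint8_t sym = 0;       /* find the symbol whose range contains slot */
        while (sym < 2 && slot >= START[sym] + FREQ[sym]) ++sym;
        out[i] = sym;
        /* Inverse of the encoder update, then pull bytes back in. */
        x = FREQ[sym] * (x >> SCALE_BITS) + slot - START[sym];
        while (x < RANS_L) x = (x << 8) | *p++;
    }
}
```

Every symbol reads and rewrites the same state x, so the loop carries a serial dependency from one symbol to the next. That is the part plain SIMD does not help with; the usual workaround is to interleave several independent states, and whether that (or dedicated instructions) is enough is exactly where the details matter.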