By: noko (noko.delete@this.noko.com), May 23, 2022 12:29 pm
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on May 21, 2022 6:59 pm wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on May 21, 2022 10:57 am wrote:
> > Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 20, 2022 10:57 pm wrote:
> > > Apologies, 'pre' tag in quoted text messed up formatting. Re-posting:
> > >
> > > > Michael S (already5chosen.delete@this.yahoo.com) on May 20, 2022 5:51 am wrote:
> > > > > -.- (blarg.delete@this.mailinator.com) on May 20, 2022 3:55 am wrote:
> > > > > > Why not something like:
> > > > > > #ifdef __AVX512F__
> > > > > > # define _mm(f) _mm512_##f
> > > > > > # define __mfloat __m512
> > > > > > # include "your-code-file.c"
> > > Oh interesting, that would work in C as well. We do something similar (re-including
> > > the user code) but rely on C++ function overloading. More information in case you're
> > > interested: https://github.com/google/highway/blob/master/g3doc/impl_details.md
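For anyone curious, the token-pasting trick in that snippet can be demonstrated without any actual AVX-512 hardware. In the sketch below (all names made up, not from Highway), two scalar structs stand in for real vector registers, and the "user code" is a macro instead of a re-included file, so it compiles anywhere; the dispatch mechanism is the same:

```c
#include <assert.h>

/* Sketch of the token-pasting dispatch from the quoted snippet.
   Real code would re-include "your-code-file.c" once per target;
   here the kernel body is a macro, and two scalar structs stand in
   for __m128/__m256 so this compiles without any SIMD headers. */

typedef struct { float v[4]; } v4;
typedef struct { float v[8]; } v8;

static v4 v4_add(v4 a, v4 b) {
  for (int i = 0; i < 4; ++i) a.v[i] += b.v[i];
  return a;
}
static v8 v8_add(v8 a, v8 b) {
  for (int i = 0; i < 8; ++i) a.v[i] += b.v[i];
  return a;
}

/* The "user code": written once against the _mm()/__mfloat macros. */
#define KERNEL_BODY \
  static __mfloat KERNEL(__mfloat a, __mfloat b) { return _mm(add)(a, b); }

/* "Compile" it for a 128-bit target... */
#define _mm(f)   v4_##f
#define __mfloat v4
#define KERNEL   kernel_128
KERNEL_BODY
#undef _mm
#undef __mfloat
#undef KERNEL

/* ...and again for a 256-bit target; macro expansion is lazy, so
   KERNEL_BODY picks up whichever definitions are current. */
#define _mm(f)   v8_##f
#define __mfloat v8
#define KERNEL   kernel_256
KERNEL_BODY
#undef _mm
#undef __mfloat
#undef KERNEL
```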
> > >
> > > > Retains 100% ISA functionality unlike other SIMD abstraction layers.
> > > This is a bit optimistic :) For example, anything involving masks on AVX-512 is different.
> > >
> > > > Now, after reading the rest of Jan's posts, I am starting
> > > > to believe that in his case it is indeed that simple,
> > > > but only because he and his co-workers turned a potentially
> > > > compute-bound problem into a load/store-bound one, losing
> > > > a factor of 2 of potential performance in the process (2 at best,
> > > > if the inner loop's data set still fits in L1D; otherwise
> > > > the factor is bigger than 2) for the sake of portability and of simplifying their own work.
> > > Yes, it's only that simple because we have invested in the infrastructure to make it so :)
> > > Agreed, engineering time is usually a major constraint, and portability
> > > is a requirement. I'm not sure why you see a >= 2x slowdown, though:
> > >
> > > 1) compared with not having SIMD (on platforms where we couldn't justify
> > > hand-written arch-specific code), any kind of SIMD is a big win.
> > > 2) Porting existing x86 intrinsics to Highway has been at worst perf-neutral, and often better
> > > (when we can transparently use wider vectors, such as in the equivalent of strchr).
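The "wider vectors for free" point can be illustrated even without intrinsics. In the sketch below (not Highway's actual code; names are made up), a memchr-style scan is written against an abstract chunk, with a SWAR `uint64_t` standing in for a real 128/256/512-bit register; only the chunk width would change on a wider target:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* memchr-style scan over word-sized chunks, with a scalar tail.
   A real vector version would swap uint64_t for a vector register
   and the bit trick for a compare + movemask; the loop shape stays
   the same, which is why wider registers come "for free". */
static const void *memchr_swar(const void *s, int c, size_t n) {
  const unsigned char *p = (const unsigned char *)s;
  const uint64_t ones = 0x0101010101010101ull;
  uint64_t pat = ones * (unsigned char)c;   /* broadcast c to all bytes */
  while (n >= 8) {
    uint64_t w;
    memcpy(&w, p, 8);                       /* safe unaligned load */
    uint64_t x = w ^ pat;                   /* matching bytes become zero */
    /* classic "has a zero byte" test (Bit Twiddling Hacks) */
    if ((x - ones) & ~x & 0x8080808080808080ull)
      break;                                /* match in this chunk: finish scalar */
    p += 8;
    n -= 8;
  }
  for (; n; ++p, --n)
    if (*p == (unsigned char)c) return p;
  return NULL;
}
```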
> >
> > My point of comparison was not "x86 intrinsics applied to existing 2048-bit data layout", but "x86
> > intrinsics applied to data layout designed specifically for AVX2 or for similar 256-bit SIMD ISA
> > and for size of L1D and L2 caches typical on modern x86 client CPUs".
> >
> > I wanted to see what exactly JPEG XL is and which parts of the codec consume the majority of
> > time, but problems with the site's certificate stopped me.
> >
> > Without knowing the details, I would have guessed that the majority of time on the compress side
> > is consumed by entropy coding, which is not SIMD-friendly, so the performance of the SIMD-friendly
> > parts (DCT?) is relatively unimportant. That means you are right not to spend many engineering
> > hours on that part, but it also means that even if this part is not vectorized at all, it wouldn't
> > slow down the codec as a whole all that much. But guesses made without knowing the details are of little value :(
>
> Maybe yes, maybe no, the devil is in the details.
> JPEG XL has an option for ANS as the entropy coding and Apple's
> version of NEON has ANS accelerator instructions...
> https://patents.google.com/patent/US20210072994A1
Isn't this SVE2's BEXT/BDEP?
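For context on why ANS (and entropy coding generally) resists straightforward vectorization: every decode step consumes the state the previous step produced, so without hardware help or interleaved independent streams there is nothing to run in parallel. A toy rANS round trip (hypothetical 2-symbol alphabet, no renormalization or byte stream) makes the serial chain visible:

```c
#include <stdint.h>

/* Toy rANS over a 2-symbol alphabet with total M = 16 and
   frequencies {12, 4}. Real coders renormalize the state into a
   byte stream; this sketch keeps everything in one uint64_t. */
enum { M = 16 };
static const uint32_t freq[2] = { 12, 4 };
static const uint32_t cum[2]  = { 0, 12 };

static uint64_t rans_encode(uint64_t x, int s) {
  return (x / freq[s]) * M + (x % freq[s]) + cum[s];
}

static uint64_t rans_decode(uint64_t x, int *s) {
  uint32_t slot = (uint32_t)(x % M);
  *s = slot >= cum[1];            /* 2 symbols: a single compare */
  return freq[*s] * (x / M) + slot - cum[*s];
}

/* rANS is a stack: encode symbols in reverse, decode them in order. */
static int roundtrip_ok(void) {
  const int syms[5] = { 0, 1, 0, 0, 1 };
  uint64_t x = M;                 /* initial state */
  for (int i = 4; i >= 0; --i)
    x = rans_encode(x, syms[i]);
  for (int i = 0; i < 5; ++i) {
    int s;
    x = rans_decode(x, &s);       /* serial: x feeds the next step */
    if (s != syms[i]) return 0;
  }
  return x == M;                  /* decoding ends at the initial state */
}
```

The bit-manipulation instructions in that patent (or SVE2's BEXT/BDEP) attack the per-step work, but the state-to-state dependency above is why decoders that want SIMD typically run several independent ANS streams side by side.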