By: Michael S (already5chosen.delete@this.yahoo.com), May 22, 2022 1:40 am
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 21, 2022 11:47 pm wrote:
> > My point of comparison was not "x86 intrinsics applied to existing 2048-bit data layout", but "x86
> > intrinsics applied to data layout designed specifically for AVX2 or for similar 256-bit SIMD ISA
> > and for size of L1D and L2 caches typical on modern x86 client CPUs".
> OK. FWIW I acknowledge that highly-tuned code is by definition not performance-portable
> and will of course do better on the CPU for which it is tuned.
>
> > I wanted to see what exactly is JPEG XL and which parts of the codec consume majority of time [..]
> > I would have guessed that the majority of time on the compress side is consumed
> > by entropy coding, which is not SIMD-friendly.
> From a quick look at previous VTune results, you are correct that a large part (30-40%) is
> the entropy coder. We could have vectorized ANS but the context model is harder. Still, 32x32
> IDCT and image processing (two kinds of filters, and applying 'patches', which might be
> called intra-block copy on the video side) are each only slightly less. Without SIMD, those
> would dominate decoding time, or rather would have been too expensive to be affordable.
You see! I didn't even know that JPEG XL has something called a "video mode".
As to the IDCT, if you say it's 32x32 then I'd like to look into the details of your implementation.
If it's really done as you said above, i.e. you interleave 64 32x32 blocks and then process them together, then I am not at all sure that it is faster than just doing things scalarly, one 32x32 block at a time.
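For concreteness, here is a minimal sketch of what such an interleaved layout would look like; the names and the butterfly step are mine, not libjxl's. The point is that coefficient (i, j) of all 64 blocks sits contiguously, so the innermost loop over blocks is trivially vectorizable:

```c
#include <stddef.h>

/* Hypothetical interleaved layout: coefficient (i, j) of all 64
 * blocks is stored contiguously, so one vector load touches 64
 * different blocks rather than 64 coefficients of one block. */
enum { N = 32, BLOCKS = 64 };

/* coef[i][j][b] = coefficient (i, j) of block b */
typedef float Interleaved[N][N][BLOCKS];

/* One butterfly step of a 1-D transform along column j, applied to
 * all 64 blocks at once; the loop over b maps directly onto SIMD
 * lanes and is easy for a compiler to auto-vectorize. */
static void butterfly(Interleaved c, int i0, int i1, int j) {
  for (size_t b = 0; b < BLOCKS; b++) {
    float a = c[i0][j][b];
    float d = c[i1][j][b];
    c[i0][j][b] = a + d;
    c[i1][j][b] = a - d;
  }
}
```

The scalar alternative would instead keep each 32x32 block contiguous and transform one block at a time, trading vector width for a working set that stays inside L1D.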
Especially if we consider all-core throughput on Intel client CPUs of the few previous generations, i.e. Haswell, Broadwell, and the dozen or so Skylake Client variants sold under different code names.
The big potential problem here is that 32*32*64*4 bytes = 256 KB, i.e. exactly the size of the L2 cache on these extremely popular CPUs. Together with twiddle factors, it either barely fits, which is tolerable, or does not fit, which means goodbye to any sort of scaling with the number of cores.
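The arithmetic above can be written out explicitly (the function name is mine, for illustration only):

```c
#include <stddef.h>

/* Working set of 64 interleaved 32x32 blocks of 32-bit coefficients:
 * 32*32*64*4 bytes = 262144 bytes = 256 KB, i.e. the full L2 of a
 * Haswell/Broadwell/Skylake Client core. Twiddle factors and any
 * scratch buffers come on top of this. */
static size_t idct_working_set_bytes(void) {
  return (size_t)32 * 32 /* coefficients per block */
       * 64              /* interleaved blocks */
       * 4;              /* bytes per float */
}
```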
Do you have any absolute performance figures for your all-core IDCT throughput on one of the CPUs from the above-mentioned class?