By: Jan Wassenberg (jan.wassenberg.delete@this.gmail.com), May 21, 2022 11:47 pm
Room: Moderated Discussions
> My point of comparison was not "x86 intrinsics applied to existing 2048-bit data layout", but "x86
> intrinsics applied to data layout designed specifically for AVX2 or for similar 256-bit SIMD ISA
> and for size of L1D and L2 caches typical on modern x86 client CPUs".
OK. FWIW I acknowledge that highly-tuned code is by definition not performance-portable and will of course do better on the CPU for which it is tuned.
> I wanted to see what exactly is JPEG XL and which parts of the codec consume majority of time [..]
> I would had guessed that majority of time on compress side consumed
> by entropy coding which is not SIMD-friendly.
From a quick look at previous VTune results, you are correct that a large part (30-40%) is the entropy coder. We could have vectorized ANS, but the context model is harder. Still, 32x32 IDCT and image processing (two kinds of filters, and applying 'patches', which might be called intra-block copy on the video side) are each only slightly less. Without SIMD, those would dominate decoding time, or rather would have been too expensive to be affordable.