By: Michael S (already5chosen.delete@this.yahoo.com), May 22, 2022 5:12 am
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 22, 2022 4:06 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on May 22, 2022 1:40 am wrote:
> > You see! I didn't even know that JPEG XL has something called "video mode".
> :) Video is not really a special mode, it's a feature of the container
> more than the codec, and is intended for short animations.
>
> > As to IDCT, if you say it's 32x32 then I'd like to look into details of your implementation.
> Sure, it's open-source: https://github.com/libjxl/libjxl/blob/main/lib/jxl/dct-inl.h
> It actually supports a much wider range, from less than 8x8 to 256x256.
>
> > If it's really done as you said above, i.e. you interleave 64 32x32 blocks
> Ah, there is a misunderstanding. I mentioned I would have liked to try that, what we actually
> did is a 1x64 layout, so the DC coefficient first, then all AC coefficients of an 8x8
> block are stored consecutively. This requires us to mask out the DC lane in most processing,
> and some caution is required with wider vectors, but it works decently.
>
> > Do you have some absolute performance figures for your all-cores
> > IDCT throughput on one of CPUs from above-mentioned class?
> I do not, and have switched teams, but IIRC it was possible to decode 4K60p with a sufficient number of cores.
Let's do back-of-the-envelope math.
4K at 60 fps ~= 500 Mpixel/s. Ignoring color for the sake of brevity, that means ~500,000 32x32 blocks per second.
A 32x32 DCT/IDCT ~= 13,000 FLOP. 500,000*13,000 = 6.5 GFLOP/s, i.e. slightly less than the *scalar* performance of 1 Intel client core.
Now, if "sufficient number of cores" == 4, we want something a little better than a scalar IDCT, because 25% of CPU resources sounds significant. But if "sufficient number of cores" == 8, then we shouldn't bother with SIMD, at least for the IDCT part, because 12% sounds insignificant.
For 8x8 IDCT my answer is "do not bother with SIMD" even for 4 cores.
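For anyone who wants to replay the arithmetic above with their own numbers, here is a minimal sketch. The 13,000 FLOP per 32x32 block is the figure from this post. The 3840x2160 frame size and the 7 GFLOP/s scalar budget per core are my assumptions for illustration (the post only says the load is "slightly less" than one core), and the function names are made up.

```python
# Back-of-the-envelope IDCT load for 4K60p, luma only (color ignored as in the post).
WIDTH, HEIGHT, FPS = 3840, 2160, 60   # assumption: "4K" means 3840x2160
BLOCK = 32                            # 32x32 transform blocks
FLOP_PER_BLOCK = 13_000               # figure quoted in the post
CORE_SCALAR_GFLOPS = 7.0              # assumption: rough scalar budget of one core


def idct_gflops(width=WIDTH, height=HEIGHT, fps=FPS,
                block=BLOCK, flop_per_block=FLOP_PER_BLOCK):
    """Return (pixels/s, blocks/s, GFLOP/s) needed for the IDCT stage."""
    pixels_per_s = width * height * fps
    blocks_per_s = pixels_per_s / (block * block)
    gflops = blocks_per_s * flop_per_block / 1e9
    return pixels_per_s, blocks_per_s, gflops


def cpu_share(cores, core_gflops=CORE_SCALAR_GFLOPS):
    """Fraction of the whole CPU's scalar throughput spent on a scalar IDCT."""
    _, _, gflops = idct_gflops()
    return gflops / (cores * core_gflops)


if __name__ == "__main__":
    pixels, blocks, gflops = idct_gflops()
    print(f"{pixels / 1e6:.0f} Mpixel/s, {blocks:.0f} blocks/s, {gflops:.2f} GFLOP/s")
    for cores in (4, 8):
        print(f"{cores} cores: {cpu_share(cores):.1%} of scalar throughput")
```

With exact rather than rounded inputs the load comes out a bit under the 6.5 GFLOP/s above, which does not change the conclusion.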