By: Jan Wassenberg (jan.wassenberg.delete@this.gmail.com), May 22, 2022 4:06 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on May 22, 2022 1:40 am wrote:
> You see! I didn't even know that JPEG XL has something called "video mode".
:) Video is not really a special mode, it's a feature of the container more than the codec, and is intended for short animations.
> As to IDCT, if you say it's 32x32 then I'd like to look into details of your implementation.
Sure, it's open-source: https://github.com/libjxl/libjxl/blob/main/lib/jxl/dct-inl.h
It actually supports a much wider range, from less than 8x8 to 256x256.
> If it's really done as you said above, i.e. you interleave 64 32x32 blocks
Ah, there is a misunderstanding. I mentioned that I would have liked to try that; what we actually did is a 1x64 layout: the DC coefficient first, then all AC coefficients of an 8x8 block, stored consecutively. This requires us to mask out the DC lane in most processing, and some caution is required with wider vectors, but it works decently.
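To make the layout concrete, here is a minimal scalar sketch of what "mask out the DC lane" could look like. It is not taken from libjxl: the names `kCoeffsPerBlock` and `ScaleAC`, the `lanes` parameter, and the choice of a simple scaling operation are all assumptions for illustration. The inner loop stands in for one SIMD vector, so the DC mask depends on where that vector falls within the 64-coefficient block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constant: one 8x8 block stored as 64 consecutive
// coefficients, DC at index 0 followed by the 63 AC coefficients.
constexpr std::size_t kCoeffsPerBlock = 64;

// Hypothetical helper: scales only the AC coefficients of every block.
// `lanes` models the SIMD vector width; the inner loop stands in for one
// vector operation with a per-lane select.
void ScaleAC(std::vector<float>& coeffs, float scale, std::size_t lanes) {
  assert(lanes != 0);
  assert(coeffs.size() % kCoeffsPerBlock == 0);
  assert(coeffs.size() % lanes == 0);
  for (std::size_t base = 0; base < coeffs.size(); base += lanes) {
    for (std::size_t lane = 0; lane < lanes; ++lane) {
      const std::size_t i = base + lane;
      // A lane holds DC only when its global index is a multiple of 64.
      const bool is_dc = (i % kCoeffsPerBlock) == 0;
      // Select: keep DC unchanged, scale everything else.
      coeffs[i] = is_dc ? coeffs[i] : coeffs[i] * scale;
    }
  }
}
```

With `lanes` up to 64, at most one lane per vector holds a DC coefficient, and only in the first vector of each block. With `lanes` above 64, a single vector would span several blocks and contain several DC lanes, which is one plausible reading of the caution needed with wider vectors.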
> Do you have some absolute performance figures for your all-cores
> IDCT throughput on one of CPUs from above-mentioned class?
I do not, and have switched teams, but IIRC it was possible to decode 4K60p with a sufficient number of cores.