By: Jan Wassenberg (jan.wassenberg.delete@this.gmail.com), May 20, 2022 1:34 am
Room: Moderated Discussions
> May be, for your jobs it is a good layout, for many of my jobs it's too coarse.
> For example, one of my important workloads is decomposition of Hermitian matrices with N in range
> 50 to 100.
Makes sense that unused vector lanes are problematic.
I wonder if it's feasible to 'transpose' your algorithm such that one vector stores element [0,0] from each of several matrices (we don't care exactly how many), the next vector stores element [1,0] from each of those same matrices, etc.?
This assumes you're processing batches of matrices. The advantage would be that you really don't care about the vector length. This is pretty much required if you ever want to run on SVE or RISC-V V. I wish we had done this in JPEG XL, but the 1x64 memory layout was already too firmly established to change early on. (Still, it gives us 2048-bit blocks, which is plenty for AVX-512 and enough for SVE, so it's not a terrible layout.)
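To make the 'transposed' layout concrete, here is a rough sketch rather than a tuned implementation. It makes several assumptions that are mine and not from your workload: the matrices are real symmetric positive definite (a Hermitian version would need complex arithmetic), the decomposition is a plain Cholesky, and the names BatchedMatrices and BatchedCholesky are made up for illustration. The point is only that the innermost loop runs over the batch index, so the same code works for any vector length and a compiler (or a length-agnostic SIMD library) can vectorize it without unused lanes.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical batched container: element (i, j) of matrix b lives at
// data[(i * n + j) * batch + b], so the same element of consecutive
// matrices is contiguous in memory.
struct BatchedMatrices {
  size_t n;      // dimension of each (square) matrix
  size_t batch;  // number of matrices stored side by side
  std::vector<float> data;

  BatchedMatrices(size_t n_, size_t batch_)
      : n(n_), batch(batch_), data(n_ * n_ * batch_, 0.0f) {}

  // Pointer to the `batch` contiguous copies of element (i, j).
  float* at(size_t i, size_t j) { return data.data() + (i * n + j) * batch; }
};

// In-place Cholesky of every matrix in the batch. On return, the lower
// triangle (including the diagonal) holds L with A = L * L^T; the strict
// upper triangle is left untouched. Assumes each matrix is real symmetric
// positive definite; no error checking is done.
void BatchedCholesky(BatchedMatrices& m) {
  const size_t n = m.n;
  const size_t batch = m.batch;
  for (size_t j = 0; j < n; ++j) {
    // Diagonal: L[j][j] = sqrt(A[j][j] - sum_k L[j][k]^2)
    float* diag = m.at(j, j);
    for (size_t k = 0; k < j; ++k) {
      const float* ljk = m.at(j, k);
      for (size_t b = 0; b < batch; ++b) diag[b] -= ljk[b] * ljk[b];
    }
    for (size_t b = 0; b < batch; ++b) diag[b] = std::sqrt(diag[b]);

    // Below the diagonal:
    // L[i][j] = (A[i][j] - sum_k L[i][k] * L[j][k]) / L[j][j]
    for (size_t i = j + 1; i < n; ++i) {
      float* lij = m.at(i, j);
      for (size_t k = 0; k < j; ++k) {
        const float* lik = m.at(i, k);
        const float* ljk = m.at(j, k);
        // Every innermost loop is over the batch: unit stride, no dependence
        // between iterations, and no assumption about the vector width.
        for (size_t b = 0; b < batch; ++b) lij[b] -= lik[b] * ljk[b];
      }
      for (size_t b = 0; b < batch; ++b) lij[b] /= diag[b];
    }
  }
}
```

The matrix dimension (50 to 100 in your case) only affects the outer loop counts; the vector lanes are filled by the batch, so the batch size is what needs to be large relative to the vector length.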