By: Michael S (already5chosen.delete@this.yahoo.com), May 20, 2022 2:48 am
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 20, 2022 1:34 am wrote:
> > May be, for your jobs it is a good layout, for many of my jobs it's too coarse.
> > For example, one of my important workloads is decomposition of Hermitian matrices with N in range
> > 50 to 100.
> Makes sense that unused vector lanes are problematic.
> I wonder if it's feasible to 'transpose' your algorithm such that one vector stores N (we don't
> care exactly how many) of your matrix[0,0], the next vector stores N of matrix[1,0], etc?
> This assumes you're processing batches of matrices.
Sounds like a GPU-style transformation on a CPU. I'd guess that if I wanted a GPU, I'd use a GPU.
Your suggestion would be horrible in my case, for more than one reason.
1. There are not enough independent jobs to fill both all cores and all SIMD lanes of each core independently. Also, the algorithm's flow is not that regular. Decompositions are done in dependent series, with the number of iterations in each series unknown at the beginning. The majority of series have length 1 or 2, but some are longer, and those are the most important from the application's perspective.
2. It would blow out the caches. The L1D hit rate would be close to zero. On something like Skylake Client, which is the most important target, the L2 hit rate would be rather poor too. So all cores would compete for L3 bandwidth, which is not sufficient to feed even one core adequately, much less 6 or 8.
Overall, I'd expect that implementing your suggestion would reduce our AVX2 all-cores throughput by a factor of 5 at least, likely more than that. So if AVX-512 then gives us a factor of 1.7 back, it's very small consolation.
> The advantage would be that you really don't care about
> the vector length. This is pretty much required if you ever want to run on SVE or RISC-V V.
I certainly don't care about RISC-V V and don't expect to care before retirement.
As to SVE, I don't see why it is required. If I were interested, it would most likely be in a specific implementation, and it's pretty much guaranteed that the VL of any implementation will be one of three values: 128, 256, or 512 bits.
> I wish we had done
> this in JPEG XL but the 1x64 memory layout was already too firmly established to change early on. (Still, it
> gives us 2048 bit blocks which is plenty for AVX-512 and enough for SVE, so it's not a terrible layout.)