By: Jan Wassenberg (jan.wassenberg.delete@this.gmail.com), May 18, 2022 10:16 pm
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on May 18, 2022 8:57 pm wrote:
> The visible registers are almost irrelevant [..]
> The real limit is going to be dram bandwidth as I said before.
Agree on both points. We don't have control over the compiler's register allocation decisions; typically I write code with lots of vector lvalues (one per intermediate result for readability). Clang seems pretty good about avoiding unnecessary loads. The open question is how much to unroll, but we typically don't force that either and let the compiler decide, also based on register pressure for that particular target.
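To illustrate the "one vector lvalue per intermediate result" style, here is a minimal sketch. `Vec4`, `Load`, `Store`, `Mul`, `Add` and `MulAddArray` are hypothetical names made up for this example; they stand in for a real SIMD type and its operations and are not any particular library's API. It also assumes the array length is a multiple of four.

```cpp
#include <cstddef>

// Hypothetical 4-lane float vector standing in for a real SIMD type.
struct Vec4 {
  float lane[4];
};

inline Vec4 Load(const float* p) {
  return Vec4{{p[0], p[1], p[2], p[3]}};
}

inline void Store(const Vec4& v, float* p) {
  for (int i = 0; i < 4; ++i) p[i] = v.lane[i];
}

inline Vec4 Mul(const Vec4& a, const Vec4& b) {
  Vec4 r;
  for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] * b.lane[i];
  return r;
}

inline Vec4 Add(const Vec4& a, const Vec4& b) {
  Vec4 r;
  for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
  return r;
}

// Computes out[i] = a[i] * b[i] + c[i]. Assumes n is a multiple of 4.
void MulAddArray(const float* a, const float* b, const float* c, float* out,
                 std::size_t n) {
  // No manual unrolling: the unroll factor is left to the compiler.
  for (std::size_t i = 0; i < n; i += 4) {
    // One named value per intermediate result, for readability. Which of
    // these live in registers is the register allocator's decision.
    const Vec4 va = Load(a + i);
    const Vec4 vb = Load(b + i);
    const Vec4 vc = Load(c + i);
    const Vec4 prod = Mul(va, vb);
    const Vec4 sum = Add(prod, vc);
    Store(sum, out + i);
  }
}
```

The named temporaries cost nothing by themselves; whether any of them is spilled to memory depends on the compiler and the target, not on how many names appear in the source.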
Register allocators still seem unable to beat hand-written code in the case of GEMM kernels that (should) use every single register, but otherwise I'm happy with the results.
In my experience, bandwidth is also almost always the bottleneck. SPR-HBM (Sapphire Rapids with HBM) might actually change that.