By: Brett (ggtgp.delete@this.yahoo.com), May 19, 2022 10:38 pm
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 19, 2022 9:12 pm wrote:
> Charlie Burnes (charlie.burnes.delete@this.no-spam.com) on May 19, 2022 7:29 pm wrote:
> > > No, you have ~90 real registers instead of ~180 with emulated AVX-512 and you only see 16.
> > > Yes, you want to avoid extra load/stores, so fit the code to those 16 registers you see.
> > Just to be sure I am understanding: Are you saying organize the code so that on a processor with AVX2,
> > the intermediate results would use all 16 256-bit visible architecture registers and on a processor with
> > AVX-512, the intermediate results would use 8 of the 32 512-bit visible architecture registers?
>
> We might have a three-way "talking past each other" here :)
> The path I'd generally recommend is: 1) imagine your vectors are tuples of unknown length;
> 2) write your algorithm such that intermediate results (whether used once or multiple times) are lvalues;
> 3) let the compiler worry about mapping those to the 16 (architecturally) visible AVX2 regs.
>
> The open question is how many intermediate results we produce, and the main influence on that is
> whether/how often we unroll loops. If your computation involves 8 live regs, unrolling 4x would not
> be a good idea for AVX2 because it will require the compiler to generate load/store spills. (Yes,
> the CPU has more physical regs, but we still require the loads/stores because we cannot address those
> regs directly.) We can also let the compiler make per-target decisions on how to unroll, and then
> you're using more/most of the AVX-512 regs, without requiring excessive spills on AVX2.
Yes, always use all 16 registers you can see and let the hardware figure it out.
It does not matter whether the real registers underneath are 512, 256, 128, or even 64 bits wide.
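
For concreteness, here is a minimal sketch of the path Jan describes, written against his width-agnostic Highway library. The kernel and names are made up for illustration, and Highway's per-target dispatch boilerplate is omitted, so treat it as a sketch rather than a drop-in:

#include <cstddef>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// out[i] = a[i] * b[i] + c[i]; assumes n is a multiple of the vector
// length (remainder handling omitted for brevity).
void MulAddArrays(const float* a, const float* b, const float* c,
                  float* out, std::size_t n) {
  const hn::ScalableTag<float> d;      // a "tuple of unknown length"
  const std::size_t N = hn::Lanes(d);  // 8 floats on AVX2, 16 on AVX-512
  for (std::size_t i = 0; i < n; i += N) {
    // Each intermediate is a named lvalue; the compiler maps them onto
    // the 16 (AVX2) or 32 (AVX-512) architectural vector registers.
    const auto va = hn::LoadU(d, a + i);
    const auto vb = hn::LoadU(d, b + i);
    const auto vc = hn::LoadU(d, c + i);
    const auto sum = hn::MulAdd(va, vb, vc);  // va * vb + vc
    hn::StoreU(sum, d, out + i);
  }
}

The same source compiles to 256-bit code on an AVX2 target and 512-bit code on an AVX-512 target; only the compiler's register allocation and unrolling decisions change.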
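
And here is a sketch of the unrolling point with plain AVX2 intrinsics, again with a made-up kernel, just to make the live-value count visible:

#include <immintrin.h>
#include <cstddef>

// out[i] = a[i]*b[i] + c[i]*d[i], 8 floats per __m256 (compile with -mavx2).
void Kernel(const float* a, const float* b, const float* c, const float* d,
            float* out, std::size_t n) {
  for (std::size_t i = 0; i + 8 <= n; i += 8) {
    // Roughly seven vector values are in flight in this body.
    __m256 va = _mm256_loadu_ps(a + i);
    __m256 vb = _mm256_loadu_ps(b + i);
    __m256 vc = _mm256_loadu_ps(c + i);
    __m256 vd = _mm256_loadu_ps(d + i);
    __m256 ab = _mm256_mul_ps(va, vb);
    __m256 cd = _mm256_mul_ps(vc, vd);
    _mm256_storeu_ps(out + i, _mm256_add_ps(ab, cd));
  }
}

Unroll that body 4x and you have on the order of 28 values live at once, well past the 16 ymm names AVX2 exposes, so the compiler has to spill to the stack even though the core has ~180 physical registers it cannot name. With 32 zmm names on AVX-512 the same unroll fits, which is why it's better to leave the unroll factor to the compiler's per-target heuristics.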