By: Brett (ggtgp.delete@this.yahoo.com), May 18, 2022 8:57 pm
Room: Moderated Discussions
Charlie Burnes (charlie.burnes.delete@this.no-spam.com) on May 18, 2022 3:55 pm wrote:
> Intel’s AVX-512 has 32 512-bit registers compared to the 16 256-bit registers in AVX2. Suppose I need code
> to run on both x86 CPUs with and without AVX-512. Should I try to write the code in a way that needs only
> eight 512-bit registers so that Highway can use two 256-bit registers and two AVX2 instructions to emulate
> an AVX-512 instruction? This would minimize loads and stores on AVX2 which has only 16 256-bit registers.
The architecturally visible registers are almost irrelevant, since an out-of-order core renames them onto a much larger physical register file. I would say the effective register limit is the size of the OoO buffer divided by 2, so around 90 AVX registers?
This basically only affects poorly predicted code, which vector code generally is not.
The real limit is going to be DRAM bandwidth, as I said before.
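To put a rough number on the bandwidth point, here is a back-of-the-envelope sketch using a roofline-style bound (my choice of model for illustration, not something measured). The machine figures `PEAK_GFLOPS` and `DRAM_GB_PER_S` and the helper `attainable_gflops` are made-up assumptions. Plug in your own CPU's numbers.

```python
def attainable_gflops(flops_per_byte, peak_gflops, dram_gb_per_s):
    """Roofline-style upper bound: the lower of the compute peak and
    (DRAM bandwidth * arithmetic intensity)."""
    return min(peak_gflops, dram_gb_per_s * flops_per_byte)


# Hypothetical machine numbers, chosen only for illustration.
PEAK_GFLOPS = 1000.0
DRAM_GB_PER_S = 50.0

# Streaming float32 add, c[i] = a[i] + b[i]:
# 1 flop per element, 12 bytes moved per element
# (two 4-byte loads and one 4-byte store, ignoring write-allocate traffic).
stream_add_intensity = 1.0 / 12.0

bound = attainable_gflops(stream_add_intensity, PEAK_GFLOPS, DRAM_GB_PER_S)
print(f"attainable: {bound:.2f} GFLOP/s of a {PEAK_GFLOPS:.0f} GFLOP/s peak")
```

Under these assumed numbers a streaming kernel is capped by memory traffic far below the compute peak, no matter how many vector registers the code uses. Only kernels with high reuse per byte loaded get close to the point where register pressure could matter.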