By: Brett (ggtgp.delete@this.yahoo.com), May 19, 2022 4:18 pm
Room: Moderated Discussions
Charlie Burnes (charlie.burnes.delete@this.no-spam.com) on May 19, 2022 3:05 pm wrote:
> Even if a store followed by a load of the same address is implemented with register renaming, it would still
> use a store slot and a load slot in the execution engine and two decode slots. So it seems to me in order
> to use two 256-bit registers to hold the contents of a 512-bit register, it is best if the code is organized
> in a way that only needs 8 512-bit registers (since AVX2 threads only have 16 256-bit registers). I would
> be giving up some AVX-512 performance but I would avoid extra loads and stores for the AVX2 code. I think
> most of the users of my software will have consumer processors without AVX-512 today, but I want to get
> the extra performance from AVX-512 when it is available because the problem is very compute intensive. Hopefully,
> more processors will have high-performance implementations of AVX-512 in the future.
No. With emulated AVX-512 you have ~90 real (physical) registers instead of ~180, and either way you only see 16 of them.
Yes, you want to avoid extra loads and stores, so fit the code to those 16 registers you see.
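To put rough numbers on the register budget being discussed, here is a minimal sketch of the arithmetic. It assumes 64-bit mode (16 visible ymm registers under AVX2, 32 visible zmm registers under AVX-512) and that each 512-bit value is held as a pair of 256-bit registers. The function names are hypothetical, and the model ignores scratch registers for temporaries, so a real kernel would likely fit somewhat fewer values.

```python
# Back-of-the-envelope register budget; not a model of any specific CPU.
AVX2_ARCH_REGS = 16    # ymm0..ymm15 visible to a 64-bit AVX2 thread
AVX512_ARCH_REGS = 32  # zmm0..zmm31 visible to a 64-bit AVX-512 thread
HALVES_PER_512 = 2     # assumption: one 512-bit value = two 256-bit halves


def avx2_budget_for_512bit_values() -> int:
    """Number of 512-bit values that fit in the visible AVX2 registers as pairs."""
    return AVX2_ARCH_REGS // HALVES_PER_512


def spilled_512bit_values(live_values: int) -> int:
    """512-bit values that would have to live in memory on the AVX2 path.

    live_values is the peak number of 512-bit values the kernel keeps live.
    Temporaries are ignored, so this is an optimistic lower bound on spills.
    """
    return max(0, live_values - avx2_budget_for_512bit_values())


if __name__ == "__main__":
    for live in (8, 12, AVX512_ARCH_REGS):
        print(live, "live 512-bit values ->", spilled_512bit_values(live), "spilled on AVX2")
```

Under these assumptions, a kernel written around 8 live 512-bit values maps onto AVX2 pairs with no spills, while one that uses all 32 zmm registers would push 24 values to memory on the AVX2 path.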