By: Charlie Burnes (charlie.burnes.delete@this.no-spam.com), May 19, 2022 3:05 pm
Room: Moderated Discussions
Even if a store followed by a load from the same address is handled by register renaming, it would still use a store slot and a load slot in the execution engine, plus two decode slots. So it seems to me that if two 256-bit registers are going to hold the contents of one 512-bit register, it is best to organize the code so that it needs only 8 512-bit registers at a time (since AVX2 only has 16 256-bit registers per thread). I would be giving up some AVX-512 performance by not using all 32 of its registers, but I would avoid extra loads and stores in the AVX2 code. I think most users of my software today will have consumer processors without AVX-512, but I want the extra performance from AVX-512 when it is available because the problem is very compute intensive. Hopefully, more processors will have high-performance implementations of AVX-512 in the future.
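To make the idea concrete, here is a minimal sketch of what I mean, not a finished design. The names `Vec512`, `load512`, `store512`, `add512` and `add_arrays` are hypothetical; only the `_mm256_*` and `_mm512_*` intrinsics are real. One logical 512-bit value maps to a single ZMM register when AVX-512 is enabled at compile time, to a pair of YMM registers under AVX2, and to a plain array otherwise so the code still builds everywhere.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__AVX512F__) || defined(__AVX2__)
#include <immintrin.h>
#endif

#if defined(__AVX512F__)
// AVX-512 build: one logical vector is one ZMM register.
struct Vec512 { __m512 v; };
inline Vec512 load512(const float* p) { return { _mm512_loadu_ps(p) }; }
inline void store512(float* p, Vec512 a) { _mm512_storeu_ps(p, a.v); }
inline Vec512 add512(Vec512 a, Vec512 b) { return { _mm512_add_ps(a.v, b.v) }; }
#elif defined(__AVX2__)
// AVX2 build: one logical vector is a pair of YMM registers,
// so 8 live Vec512 values already occupy all 16 YMM registers.
struct Vec512 { __m256 lo, hi; };
inline Vec512 load512(const float* p) {
    return { _mm256_loadu_ps(p), _mm256_loadu_ps(p + 8) };
}
inline void store512(float* p, Vec512 a) {
    _mm256_storeu_ps(p, a.lo);
    _mm256_storeu_ps(p + 8, a.hi);
}
inline Vec512 add512(Vec512 a, Vec512 b) {
    return { _mm256_add_ps(a.lo, b.lo), _mm256_add_ps(a.hi, b.hi) };
}
#else
// Portable fallback so the sketch compiles without any SIMD flags.
struct Vec512 { float lane[16]; };
inline Vec512 load512(const float* p) {
    Vec512 r;
    for (int i = 0; i < 16; ++i) r.lane[i] = p[i];
    return r;
}
inline void store512(float* p, Vec512 a) {
    for (int i = 0; i < 16; ++i) p[i] = a.lane[i];
}
inline Vec512 add512(Vec512 a, Vec512 b) {
    Vec512 r;
    for (int i = 0; i < 16; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
#endif

// Element-wise out = a + b. Assumes n is a multiple of 16 floats.
inline void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 16) {
        store512(out + i, add512(load512(a + i), load512(b + i)));
    }
}
```

A kernel written against this interface with at most 8 `Vec512` values live at once should fit in the 16 YMM registers of the AVX2 build, though the compiler's register allocator has the final say and may still spill. The same source then compiles to single ZMM operations when AVX-512 is enabled.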