By: Charlie Burnes (charlie.burnes.delete@this.no-spam.com), May 19, 2022 6:29 pm
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on May 19, 2022 4:18 pm wrote:
> Charlie Burnes (charlie.burnes.delete@this.no-spam.com) on May 19, 2022 3:05 pm wrote:
> > Even if a store followed by a load of the same address
> > is implemented with register renaming, it would still
> > use a store slot and a load slot in the execution engine and two decode slots. So it seems to me in order
> > to use two 256-bit registers to hold the contents of a
> > 512-bit register, it is best if the code is organized
> > in a way that only needs 8 512-bit registers (since AVX2 threads only have 16 256-bit registers). I would
> > be giving up some AVX-512 performance but I would avoid extra loads and stores for the AVX2 code. I think
> > most of the users of my software will have consumer processors without AVX-512 today, but I want to get
> > the extra performance from AVX-512 when it is available because
> > the problem is very compute intensive. Hopefully,
> > more processors will have high-performance implementations of AVX-512 in the future.
>
> No, you have ~90 real registers instead of ~180 with emulated AVX-512 and you only see 16.
> Yes, you want to avoid extra load/stores, so fit the code to those 16 registers you see.
Just to be sure I am understanding: Are you saying I should organize the code so that on a processor with AVX2, the intermediate results would use all 16 of the 256-bit architectural registers, and on a processor with AVX-512, the intermediate results would use 8 of the 32 512-bit architectural registers?
What do you mean by “No, you have ~90 real registers instead of ~180 with emulated AVX-512”?
When emulating AVX-512 on a processor without AVX-512, why do I have more real registers? I don’t understand why emulating AVX-512 changes the number of physical registers. Also, I don’t understand why the number of physical registers (as opposed to visible architecture registers) should ever be a concern of the programmer. If the instruction window for out-of-order execution is made bigger on some future processor, my code would not change but the future processor would have a bigger physical register file.
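To make the register budget in my first question concrete, here is a minimal sketch in portable C. It uses no intrinsics; the struct `vec512_emul`, the function `vec512_emul_add`, and the constant `LIVE_512_VALUES` are names I made up for illustration. The arrays only stand in for register halves, and whether a compiler really keeps them in ymm registers is an assumption, not something this code guarantees.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>

/* Architectural vector register counts in 64-bit mode. */
enum {
    AVX2_VISIBLE_REGS   = 16, /* ymm0..ymm15, 256 bits each */
    AVX512_VISIBLE_REGS = 32, /* zmm0..zmm31, 512 bits each */
    LIVE_512_VALUES     = 8   /* hypothetical budget of live 512-bit intermediates */
};

/* Each emulated 512-bit value occupies two 256-bit registers,
   so 8 live values fill the 16 registers visible to AVX2 code. */
static_assert(LIVE_512_VALUES * 2 <= AVX2_VISIBLE_REGS,
              "AVX2 path would need extra loads and stores");
static_assert(LIVE_512_VALUES <= AVX512_VISIBLE_REGS,
              "AVX-512 path would need extra loads and stores");

/* Hypothetical stand-in for one 512-bit value (16 floats)
   split into two 256-bit halves (8 floats each). */
typedef struct {
    float lo[8]; /* lower 256 bits */
    float hi[8]; /* upper 256 bits */
} vec512_emul;

/* One emulated 512-bit add becomes two independent 256-bit adds. */
static vec512_emul vec512_emul_add(vec512_emul a, vec512_emul b)
{
    vec512_emul r;
    for (size_t i = 0; i < 8; ++i) {
        r.lo[i] = a.lo[i] + b.lo[i];
        r.hi[i] = a.hi[i] + b.hi[i];
    }
    return r;
}
```

The two compile-time checks are the whole point: with 8 live 512-bit values, the AVX2 version uses exactly the 16 registers it can see, while the AVX-512 version leaves 24 of its 32 registers idle.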