By: Anon (no.delete@this.spam.com), November 8, 2022 6:31 pm
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on November 8, 2022 8:53 am wrote:
> No, it does not show this with certainty. More tests are necessary.
>
> The sequence FMA512, FMA256, FMA512, FMA256 ... could be reordered as FMA512, FMA512,
> FMA256, FMA256 ... and executed in 3 clock cycles by processing the halves of a 512-bit
> operand in the same cycle and two 256-bit operations also in a single cycle.
Make the operations dependent. Each stream of dependent single-cycle AVX256 instructions would consume one port at full throughput. If AVX512 were executed by coupling the units, then adding a few AVX512 instructions would increase the latency of the AVX256 stream; if AVX512 is executed serially, the latency of the dependent AVX256 instructions would not change. On Zen 3, VPAVGB has a throughput of 2 per clock and a latency of 1, so it would be perfect for this test if Zen 4 keeps the same throughput.
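To make the expected difference concrete, here is a rough toy model in Python, not a measurement. It assumes two 256-bit units, a perfect scheduler, 1-cycle AVX256 operations, and that a "coupled" AVX512 op takes both units for one cycle while a "serial" one takes a single unit for two cycles. The function names are made up for illustration and the real Zen 4 pipeline is certainly more complicated.

```python
# Toy model only: two 256-bit units, ideal scheduling, 1-cycle AVX256 ops.
# "coupled": a 512-bit op occupies both units for one cycle.
# "serial":  a 512-bit op occupies one unit for two consecutive cycles.

def mix_cycles_coupled(n512, n256):
    # Independent ops: each 512-bit op takes a full cycle,
    # and the 256-bit ops pair up two per cycle.
    return n512 + (n256 + 1) // 2

def mix_cycles_serial(n512, n256):
    # Independent ops: every 512-bit op costs two 256-bit slots,
    # and there are two slots available per cycle.
    return (2 * n512 + n256 + 1) // 2

def chain_cycles_coupled(chain_len, n512):
    # One dependent AVX256 chain: every 512-bit op steals both units
    # for a cycle, so the chain cannot advance in that cycle.
    return chain_len + n512

def chain_cycles_serial(chain_len, n512):
    # One dependent AVX256 chain: the 512-bit ops run on the other unit,
    # so the chain is only slowed if that unit becomes the bottleneck.
    return max(chain_len, 2 * n512)

# Independent FMA512, FMA512, FMA256, FMA256: same cycle count either way.
print(mix_cycles_coupled(2, 2), mix_cycles_serial(2, 2))

# Dependent chain of 100 AVX256 ops plus a few AVX512 ops: the models differ.
for n512 in (0, 4, 16):
    print(n512, chain_cycles_coupled(100, n512), chain_cycles_serial(100, n512))
```

In this model the independent mix takes the same number of cycles under both assumptions, which is why a pure throughput test cannot decide between them, while the dependent chain only gets longer in the coupled case.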
> I am skeptical that AMD has chosen the variant with sequential processing of the halves, because
> that creates problems for the few instructions that need to access both halves. I would not have
> chosen this variant, because I do not believe that it has any advantage in cost or performance over
> the multiple alternatives that can process simultaneously the two halves of a 512-bit operand.
Which problems? Keep in mind that the units may keep intermediate results.
I think simultaneous execution is extremely unlikely. AMD does not use a unified scheduler (unlike Intel), so executing an instruction in both units would require touching both FPU schedulers. How would you implement this? I think serial execution is more likely because the implementation would be much simpler, and AMD has already implemented things like that before. Serially executing a vector is trivial compared to what they have already done.