By: Adrian (a.delete@this.acm.org), August 31, 2022 10:10 am
Room: Moderated Discussions
Heikki Kultala (heikki.kult.ala.delete@this.gmail.com) on August 31, 2022 7:38 am wrote:
> Adrian (a.delete@this.acm.org) on August 31, 2022 1:25 am wrote:
> > In the list provided by AMD for contributors to enhanced performance at the same clock frequency,
> > the second place after the new front-end was occupied by load/store enhancements.
>
> .. and which has NOTHING to do with AVX-512.
No available information allows the conclusion that this has something to do with AVX-512, nor that it has nothing to do with AVX-512.
You may guess that it has nothing to do with AVX-512, but you should present arguments to support your guess.
I have presented solid arguments that this change was made necessary by the support of AVX-512. Keeping the Zen 3 L1 cache bandwidth would result in a design that is unbalanced for AVX-512 and in significantly lower performance than all Intel CPUs.
It would have been stupid for the AMD designers to implement AVX-512 support in that way, i.e. in a way that guarantees that Zen 4 would be inferior to the competition.
>
> AMD reported average 13% IPC improvement, and most of the software on the
> list used to calculate the IPC improvement did not use AVX-512 at all.
Increasing the LD/ST bandwidth to two 512-bit loads per cycle plus one 512-bit store per cycle is certain to also allow increased throughput for the 256-bit loads and stores.
Therefore it is likely that Zen 4 is able to do up to three 256-bit loads per cycle instead of up to two 256-bit loads in Zen 3 and up to two 256-bit stores instead of up to one 256-bit store in Zen 3.
Such behavior would be very similar to that of Golden Cove in Alder Lake. The associated increase in AVX load/store bandwidth would therefore easily explain the IPC gains from the table.
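To put numbers on the comparison above, here is a minimal sketch of the peak L1 bandwidth in bytes per cycle. The Zen 3 figures are the ones stated in this post; the Zen 4 figures are only my speculation, not anything AMD has published, and the helper name `bytes_per_cycle` is made up for the illustration.

```python
def bytes_per_cycle(ops_per_cycle: int, width_bits: int) -> int:
    """Peak bytes moved per cycle for a given op count and op width."""
    return ops_per_cycle * width_bits // 8

# Zen 3, as described in this post.
zen3_load = bytes_per_cycle(2, 256)
zen3_store = bytes_per_cycle(1, 256)

# Zen 4: ASSUMED figures (speculation, not published by AMD).
zen4_load_avx512 = bytes_per_cycle(2, 512)
zen4_store_avx512 = bytes_per_cycle(1, 512)
zen4_load_avx = bytes_per_cycle(3, 256)
zen4_store_avx = bytes_per_cycle(2, 256)

print("Zen 3 loads/stores (B/cycle):", zen3_load, zen3_store)
print("Zen 4 AVX-512 loads/stores (B/cycle, assumed):", zen4_load_avx512, zen4_store_avx512)
print("Zen 4 AVX loads/stores (B/cycle, assumed):", zen4_load_avx, zen4_store_avx)
```

Under these assumptions the 256-bit load bandwidth would grow by half and the 256-bit store bandwidth would double, even for code that never executes an AVX-512 instruction.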
>
> > I interpret this AMD claim that in Zen 4 the load and store bandwidth between registers
> > and the L1 data cache memory has been doubled in comparison with Zen 3.
>
> That is not an interpretation. That is stupid speculation that has NOTHING to do with
> the original text where you claim to base it. It's a VERY BAD misinterpretation.
>
I have said from the beginning that this is speculation, because AMD has not yet provided any information about the Zen 4 microarchitecture.
It is OK for you to disagree, but please present some arguments for your opinion, because I do not see any.
>
> Widening the memory data paths would be very expensive, and would do absolutely NOTHING
> to improve the performance of workloads that do not use AVX-512. And almost all of the
> software that AMD used to calculate this 13% IPC improvement does NOT use AVX-512.
>
> The load/store improvement that gives part of that 13% has to be something totally unrelated to AVX-512.
As I have said above, the increased bandwidth for the AVX-512 LD/ST in the Intel CPUs also allows an increased bandwidth for AVX. It should be expected that AMD does the same thing.
This is especially expected because Zen 3 cannot use all its load/store execution units because of insufficient bandwidth to the L1 data cache.
Zen 3 is already able to do 3 loads per cycle, but it is limited to only two 256-bit loads because its link with the cache is too narrow. The same holds for stores: Zen 3 can already do 2 stores per cycle, but it is limited to only one 256-bit store by the narrow link.
So they did not need to change anything in the load/store units; they only needed to double the width of the connection to the L1 data cache, both to improve the AVX LD/ST bandwidth and to provide enough LD/ST bandwidth for AVX-512.
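The argument can be written down as a toy model in which the sustained number of operations per cycle is capped both by the number of load/store units and by the width of the link to the L1 data cache. This is only a sketch of my reasoning: the link widths are inferred from the Zen 3 limits mentioned above, the doubled widths are my assumption for Zen 4, and `sustained_ops` is a hypothetical helper, not a description of the real hardware.

```python
def sustained_ops(units: int, link_bits: int, op_bits: int) -> int:
    """Ops per cycle when both the unit count and the link width cap throughput."""
    return min(units, link_bits // op_bits)

LOAD_UNITS, STORE_UNITS = 3, 2  # Zen 3 unit counts, as stated in this post

# Link widths inferred from "two 256-bit loads, one 256-bit store" on Zen 3.
ZEN3_LOAD_LINK, ZEN3_STORE_LINK = 512, 256

# ASSUMPTION: Zen 4 keeps the same units and only doubles the link widths.
ZEN4_LOAD_LINK, ZEN4_STORE_LINK = 2 * ZEN3_LOAD_LINK, 2 * ZEN3_STORE_LINK

print("Zen 3, 256-bit loads/stores:",
      sustained_ops(LOAD_UNITS, ZEN3_LOAD_LINK, 256),
      sustained_ops(STORE_UNITS, ZEN3_STORE_LINK, 256))
print("Doubled link, 256-bit loads/stores:",
      sustained_ops(LOAD_UNITS, ZEN4_LOAD_LINK, 256),
      sustained_ops(STORE_UNITS, ZEN4_STORE_LINK, 256))
print("Doubled link, 512-bit loads/stores:",
      sustained_ops(LOAD_UNITS, ZEN4_LOAD_LINK, 512),
      sustained_ops(STORE_UNITS, ZEN4_STORE_LINK, 512))
```

With the doubled link the 256-bit case becomes limited by the number of units rather than by the link, while the 512-bit case is still limited by the link.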
>
> Also, AFAIK there are no 512-bit registers on Zen4.
Zen 4 supports AVX-512, therefore it *MUST* have 32 512-bit registers. There is no doubt about that.