By: Adrian (a.delete@this.acm.org), August 31, 2022 10:30 am
Room: Moderated Discussions
Marcus (m.delete@this.bitsnbites.eu) on August 31, 2022 8:55 am wrote:
> Heikki Kultala (heikki.kult.ala.delete@this.gmail.com) on August 31, 2022 7:38 am wrote:
> >
> > Widening the memory data paths would be very expensive, and would do absolutely NOTHING
> > to improve the performance of workloads that do not use AVX-512. And almost all of the
> > software that AMD used to calculate this 13% IPC improvement does NOT use AVX-512.
> >
>
> In Adrian's defense, that depends on what you mean by "widening". If the widening is done by adding
> more load ports, you would effectively cater for both 512-bit loads (they can use several load ports)
> and narrower loads (e.g. several loads from the stack during a function epilogue).
>
> Now, I don't know if that is actually what they have done, but it would make *some* sense to make this improvement
> along with introducing AVX-512 (in order to avoid load bandwidth becoming a bottleneck for AVX-512 workloads).
As I have also mentioned in my reply to Heikki, it is fairly certain that AMD did not add any load or store ports.
Zen 3 can already do up to 3 loads per cycle and up to 2 stores per cycle.
However, it has a narrow connection to the L1 data cache that makes it impossible to use all load and store ports for AVX, limiting the throughput to only two 256-bit loads and one 256-bit store per cycle.
If Zen 4 has doubled the width of the connection to the L1 data cache, in order to match the AVX-512 load/store throughput of the Intel CPUs, then that automatically also allows the AVX throughput to rise to three 256-bit loads/stores per cycle, of which up to 2 can be stores.
That increased AVX throughput would explain the IPC increase in the legacy benchmarks.
Until AMD presents the Zen 4 microarchitecture we cannot know for sure, but I cannot believe that AMD would have designed Zen 4 to be inferior to the competition.
The IPC improvement that the presentation attributes to loads and stores cannot mean anything other than a wider connection to the L1 data cache, which was a bottleneck in Zen 3. It cannot mean more load/store ports, as those already present in Zen 3 cannot be fully used.
To achieve an improvement in AVX over Zen 3, the cache link for loads must be widened from 512 bits per cycle to 768 bits per cycle, and the cache link for stores from 256 bits per cycle to 512 bits per cycle.
Once the cache link is widened that much, it would be extremely stupid not to widen the load link a little further, to 1024 bits per cycle, both to match the performance of the Intel CPUs and to provide balanced load/store bandwidth for AVX-512.
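The bandwidth arithmetic above can be sketched in a few lines. This is only an illustration of the numbers in this post: the Zen 3 figures are as stated above, and the Zen 4 link widths are my speculation, not confirmed by AMD.

```python
# Sketch of the per-cycle L1D bandwidth arithmetic used above.
# Port counts and widths for Zen 3 are from the post; the Zen 4
# figures are the hypothesized widening, not confirmed numbers.

def bandwidth_bits(ops_per_cycle: int, width_bits: int) -> int:
    """Peak data moved per cycle: number of ops times width of each op."""
    return ops_per_cycle * width_bits

# Zen 3: 3 load ports and 2 store ports exist, but the narrow L1D link
# only sustains 2 x 256-bit loads and 1 x 256-bit store per cycle.
zen3_load  = bandwidth_bits(2, 256)   # 512 bits/cycle
zen3_store = bandwidth_bits(1, 256)   # 256 bits/cycle

# Hypothesized Zen 4: all 3 load ports and both store ports usable at
# 256 bits each, i.e. a 768-bit load link and a 512-bit store link.
zen4_load_min = bandwidth_bits(3, 256)   # 768 bits/cycle
zen4_store    = bandwidth_bits(2, 256)   # 512 bits/cycle

# Widening loads a little further to 1024 bits/cycle would match
# 2 x 512-bit AVX-512 loads per cycle, as on the Intel CPUs.
intel_load = bandwidth_bits(2, 512)      # 1024 bits/cycle

print(zen3_load, zen3_store, zen4_load_min, zen4_store, intel_load)
```

Under these assumptions the load link grows 512 → 768 (or 1024) bits/cycle and the store link 256 → 512 bits/cycle, which is exactly the widening argued for above.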