By: Adrian (a.delete@this.acm.org), August 31, 2022 1:25 am
Room: Moderated Discussions
anonymous2 (anonymous2.delete@this.example.com) on August 29, 2022 5:08 pm wrote:
> AVX-512 (ISA details murky) on Zen 4 but 2 cycles vs 1 on Intel so only 256b internally.
>
> Small win for those who want the ISA, but from a performance perspective limited value?
>
In the list provided by AMD for contributors to enhanced performance at the same clock frequency, the second place after the new front-end was occupied by load/store enhancements.
I interpret this AMD claim as meaning that in Zen 4 the load and store bandwidth between the registers and the L1 data cache has been doubled in comparison with Zen 3.
Most Intel CPUs that support AVX-512 can initiate in each clock cycle two 512-bit register-register operations, two 512-bit loads and one 512-bit store.
Zen 3 can initiate in each cycle four 256-bit register-register operations, two 256-bit loads and one 256-bit store. By pairing 256-bit pipelines, Zen 4 would have been able to initiate in each cycle two 512-bit register-register operations and one 512-bit load, but one 512-bit store could have been initiated only every other cycle.
That would have matched Intel in register-register operations, but would have been worse for load and store.
So I assume that Zen 4 has been improved to be able to do two 512-bit loads and one 512-bit store per cycle. Thus the throughput of Zen 4 for AVX-512 should match very closely that of the Intel CPUs which lack the second FMA unit, at the same clock frequency.
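As a back-of-the-envelope check, the per-cycle figures assumed above can be compared directly. The numbers here are my assumptions from the preceding paragraphs, not measured values:

```python
# Assumed per-cycle 512-bit issue rates (assumptions, not measurements):
# Zen 4 with paired 256-bit pipes and doubled L1 bandwidth, versus
# Intel AVX-512 parts lacking the second FMA unit.
zen4 = {"fma": 1, "load": 2, "store": 1}
intel_one_fma = {"fma": 1, "load": 2, "store": 1}

def dp_flops_per_cycle(cfg):
    lanes = 512 // 64             # 8 doubles per 512-bit register
    return cfg["fma"] * lanes * 2 # one FMA counts as 2 flops per lane

print(dp_flops_per_cycle(zen4))   # 16
print(zen4 == intel_one_fma)      # True: the assumed throughputs match
```

Both configurations give the same 1:2 FMA-to-load ratio discussed below.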
If, as I expect, Zen 4 has a 1:2 ratio of FMA to load, this has important implications for the optimization of many programs.
For about 3 decades, the fastest CPUs have had a ratio of 1:1 between FMA and load (typically two FMA and two loads per cycle).
While in recent years the cheaper models of the Intel CPUs with AVX-512 have also had a 1:2 FMA-to-load ratio, I do not believe that there have been many cases where someone has bothered to write two distinct code paths in a program, one for the more expensive CPUs with a 1:1 ratio and one for the cheaper CPUs with a 1:2 ratio.
If AMD CPUs come to make up a significant fraction of the CPUs with AVX-512 support, it may become worthwhile to optimize certain functions specifically for CPUs with a 1:2 FMA-to-load ratio.
While the schoolbook definitions of most linear algebra operations use a scalar product in the innermost loop, implementing them with scalar products is normally the worst possible choice, due to the high load-to-FMA ratio and the data dependencies between successive operations: each FMA must wait for the previous one, so a single-accumulator loop runs at FMA latency rather than FMA throughput.
So in practice the loops are reordered. For some operations, e.g. the matrix-matrix product, there are two choices for the innermost loop, either an AXPY operation or a rank-one matrix update. For other operations, e.g. the matrix-vector product, an AXPY operation in the innermost loop is the only choice.
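To make the loop-ordering choices concrete, here is a plain-Python sketch of the same product C += A*B with each of the three innermost loops (illustrative only; a real kernel would be vectorized and register-blocked):

```python
def matmul_dot(A, B, C):
    # ijk order: the innermost loop is a scalar (dot) product.
    # Each FMA needs two loads and depends on the previous FMA's result.
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]

def matmul_axpy(A, B, C):
    # ipj order: the innermost loop is an AXPY on a row of C,
    # C[i,:] += A[i][p] * B[p,:].
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for p in range(k):
            a = A[i][p]
            for j in range(n):
                C[i][j] += a * B[p][j]

def matmul_rank1(A, B, C):
    # pij order: each pass of the outer loop is a rank-one update,
    # C += outer(A[:,p], B[p,:]).
    m, k, n = len(A), len(B), len(B[0])
    for p in range(k):
        for i in range(m):
            a = A[i][p]
            for j in range(n):
                C[i][j] += a * B[p][j]
```

All three compute the same result; they differ only in which loop is innermost, and therefore in their FMA-to-load ratios and dependency chains.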
For CPUs with a 1:1 FMA-to-load ratio, rank-one matrix updates are preferred in the innermost loop whenever possible, because when the destination is kept in registers, they consist of loading two vectors from the cache and then computing their tensor product in registers. This requires many more FMAs than loads. During the time when no loads are needed for the computation, the spare load slots can be used to move data (i.e. array blocks) between memory and cache and between cache levels. Thus it is possible to run the computation at a speed very close to the number of FMA operations that can be done per clock cycle.
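A rough operation count makes this concrete. For one step of a rank-one update onto an r x c destination block held in registers (the block sizes here are illustrative, not a claim about any particular kernel):

```python
def rank1_step_counts(r, c):
    # One rank-one update step on an r x c destination block kept in
    # registers: load r elements of one vector and c elements of the
    # other, then do r*c FMAs entirely in registers.
    loads = r + c
    fmas = r * c
    return fmas, loads

fmas, loads = rank1_step_counts(4, 4)
print(fmas, loads)   # 16 FMAs for only 8 loads, a 2:1 FMA-to-load ratio
```

The larger the register block, the higher the FMA-to-load ratio, which is what leaves load slots free for prefetching the next blocks.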
When the destination of an AXPY operation is in registers, it requires an equal number of FMAs and loads. When this is done on a CPU with a 1:1 FMA-to-load ratio, the CPU must alternate between times when it executes FMAs at the maximum possible rate and times when it executes the extra loads needed to move array blocks between memory and cache and between cache levels. So the speed will always be only a fraction of the FMA-limited speed.
On the other hand, on CPUs with a 1:2 FMA-to-load ratio, both rank-one updates and AXPY operations can reach the maximum theoretical speed given by the number of FMA operations per clock cycle, because during each cycle of an AXPY operation an extra load is available to move data between memory and cache or between cache levels.
It is likely that reaching the maximum theoretical speed on Zen 4 and similar CPUs when doing AXPY operations, e.g. when multiplying a matrix by a vector or when solving a triangular system of equations, will require a reordering and interleaving of the operations that would have had no effect on the speed of more traditional CPUs.
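One way such an interleaving might look for a matrix-vector product y += A x (a sketch under my assumptions above, not tuned code): unrolling the column loop by two halves the traffic on y, since each y element is read and written once per pair of columns instead of once per column:

```python
def matvec_axpy_unrolled2(A, x, y):
    # y += A @ x with the AXPY (column) loop unrolled by 2: each pass
    # over y consumes two columns of A, so per y element the loop does
    # 2 FMAs for 3 loads (A[i][j], A[i][j+1], y[i]) instead of
    # 1 FMA for 2 loads in the straightforward one-column version.
    m, n = len(A), len(x)
    j = 0
    while j + 1 < n:
        x0, x1 = x[j], x[j + 1]
        for i in range(m):
            y[i] += A[i][j] * x0 + A[i][j + 1] * x1
        j += 2
    if j < n:                      # leftover column when n is odd
        x0 = x[j]
        for i in range(m):
            y[i] += A[i][j] * x0
```

On a traditional 1:1 machine this reordering changes nothing, since loads are not the bottleneck either way; on an assumed 1:2 machine it brings the per-cycle FMA and load demands closer to the hardware's ratio.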