By: Adrian (a.delete@this.acm.org), January 8, 2021 2:21 pm
Room: Moderated Discussions
hobold (hobold.delete@this.vectorizer.org) on January 8, 2021 12:44 pm wrote:
> Adrian (a.delete@this.acm.org) on January 2, 2021 2:45 am wrote:
>
> [...]
> > In computational benchmarks where the number and speed of the available execution resources matter
> > most, unlike in GB5 or SPEC, where the higher *average* IPC of Apple shines, the advantage of Zen
> > 3 over Apple M1 increases, being e.g. of over 14% @ 4.9 GHz for gmpbench (7337 vs. 6422).
>
> That's an interesting contradiction. Apple M1 does have
> fewer and slower execution resources, but higher IPC.
>
> Pure speculation: Apple M1 can sometimes (rarely) execute a pair of dependent, simple
> instructions within one single clock cycle. This would occasionally (rarely) gain one
> cycle on the other processors that require two subsequent cycles for the dependency.
>
> Naturally this would be a very narrow, very targeted optimization. But there just might
> be pairs that are frequent enough to be worth it. Maybe something like shift (by a constant)
> and add. Generally, a chain of two-operand instructions, such that the whole chain does
> not read / write more registers than what can be done in a single cycle.
We do not know whether Apple M1 has fewer execution resources. It seems to have about the same resources as Intel, but maybe fewer than the new Zen 3 has for certain AVX instructions.
However, the M1 resources are necessarily slower, because M1 runs at 3.2 GHz vs. 4.8 to 5.0 GHz for Zen 3.
In the case of gmpbench, a non-negligible part of the execution time is spent doing one 64-bit multiplication per clock cycle in both AMD Zen 3 and Apple M1, with a few other instructions overlapped with the multiplications.
Even if M1 has the potential to overlap more instructions than Zen 3, that does not matter for this part of the execution time, because its duration is determined by the back-to-back multiplications, so on Zen 3 it takes just about 2/3 of the time it takes on M1 (3.2 GHz / 4.9 GHz ≈ 0.65). For that part of the benchmark, both M1 and Zen 3 have the same IPC, which cannot be increased when an execution resource is used at 100% of its capacity.
On other parts of the benchmark, M1 manages to overlap more instructions than Zen 3, i.e. it has a higher IPC, so the advantage of Zen 3 is smaller than the clock frequency ratio, but still larger than in benchmarks where M1 can maintain a higher IPC during the entire benchmark.
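As a rough illustration of the argument above, here is a minimal two-phase sketch in Python. It is only a toy model, not a measurement: the function name `zen3_speedup` and the parameter `mul_bound_fraction` are made up for this example, the 1.5x IPC ratio is borrowed from the SPEC/GB5 figure and merely assumed to hold for the non-multiplier part, and the gmpbench score ratio is treated as if it were a plain time ratio.

```python
def zen3_speedup(mul_bound_fraction, m1_ipc_ratio=1.5, m1_ghz=3.2, zen3_ghz=4.9):
    """Zen 3 speedup over M1 in a hypothetical two-phase model.

    mul_bound_fraction: share of M1's run time limited by the multiplier
                        (assumed one multiplication per cycle on both CPUs).
    m1_ipc_ratio:       assumed M1 IPC / Zen 3 IPC for the remaining work.
    """
    clock_ratio = m1_ghz / zen3_ghz  # time per cycle on Zen 3 relative to M1

    # Phase 1: multiplier-bound, same cycle count on both CPUs,
    # so Zen 3 gains only from its higher clock.
    t_mul = mul_bound_fraction * clock_ratio

    # Phase 2: M1 overlaps more instructions, so Zen 3 needs
    # m1_ipc_ratio times as many cycles for the same work.
    t_rest = (1.0 - mul_bound_fraction) * m1_ipc_ratio * clock_ratio

    # M1's total time is normalized to 1.0.
    return 1.0 / (t_mul + t_rest)


if __name__ == "__main__":
    for f in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"multiplier-bound share {f:.2f}: Zen 3 speedup {zen3_speedup(f):.3f}")
    print(f"gmpbench score ratio quoted above: {7337 / 6422:.3f}")
```

In this model the speedup grows from 4.9 / (3.2 * 1.5) when nothing is multiplier-bound up to the full clock ratio 4.9 / 3.2 when everything is, which is the direction of the effect described above.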
For operations like DGEMM, which can be executed at a speed close to the maximum floating-point fused-multiply-add rate, I expect to see the same behavior, with a larger advantage for Zen 3 than in SPEC or GB5, where M1 has an IPC about 1.5 times higher almost all the time. However, unlike for gmpbench, I have not yet seen an Apple M1 DGEMM (or Linpack) result.