By: Maynard Handley (name99.delete@this.name99.org), January 8, 2021 3:59 pm
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on January 8, 2021 2:21 pm wrote:
> hobold (hobold.delete@this.vectorizer.org) on January 8, 2021 12:44 pm wrote:
> > Adrian (a.delete@this.acm.org) on January 2, 2021 2:45 am wrote:
> >
> > [...]
> > > In computational benchmarks where the number and speed of the available execution resources matter
> > > most, unlike in GB5 or SPEC where the higher *average* IPC of Apple shines, the advantage of Zen
> > > 3 over Apple M1 increases, e.g. to over 14% @ 4.9 GHz for gmpbench (7337 vs. 6422).
> >
> > That's an interesting contradiction. Apple M1 does have
> > fewer and slower execution resources, but higher IPC.
> >
> > Pure speculation: Apple M1 can sometimes (rarely) execute a pair of dependent, simple
> > instructions within one single clock cycle. This would occasionally (rarely) gain one
> > cycle on the other processors that require two subsequent cycles for the dependency.
> >
> > Naturally this would be a very narrow, very targeted optimization. But there just might
> > be pairs that are frequent enough to be worth it. Maybe something like shift (by a constant)
> > and add. Generally, a chain of two-operand instructions, such that the whole chain does
> > not read / write more registers than what can be done in a single cycle.
>
>
> We do not know whether Apple M1 has fewer execution resources; it seems to have about the same
> resources as Intel, though maybe fewer than the new Zen 3 has for certain AVX instructions.
>
> However, the M1 resources are necessarily slower, due to 3.2 GHz vs. 4.8 ... 5.0 GHz.
>
> In the case of gmpbench, a non-negligible part of the execution time is spent
> doing one 64-bit multiplication per clock cycle in both AMD Zen 3 and Apple
> M1, with a few other instructions overlapped with the multiplications.
>
> Even if M1 has the potential to overlap more instructions than Zen, that does not matter for that part
> of the execution time, because its duration is determined by the back-to-back multiplications, so on
> Zen 3 it takes just 2/3 of the time it takes on M1. For that part of the benchmark, both M1 and Zen 3 have the
> same IPC, which cannot be increased when an execution resource is used at 100% of its capacity.
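
A quick sanity check of this saturation argument against the quoted numbers (a toy calculation, nothing more):

```python
# Sanity-check the clock-ratio argument against the quoted gmpbench scores.
zen3_ghz, m1_ghz = 4.9, 3.2
clock_ratio = zen3_ghz / m1_ghz        # bound if the multiplier is 100% busy
score_ratio = 7337 / 6422              # quoted gmpbench scores, Zen 3 / M1

print(f"clock ratio: {clock_ratio:.3f}")   # 1.531
print(f"score ratio: {score_ratio:.3f}")   # 1.142

# The overall win (14%) is far below the clock ratio (53%), which is what
# you'd expect if only part of the run is multiplier-saturated and M1's
# higher IPC claws back time everywhere else.
```
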
>
> On other parts of the benchmark, M1 succeeds in overlapping more instructions than Zen 3, i.e. it
> has a higher IPC, so Zen 3's advantage is lower than the clock frequency ratio, but still higher
> than in benchmarks where M1 can maintain a higher IPC during the entire benchmark.
>
> For operations like DGEMM, which can be executed at a speed close to the maximum floating-point
> fused-multiply-add rate, I expect to see the same behavior, with a higher advantage
> for Zen 3 than in SPEC or GB5, where M1 sustains an IPC about 1.5 times higher almost all the time. However,
> unlike for gmpbench, I have not yet seen an Apple M1 DGEMM (or Linpack) result.
Here's pretty much the only thing I've found so far:
https://nod.ai/comparing-apple-m1-with-amx2-m1-with-neon/
There's a bunch of back and forth about what NEON can achieve, but I think the bottom line is essentially ~250 GFLOPS (FP32 NEON) and twice that via AMX.
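For what it's worth, that ~250 GFLOPS figure is roughly consistent with a back-of-envelope peak, if (big if: this comes from third-party reverse engineering, not anything Apple has published) each Firestorm core has four 128-bit NEON FMA pipes:

```python
# Rough peak FP32 throughput for M1, under the assumption (from third-party
# reverse engineering, not an Apple spec) of 4 x 128-bit FMA pipes per P-core.
fma_pipes = 4          # assumed SIMD FMA units per performance core
fp32_lanes = 4         # 128-bit vector width / 32-bit floats
flops_per_fma = 2      # a fused multiply-add counts as two flops
ghz = 3.2
p_cores = 4

per_core = fma_pipes * fp32_lanes * flops_per_fma * ghz
total = per_core * p_cores
print(f"peak per P-core: {per_core:.1f} GFLOPS")   # 102.4
print(f"peak, 4 P-cores: {total:.1f} GFLOPS")      # 409.6

# ~250 GFLOPS measured would then be about 60% of peak -- plausible for a
# GEMM kernel that hasn't yet had years of hand tuning.
```
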
Do we have any sense of how much care and optimization has been put into GMP for ARMv8/NEON vs x86? Is it still at the basic "get the damn thing working" level, or has the sort of obsessive micro-optimization one (eventually) expects in these sorts of libraries already been applied?
You can see some of the flux that's still happening in this space in the above article, even with respect to a somewhat more mainstream library like BLAS.
There are similar (dense matrix multiplication) results here:
https://towardsdatascience.com/benchmark-m1-vs-xeon-vs-core-i5-vs-k80-and-t4-e3802f27421c
and
https://ramseyelbasheer.wordpress.com/2021/01/03/benchmark-m1-part-2-vs-20-cores-xeon-vs-amd-epyc-16-and-32-cores/
Of course, in a sense these are at the opposite extreme, going from very long chains of dependent multiplies to many independent short multiplies.
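
Either way, once the multiplier itself is the bottleneck, whether through a dependent chain or a saturated pipe, the time ratio should collapse to the clock ratio. A toy model with made-up latency/throughput numbers (illustrative assumptions, not measurements of either chip):

```python
# Toy model: time to retire n 64-bit multiplies in the two extreme regimes.
# Latency/throughput numbers are illustrative assumptions, not measurements.
def mul_seconds(n, freq_ghz, latency=5, per_cycle=1, dependent=True):
    """Seconds to retire n multiplies on a core running at freq_ghz."""
    cycles = n * latency if dependent else n / per_cycle
    return cycles / (freq_ghz * 1e9)

n = 10_000_000
for dep, label in [(True, "dependent chain"), (False, "independent stream")]:
    zen3 = mul_seconds(n, 4.9, dependent=dep)
    m1 = mul_seconds(n, 3.2, dependent=dep)
    print(f"{label}: Zen 3 / M1 time ratio = {zen3 / m1:.3f}")

# In both regimes the ratio is just 3.2/4.9 ~ 0.653: once the multiplier
# limits progress, a wider machine gains nothing and clock speed decides.
```
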