By: Maynard Handley (name99.delete@this.name99.org), January 8, 2021 4:35 pm
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on January 8, 2021 4:34 pm wrote:
> Maynard Handley (name99.delete@this.name99.org) on January 8, 2021 3:59 pm wrote:
> > Adrian (a.delete@this.acm.org) on January 8, 2021 2:21 pm wrote:
> > > hobold (hobold.delete@this.vectorizer.org) on January 8, 2021 12:44 pm wrote:
> > > > Adrian (a.delete@this.acm.org) on January 2, 2021 2:45 am wrote:
> > > >
> > > > [...]
> > > > > In computational benchmarks where the number and speed of the available execution resources matter
> > > > > most, unlike in GB5 or SPEC, where the higher *average* IPC of Apple shines, the advantage of Zen
> > > > > 3 over Apple M1 increases, e.g. to over 14% @ 4.9 GHz for gmpbench (7337 vs. 6422).
> > > >
> > > > That's an interesting contradiction. Apple M1 does have
> > > > fewer and slower execution resources, but higher IPC.
> > > >
> > > > Pure speculation: Apple M1 can sometimes (rarely) execute a pair of dependent, simple
> > > > instructions within a single clock cycle. This would occasionally (rarely) gain one
> > > > cycle over the other processors, which require two consecutive cycles for the dependency.
> > > >
> > > > Naturally this would be a very narrow, very targeted optimization. But there just might
> > > > be pairs that are frequent enough to be worth it. Maybe something like shift (by a constant)
> > > > and add. Generally, a chain of two-operand instructions, such that the whole chain does
> > > > not read/write more registers than can be handled in a single cycle.
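
As an aside, here is a minimal sketch of the kind of pair hobold describes (this assumes nothing
about what M1 actually fuses): for a shift-by-constant feeding an add in particular, ARMv8 already
has a single-instruction form, so that pairing costs one cycle by construction, no fusion needed.

#include <stdint.h>

/* A shift (by a constant) feeding an add. x86's LEA only scales by
   1/2/4/8, so a shift by 5 costs two dependent instructions there,
   while AArch64 expresses the whole pair as one shifted-operand add:
   add x0, x0, x1, lsl #5. */
uint64_t scaled_offset(uint64_t base, uint64_t i) {
    return base + (i << 5);
}
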
> > >
> > >
> > > We do not know whether Apple M1 has fewer execution resources; it seems to have about the same
> > > resources as Intel, though perhaps fewer than the new Zen 3 has for certain AVX instructions.
> > >
> > > However, the M1 resources are necessarily slower, due to 3.2 GHz vs. 4.8–5.0 GHz.
> > >
> > > In the case of gmpbench, a non-negligible part of the execution time is spent
> > > doing one 64-bit multiplication per clock cycle in both AMD Zen 3 and Apple
> > > M1, with a few other instructions overlapped with the multiplications.
> > >
> > > Even if M1 has the potential to overlap more instructions than Zen, it does not matter for that part
> > > of the execution time, because its duration is determined by the back-to-back multiplications, so on
> > > Zen 3 it takes just 2/3 of the time it takes on M1 (the inverse of the 4.8 GHz : 3.2 GHz clock ratio).
> > > For that part of the benchmark, both M1 and Zen 3 have the same IPC, which cannot be increased
> > > when an execution resource is already used at 100% of its capacity.
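
To make that concrete, here is a minimal sketch of the loop shape Adrian is describing, in the
spirit of GMP's mpn_mul_1 (an illustration, not GMP's actual code): one widening 64x64->128-bit
multiply per limb, plus carry propagation.

#include <stdint.h>

/* Multiply an n-limb number by a single 64-bit limb, GMP-style.
   With exactly one 64-bit multiply issuing per cycle on both Zen 3
   and M1, this loop is pinned at the multiplier's throughput, so
   extra out-of-order width buys nothing and runtime scales with
   clock frequency alone. */
uint64_t mul_1(uint64_t *rp, const uint64_t *up, int n, uint64_t v) {
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)up[i] * v + carry;
        rp[i] = (uint64_t)p;           /* low half is the result limb */
        carry = (uint64_t)(p >> 64);   /* high half carries into the next limb */
    }
    return carry;
}
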
> > >
> > > On other parts of the benchmark, M1 succeeds in overlapping more instructions than Zen 3, i.e. it
> > > has a higher IPC, so Zen 3's advantage there is smaller than the clock-frequency ratio, though still
> > > larger than in benchmarks where M1 can maintain its higher IPC during the entire benchmark.
> > >
> > > For operations like DGEMM, which can be executed at a speed close to the floating-point
> > > fused-multiply-add maximum rate, I expect to see the same behavior, with a higher advantage
> > > for Zen 3 than in SPEC or GB5, where M1 has an IPC about 1.5 times higher almost all the time.
> > > However, unlike for gmpbench, I have not yet seen an Apple M1 DGEMM (or Linpack) result.
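
For reference, the reason DGEMM behaves this way is that its inner loop is essentially nothing but
fused multiply-adds on independent elements. A deliberately naive sketch of the loop structure (real
BLAS kernels are blocked and vectorized, but the arithmetic shape is the same):

/* C += A * B for n x n row-major matrices. The j-loop body is one
   FMA per element with no dependent chain across iterations, so a
   core already running near its peak FMA rate gains nothing from
   extra issue width; throughput is FMA units x vector width x clock. */
void dgemm_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}
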
> >
> > Here's pretty much the only thing I've found so far:
> >
> > https://nod.ai/comparing-apple-m1-with-amx2-m1-with-neon/
> >
> > There's a bunch of back and forth about what NEON can achieve, but I think the
> > bottom line is essentially ~250 GFLOPS (FP32 NEON) and twice that via AMX.
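
As a sanity check on that figure, a back-of-envelope peak calculation (the pipe counts below are
assumptions drawn from third-party microbenchmarking, not published Apple specs):

#include <stdio.h>

int main(void) {
    /* All of these are assumptions, not published Apple specs. */
    double cores = 4.0;         /* Firestorm (performance) cores        */
    double pipes = 4.0;         /* 128-bit FMA-capable NEON pipes/core  */
    double lanes = 4.0;         /* FP32 lanes per 128-bit vector        */
    double flops_per_fma = 2.0; /* a fused multiply-add counts as two   */
    double ghz = 3.2;
    /* 4 * 4 * 4 * 2 * 3.2 = 409.6 GFLOPS theoretical FP32 peak */
    printf("FP32 peak: %.1f GFLOPS\n", cores * pipes * lanes * flops_per_fma * ghz);
    return 0;
}

If those assumptions hold, ~250 GFLOPS would be roughly 60% of theoretical peak, which is believable
for SGEMM through a not-yet-fully-tuned BLAS.
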
> >
> > Do we have any sense of how much care and optimization has been put into GMP for ARMv8/NEON
> > vs x86? Is it still at the basic "get the damn thing working" level, or has the sort of obsessive
> > micro-optimization one (eventually) expects in these sorts of libraries already been applied?
> > You can see some of the flux that's still happening in this space in the above
> > article, even with respect to a somewhat more mainstream library like BLAS.
> >
> >
> > There are similar (dense matrix multiplication) results here:
> > https://towardsdatascience.com/benchmark-m1-vs-xeon-vs-core-i5-vs-k80-and-t4-e3802f27421c
> > and
> > https://ramseyelbasheer.wordpress.com/2021/01/03/benchmark-m1-part-2-vs-20-cores-xeon-vs-amd-epyc-16-and-32-cores/
> >
> > Of course in a sense these are the opposite extreme, going from dependent
> > very long multiplies to independent short multiplies.
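
Purely as an illustration of that contrast: the first loop below is latency-bound (each multiply
waits for the previous result, as in multi-precision arithmetic), while the second is
throughput-bound (all multiplies independent, as in matrix kernels); a wide out-of-order core helps
the second far more than the first.

#include <stdint.h>

/* Latency-bound: each multiply consumes the previous result, so the
   loop runs at multiply *latency* per iteration, not throughput. */
uint64_t dependent_product(const uint64_t *x, int n) {
    uint64_t acc = 1;
    for (int i = 0; i < n; i++)
        acc *= x[i];
    return acc;
}

/* Throughput-bound: iterations are independent, so an out-of-order
   core can keep every multiplier pipe busy each cycle. */
void independent_products(const uint64_t *x, const uint64_t *y,
                          uint64_t *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = x[i] * y[i];
}
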
> >
>
> This article covers a range of scientific software from the viewpoint of a practitioner who
> just wants things to work. The summary is that a surprising amount of stuff has already been ported,
> what's left is mostly in progress, and the results are almost always remarkably impressive.
> There are, if you scroll to the very end, SGEMM/DGEMM results for very large matrices,
> with M1 doing somewhat better than an AMD 3900X (which is, I think, 12-core).
Goddamn the lack of editing. Forgot the URL:
https://github.com/neurolabusc/AppleSiliconForNeuroimaging