By: Adrian (a.delete@this.acm.org), January 9, 2021 12:48 am
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on January 8, 2021 3:59 pm wrote:
> Adrian (a.delete@this.acm.org) on January 8, 2021 2:21 pm wrote:
> > However,
> > unlike for gmpbench, I have not seen yet an Apple M1 DGEMM (or Linpack) result.
>
> Here's pretty much the only thing I've found so far:
>
> https://nod.ai/comparing-apple-m1-with-amx2-m1-with-neon/
>
> There's a bunch of back and forth about what NEON can achieve, but I think the
> bottom line is essentially ~250GFLOPS (FP32 NEON) and twice that via AMX.
>
The numbers for SGEMM with NEON seem to be in the same ballpark as those for a 4-core Intel CPU, while the AMX2 numbers appear similar to those for a 4-core Intel CPU with full AVX-512, or for an 8-core AMD CPU. A desktop AMD CPU is much faster, obviously at a higher power consumption.

I wonder whether the numbers for DGEMM are only half of those for SGEMM, as on Intel/AMD, or whether they are worse than that (especially for AMX2).
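For reference, here is a minimal sketch of how such GFLOPS figures are usually derived: time one large SGEMM call and divide the ~2*N^3 floating-point operations by the elapsed time. This assumes a standard CBLAS interface (e.g. OpenBLAS, or Apple's Accelerate on the M1); the matrix size and initialization here are arbitrary choices for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void) {
    const int n = 2048;
    float *a = malloc(sizeof(float) * n * n);
    float *b = malloc(sizeof(float) * n * n);
    float *c = malloc(sizeof(float) * n * n);
    for (int i = 0; i < n * n; i++) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* C = 1.0 * A * B + 0.0 * C on NxN matrices: ~2*N^3 flops */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("SGEMM %dx%d: %.3f s, %.1f GFLOPS\n",
           n, n, secs, 2.0 * n * n * n / secs * 1e-9);

    free(a); free(b); free(c);
    return 0;
}

(Whether such a call hits NEON or the AMX units on the M1 depends on the BLAS implementation, which is exactly the flux the nod.ai article describes.)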
> Do we have any sort of sense as to how much GMP has had care and optimization put into it for ARMv8/NEON
> vs x86? Is it still at the basic "get the damn thing working" level, or has the sort of obsessive micro-optimization
> one (eventually) expects in these sorts of libraries already been applied?
> You can see some of the flux that's still happening in this space in the above
> article, even with respect to a somewhat more mainstream library like BLAS.
>
>
It seems that recent versions already have good ARM optimizations, but they are probably still improving.

Even for AMD Zen 3 the optimizations keep improving: testing my CPU with an older libgmp version gave a result of 7337, while a newer, not-yet-released libgmp version has reached 7816.
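Since those scores are from gmpbench, here is a rough sketch of the kind of primitive it stresses: a large-integer multiply with libgmp (link with -lgmp). The operand size and iteration count below are arbitrary choices for illustration; gmpbench's actual operand sizes and scoring formula are defined by the benchmark itself.

#include <stdio.h>
#include <time.h>
#include <gmp.h>

int main(void) {
    gmp_randstate_t rng;
    gmp_randinit_default(rng);

    mpz_t a, b, r;
    mpz_inits(a, b, r, NULL);
    mpz_urandomb(a, rng, 1000000);   /* two random million-bit operands */
    mpz_urandomb(b, rng, 1000000);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 100; i++)
        mpz_mul(r, a, b);            /* the hand-tuned assembly lives under here */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("100 x 1M-bit mpz_mul: %.3f s\n", secs);

    mpz_clears(a, b, r, NULL);
    gmp_randclear(rng);
    return 0;
}

The version-to-version gains come from exactly this layer: the per-CPU mpn assembly kernels that a call like mpz_mul dispatches to.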