By: Adrian (a.delete@this.acm.org), January 11, 2021 4:34 am
Room: Moderated Discussions
none (none.delete@this.none.com) on January 11, 2021 1:26 am wrote:
> Adrian (a.delete@this.acm.org) on January 9, 2021 8:00 am wrote:
> [...]
> > So Zen 3 is around 22% faster than Apple M1 at bignum arithmetic,
> > due to the reasons mentioned in a previous
> > post, i.e. that when an execution resource reaches 100% utilization, the IPC remains clamped at the same
> > value for both M1 and Zen 3, and then the CPU with the higher clock frequency is advantaged.
>
> According to this GMP table M1 has a better IPC than Zen3 for many base routines.
>
> For instance addmul_1 is 1.5 cycle/limb on Zen 3 while it is 1.25 on M1. As far as M1 goes
> this is slightly above the theoretical number of multiplications it can issue (that is
> 2 64x64b->64b (high or low part) integer muls per cycles, so one 64bx64b->128 per cycle).
> mul_1 is at one cycle/limb (vs 1.5 for Zen 3) so that's at the max already, so it's quite
> likely addmul_1 is already at the max due to the rest of computations.
>
> My understanding is that Zen 2 can issue a single 64bx64b->128 per cycle too. So Zen 2 (and
> I assume Zen 3) don't saturate their multipliers contrary to M1 on mul_1.
I have also said that Apple M1 still has a better average IPC than Zen 3 even in gmpbench. That is obvious from the fact that Zen 3 is only 22% faster (5800X) or 24% faster (5900X; my 5900X scores 7951, compared to 7816 for the 5800X in that table), while its clock frequency is 50% (5800X) to 53% (5900X) higher than that of M1.
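As a quick check of that arithmetic, here is a minimal sketch. It assumes the gmpbench score scales as IPC times clock frequency, and it uses only the percentages quoted above; the function name `ipc_ratio` is just for illustration.

```python
def ipc_ratio(speed_ratio, clock_ratio):
    """Return (M1 IPC) / (Zen 3 IPC).

    speed_ratio: Zen 3 score / M1 score (e.g. 1.22 for "22% faster").
    clock_ratio: Zen 3 clock / M1 clock (e.g. 1.50 for "50% higher").
    Assumes score is proportional to IPC * clock frequency.
    """
    return clock_ratio / speed_ratio


# Ratios taken from the figures quoted in the post.
for name, speed, clock in [("5800X", 1.22, 1.50), ("5900X", 1.24, 1.53)]:
    advantage = (ipc_ratio(speed, clock) - 1.0) * 100.0
    print(f"{name}: M1 IPC advantage ~ {advantage:.0f}%")
# prints:
# 5800X: M1 IPC advantage ~ 23%
# 5900X: M1 IPC advantage ~ 23%
```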
However, the IPC advantage of Apple M1 is reduced from around 50% in benchmarks like SPEC or Geekbench to only around 23% in benchmarks like gmpbench.
My theory is that this happens because both M1 and Zen 3 can complete only a single 64bx64b->128 multiplication per cycle, as you mentioned.
For a fraction of the duration of any gmpbench test, apparently about half of it, the multiplier is used at close to 100%, and both CPUs have the same IPC. For the rest of the benchmark duration, M1 has its typical IPC, 50% higher than Zen 3. That should lead to the observed average IPC advantage of only around 23% in gmpbench.
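One way to make this estimate concrete is the rough model below. It is only a sketch under my own assumptions: `f` is taken to be the fraction of Zen 3 cycles that are multiplier-bound, M1 needs the same number of cycles for that part, and M1 needs 1.5 times fewer cycles for the remainder. The function names are hypothetical.

```python
def average_ipc_ratio(f, unbound_ratio=1.5):
    """Average (M1 IPC) / (Zen 3 IPC) over the whole run.

    f: fraction of Zen 3 cycles spent multiplier-bound (same IPC on both).
    unbound_ratio: M1/Zen 3 IPC ratio in the remaining, non-bound part.
    Zen 3 cycles are normalized to 1; M1 cycles are f + (1 - f) / unbound_ratio.
    """
    m1_cycles = f + (1.0 - f) / unbound_ratio
    return 1.0 / m1_cycles


def bound_fraction(target_ratio, unbound_ratio=1.5):
    """Invert the model: which f gives the observed average IPC ratio?"""
    inv_r = 1.0 / unbound_ratio
    return (1.0 / target_ratio - inv_r) / (1.0 - inv_r)


print(f"{average_ipc_ratio(0.5):.2f}")  # prints 1.20
print(f"{bound_fraction(1.23):.2f}")    # prints 0.44
```

Under these assumptions, a multiplier-bound fraction of exactly one half gives a 20% average advantage, and the observed 23% corresponds to a fraction of roughly 0.44, so "about a half" is the right order of magnitude.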
The periods when the speed is limited by back-to-back multiplications, and the periods when there is slack between multiplications (so that M1 can overlap more instructions there than Zen 3), are probably interleaved inside each limb multiplication iteration, if neither CPU reaches the maximum possible multiplication rate.
If you have another theory, please explain it.