By: Jan Wassenberg (jan.wassenberg.delete@this.gmail.com), May 29, 2022 11:20 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on May 29, 2022 2:15 am wrote:
> Can't say that I understood what compared to what and what is quicker. And how many cores were running.
This was the single-threaded benchmark https://github.com/google/highway/blob/master/hwy/contrib/sort/bench_sort.cc, running mostly the same Quicksort code with AVX-512 on SKX and with NEON on M1. Partitioning is quite efficient with AVX-512, but on NEON it has to be emulated using VTBL. Conversely, the sorting network for AVX-512 has to handle 8 elements per vector, whereas NEON only has 2, so the constant factors are lower on NEON.
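To make the partitioning point concrete, here is a minimal scalar sketch of a compress-based partition step. It is not the Highway code: the names (kLanes, CompressStore, PartitionSketch) are made up for illustration, 64-bit keys are assumed so that 8 fit in a 512-bit vector, and it works out of place for clarity. With AVX-512 the inner CompressStore loop corresponds roughly to a single compress instruction, whereas on NEON it has to be built from a table lookup with indices derived from the mask.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 8 x 64-bit keys per 512-bit vector (hypothetical constant).
constexpr std::size_t kLanes = 8;

// Scalar stand-in for a vector compress-store: writes the lanes selected by
// `mask` contiguously to `out` and returns how many lanes were written.
std::size_t CompressStore(const std::array<int64_t, kLanes>& v, unsigned mask,
                          int64_t* out) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < kLanes; ++i) {
    if (mask & (1u << i)) out[n++] = v[i];
  }
  return n;
}

// Reorders `keys` so that all keys < pivot come first; returns the index of
// the first key >= pivot. Out of place, unlike a real in-place Quicksort.
std::size_t PartitionSketch(std::vector<int64_t>& keys, int64_t pivot) {
  std::vector<int64_t> left(keys.size()), right(keys.size());
  std::size_t num_left = 0, num_right = 0;

  for (std::size_t pos = 0; pos < keys.size(); pos += kLanes) {
    // "Load" one vector; the final one may be only partially valid.
    const std::size_t valid = std::min(kLanes, keys.size() - pos);
    std::array<int64_t, kLanes> v{};
    unsigned mask = 0;  // bit i set means lane i is less than the pivot
    for (std::size_t i = 0; i < valid; ++i) {
      v[i] = keys[pos + i];
      if (v[i] < pivot) mask |= 1u << i;
    }
    const unsigned valid_mask = (1u << valid) - 1;

    // One compress-store per side: selected lanes are packed together.
    num_left += CompressStore(v, mask, left.data() + num_left);
    num_right += CompressStore(v, ~mask & valid_mask, right.data() + num_right);
  }

  // Copy both sides back: smaller keys first, then the rest.
  std::copy(left.begin(), left.begin() + num_left, keys.begin());
  std::copy(right.begin(), right.begin() + num_right, keys.begin() + num_left);
  return num_left;
}
```

Calling PartitionSketch(keys, pivot) on a vector returns the boundary index, and Quicksort would then recurse on the two sides. The per-vector cost is dominated by how cheaply the compress step can be done, which is where the two instruction sets differ.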
> As to original question, Gracemont has enough EUs to do nearly all AVX-512 OPs
> at rate of 1 per 2 clocks. Is it considered "quad-pumped" or "double-pumped"?
Probably still quad? SKX routinely executes 2 vector instructions per cycle.
> But, then again, are these shuffles running at the same speed as other OPs on Golden Cove?
There they already have 3-cycle latency and single-cycle throughput, as on SKX.
Blocking port 5 for 4 cycles, plus 3(?) more for merging, would be quite a slowdown.
> Do they want it in the future or would they prefer it die?
Yes, that is a very interesting question.
> But from technical perspective it seems to me that letting AVx-512 to die is better plan.
That's surprising. How many developers will be enthusiastic about developing for a bolted-on AVX2?
> But overall, something like AMX looks like better (than AVX-512) plan for those who need a
> lot of FLOPs
Perhaps. But not everything is a bf16 matmul.