By: Michael S (already5chosen.delete@this.yahoo.com), May 29, 2022 1:38 pm
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 29, 2022 11:20 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on May 29, 2022 2:15 am wrote:
> > Can't say that I understood what compared to what and what is quicker. And how many cores were running.
> This was (single-threaded) https://github.com/google/highway/blob/master/hwy/contrib/sort/bench_sort.cc,
> running mostly the same Quicksort code for AVX-512 on SKX and NEON for M1. Partitioning is quite
> efficient with AVX-512 but emulated on NEON using VTBL. Conversely, the sorting network for AVX-512
> has to handle 8 elements, whereas NEON only 2, so constant factors are lower.
>
A bit too much templatism to figure out [at this late hour] what exactly is sorted, what the key is and what the payload is. Especially the latter.
But if a key-payload pair is what I think it is, then it looks more like an interesting exercise than
something practical.
For relatively short numeric keys with pointer-sized (or smaller) payloads, quicksort is rarely the
right algorithm. More often than not it is soundly beaten by variations of radix sort.
Maybe there are ranges of N that are both big enough for O(N*logN) quicksort to be better than O(N*N) algorithms like straight insertion, and at the same time too small for O(N) radix sorts, but I didn't encounter them in practice.
> > As to original question, Gracemont has enough EUs to do nearly all AVX-512 OPs
> > at rate of 1 per 2 clocks. Is it considered "quad-pumped" or "double-pumped"?
> Probably still quad? SKX routinely executes 2 vector instructions per cycle.
>
> > But, then again, are these shuffles running at the same speed as other OPs on Golden Cove?
> There it's already 3 cycles latency, single cycle throughput, like it was on SKX.
> Blocking port5 for 4 cycles plus 3? for merging would be quite a slowdown.
>
> > Do they want it in the future or would they prefer it die?
> Yes, that is a very interesting question.
>
> > But from technical perspective it seems to me that letting AVx-512 to die is better plan.
> That's surprising. How many developers will be enthusiastic about developing for a bolted-on AVX2?
>
> > But overall, something like AMX looks like better (than AVX-512) plan for those who need a
> > lot of FLOPs
> Perhaps. But not everything is a bf16 matmul.
AMX as an idea, not as a particular implementation. Certainly with support for single precision.
Likely with support for double.
And while not everything is matmul, a lot of "dense" compute-intensive things can be turned into the likes of matmul.
And if your workload is not "dense", then you are probably limited by the bandwidth of one or another cache/memory level and can't take advantage of the FLOPs provided by good old AVX+FMA, much less by AVX-512.
Yes, I know, my analysis here is too black & white :( Not only are the colors missing, so are the shades of gray. But at least I am not at risk of overlooking the forest for the trees :-)