By: Michael S (already5chosen.delete@this.yahoo.com), May 29, 2022 2:15 am
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 29, 2022 12:49 am wrote:
> Simon Farnsworth (simon.delete@this.farnz.org.uk) on May 25, 2022 2:27 am wrote:
> > [..] you can do
> > an arbitrary horizontal operation like that by having each of the small vectors output its "share"
> > of the final 512 bit result into 4x 128 bit intermediates, then a combine step (probably multi-cycle
> > in itself) that takes the 16 intermediates and merges them back into the final 512 bit vector.
> >
> > This would mean a 20 cycle operation done naïvely, but gets you slow and dependable AVX-512 on the E cores
> > - and if performance matters, you're running on the P cores anyway. The optimization work would be to make
> > those instructions energy efficient, for which you might introduce special cases - a full 512 bit ALU is a
> > lot more costly in power and area than hardware dedicated to doing compress/expand/permute and nothing else.
>
> I wonder how power-efficient that would be. One point of comparison is that M1's 4x128 NEON runs our Quicksort
> at about half the speed of SKX AVX-512. This doesn't entirely vindicate smaller vectors though, because
> M1's clock frequency and single-core memory bandwidth are higher, and the constant factors for 128-bit sorting
> networks are smaller (so we're not actually comparing 512-bit with quad-pumped 512-bit).
Can't say that I understood what was compared to what, and which was quicker. And how many cores were running?
>
> To be clear, I'd still rather have your "quad-pumped AVX-512 on E cores" than nothing or AVX2.
> Even better if it has actual 512-bit shuffle networks. The question is: who can say what kind
> of hardware we are actually going to get? And what if, as Brendan(?) says, one feature (think
> TSX) has to be disabled on a certain type of core? Should we then disable it on all, or do
> some legwork in the scheduler to honor a "don't move between CPU type" request?
In the specific case of TSX the answer (Yes!!!) is clear.
But by now, hopefully, TSX is solidly dead in client CPUs. It should have been that way from the beginning.
As to the original question, Gracemont has enough EUs to do nearly all AVX-512 OPs at a rate of 1 per 2 clocks. Would that count as "quad-pumped" or "double-pumped"?
Shuffles that cross 128-bit lanes are, of course, harder, and would likely have to be done by microcode.
But then again, do those shuffles run at the same speed as other OPs even on Golden Cove?
Overall, my answer depends on Intel's general vision for AVX-512. Do they want it in the future, or would they prefer that it die? In the former case, narrow implementations on E cores make perfect sense [except that I don't think mixing E and P cores on the same die makes sense, but that's a separate war].
But from a technical perspective it seems to me that letting AVX-512 die is the better plan.
Some of the good ideas of AVX-512 (but certainly *not* universal predication) can be bolted onto AVX2.
But overall, something like AMX looks like a better plan (than AVX-512) for those who need a lot of FLOPs relatively close to the main cores. And if AMX is implemented as shared accelerator[s], as it almost certainly should be, it solves the E-vs-P problem in the best possible manner.