By: Jan Wassenberg (jan.wassenberg.delete@this.gmail.com), May 29, 2022 12:49 am
Room: Moderated Discussions
Simon Farnsworth (simon.delete@this.farnz.org.uk) on May 25, 2022 2:27 am wrote:
> [..] you can do
> an arbitrary horizontal operation like that by having each of the small vectors output its "share"
> of the final 512 bit result into 4x 128 bit intermediates, then a combine step (probably multi-cycle
> in itself) that takes the 16 intermediates and merges them back into the final 512 bit vector.
>
> This would mean a 20 cycle operation done naïvely, but gets you slow and dependable AVX-512 on the E cores
> - and if performance matters, you're running on the P cores anyway. The optimization work would be to make
> those instructions energy efficient, for which you might introduce special cases - a full 512 bit ALU is a
> lot more costly in power and area than hardware dedicated to doing compress/expand/permute and nothing else.
I wonder how power-efficient that would be. One point of comparison is that M1's 4x128 NEON runs our Quicksort at about half the speed of SKX AVX-512. This doesn't entirely vindicate smaller vectors though, because M1's clock frequency and single-core memory bandwidth are higher, and the constant factors for 128-bit sorting networks are smaller (so we're not actually comparing 512-bit with quad-pumped 512-bit).
To be clear, I'd still rather have your "quad-pumped AVX-512 on E cores" than nothing or AVX2. Even better if it has actual 512-bit shuffle networks. The question is: who can say what kind of hardware we are actually going to get? And what if, as Brendan(?) says, one feature (think TSX) has to be disabled on a certain type of core? Should we then disable it on all, or do some legwork in the scheduler to honor a "don't move between CPU type" request?