By: Simon Farnsworth (simon.delete@this.farnz.org.uk), May 25, 2022 2:27 am
Room: Moderated Discussions
Jörn Engel (joern.delete@this.purestorage.com) on May 25, 2022 12:09 am wrote:
> Simon Farnsworth (simon.delete@this.farnz.org.uk) on May 24, 2022 2:14 pm wrote:
> >
> > There's a simpler reason why it makes no sense - it's not that difficult in hardware to use a narrow
> > vector ALU (128 bit, say) and multiple clock cycles to do wide operations.
>
> Mostly, yes. For the majority of instructions, you can treat each 128-bit lane independently.
> There are a few troublemakers like compress, expand and permute that cannot trivially be turned
> into a loop over 4 independent 128-bit lanes. Those exceptions may cost as much developer time
> as everything else combined. They probably don't cost too much space or power on the chip.
Validation is likely to be the hard part for those instructions. At the trivial level, you can do an arbitrary horizontal operation like that by having each of the narrow lanes output its "share" of the final 512-bit result as 4x 128-bit intermediates, then running a combine step (probably multi-cycle in itself) that takes the 16 intermediates and merges them back into the final 512-bit vector.
Done naïvely, this would mean a 20-cycle operation, but it gets you slow and dependable AVX-512 on the E cores - and if performance matters, you're running on the P cores anyway. The optimization work would be to make those instructions energy efficient, for which you might introduce special cases - a full 512-bit ALU is a lot more costly in power and area than hardware dedicated to doing compress/expand/permute and nothing else.
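To make that dataflow concrete, here's a rough scalar C model of the scheme for a 512-bit compress of 32-bit elements. The function name and layout are mine and purely illustrative - this isn't real intrinsics or actual hardware, just a sketch of the lane-by-lane idea: each 128-bit source lane scatters its surviving elements into its own set of 4x 128-bit intermediates, and the combine step ORs the 16 intermediates into the final result.

/* Scalar model of the lane-by-lane 512-bit compress described above.
 * Illustrative only: 16x 32-bit elements (one 512-bit register) are
 * split into 4 lanes of 4 elements. Each source lane writes its "share"
 * of the result into 4x 128-bit intermediates (the positions it owns in
 * each output lane), then a combine step ORs the 16 intermediates. */
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t e[4]; } lane128;   /* one 128-bit lane */

void compress512_model(const uint32_t src[16], uint16_t mask, uint32_t dst[16])
{
    lane128 inter[4][4];                     /* [source lane][output lane] */
    memset(inter, 0, sizeof inter);

    for (int sl = 0; sl < 4; sl++) {         /* one pass per 128-bit source lane */
        /* This lane's first surviving element lands at an offset equal to
         * the popcount of the mask bits belonging to earlier lanes. */
        int pos = __builtin_popcount(mask & ((1u << (sl * 4)) - 1));
        for (int i = 0; i < 4; i++) {
            if (mask & (1u << (sl * 4 + i))) {
                inter[sl][pos / 4].e[pos % 4] = src[sl * 4 + i];
                pos++;
            }
        }
    }

    /* Combine step: merge the 16 intermediates into the 512-bit result.
     * Each output slot is written by at most one source lane, so OR works;
     * slots past the compressed count stay zero. */
    for (int ol = 0; ol < 4; ol++)
        for (int i = 0; i < 4; i++)
            dst[ol * 4 + i] = inter[0][ol].e[i] | inter[1][ol].e[i]
                            | inter[2][ol].e[i] | inter[3][ol].e[i];
}

In hardware terms, the outer loop roughly corresponds to the four trips through the narrow ALU and the OR at the end to the multi-cycle combine step.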