By: Jörn Engel (joern.delete@this.purestorage.com), May 25, 2022 12:09 am
Room: Moderated Discussions
Simon Farnsworth (simon.delete@this.farnz.org.uk) on May 24, 2022 2:14 pm wrote:
>
> There's a simpler reason why it makes no sense - it's not that difficult in hardware to use a narrow
> vector ALU (128 bit, say) and multiple clock cycles to do wide operations.
Mostly, yes. For the majority of instructions, you can treat each 128-bit lane independently. There are a few troublemakers like compress, expand and permute that cannot trivially be turned into a loop over 4 independent 128-bit lanes. Those exceptions may cost as much developer time as everything else combined. They probably don't cost too much space or power on the chip.
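To make the distinction concrete, here is a minimal software sketch, not a description of any real implementation. It assumes a 512-bit vector of sixteen 32-bit elements split into four 128-bit lanes; the function names (`add_by_lane`, `permute`, `compress`, `lanes_read`) are made up for illustration, and `compress` assumes zero-filling of the unused tail. The add only ever touches one lane per pass, while permute and compress need data or mask bits from other lanes.

```python
LANE = 4                  # 32-bit elements per 128-bit lane (assumed)
LANES = 4                 # lanes per 512-bit vector (assumed)
WIDTH = LANE * LANES      # 16 elements in total

def add_by_lane(a, b):
    """Element-wise add done as four passes, one 128-bit lane per pass."""
    out = []
    for lane in range(LANES):
        lo = lane * LANE
        # This pass reads only its own lane of a and b.
        out.extend((x + y) & 0xFFFFFFFF
                   for x, y in zip(a[lo:lo + LANE], b[lo:lo + LANE]))
    return out

def permute(a, idx):
    """Full-width permute: any output element may come from any input lane."""
    return [a[i % WIDTH] for i in idx]

def compress(a, mask):
    """Pack the selected elements to the front, zero-filling the rest.
    An element's output slot depends on the mask bits of all lower lanes."""
    kept = [x for x, m in zip(a, mask) if m]
    return kept + [0] * (WIDTH - len(kept))

def lanes_read(idx):
    """For each output lane, the set of input lanes a permute has to read."""
    return [{(i % WIDTH) // LANE for i in idx[l * LANE:(l + 1) * LANE]}
            for l in range(LANES)]

if __name__ == "__main__":
    a = list(range(WIDTH))
    # A stride-4 gather pattern: output lane 0 wants elements 0, 4, 8, 12.
    stride = [4 * (i % LANE) + i // LANE for i in range(WIDTH)]
    print(add_by_lane(a, [1] * WIDTH))
    print(permute(a, stride))
    print(lanes_read(stride))
    print(compress(a, [i % 2 for i in range(WIDTH)]))
```

With the stride pattern, every output lane needs to read all four input lanes, so a narrow ALU cannot simply process lane 0, then lane 1, and so on. Compress has the same flavor of problem: where lane 3's elements land depends on how many mask bits were set in lanes 0 through 2.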