By: Adrian (a.delete@this.acm.org), November 8, 2022 12:45 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on November 8, 2022 10:35 am wrote:
> Adrian (a.delete@this.acm.org) on November 8, 2022 8:53 am wrote:
> >
> > I am skeptical that AMD has chosen the variant with sequential processing of the halves, because
> > that creates problems for the few instructions that need to access both halves.
>
> So I would actually love to hear that that is what AMD does,
> because I think it's conceptually a lovely model.
>
> It's literally the original traditional vector model, where you treat vectors not as one thing, but as a
> sequence of things. That's a model that actually scales, in that it doesn't penalize the smaller case.
>
> I'm certainly on record as not being a huge fan of AVX512, but any implementation
> that makes the effort to also scale down is a good implementation in my book.
>
Actually, you have a point here: all AMD GPUs implement wide-vector operations in some form of serial execution through a narrower pipeline.
The fact that AMD could easily have reused that GPU experience in Zen 4 increases the likelihood that they have a serial implementation of the AVX-512 operations.
Nevertheless, there is an important difference between Zen 4 and either the AMD GPUs or a cheaper CPU core, like Intel's Gracemont.
Someone designing from scratch a cheaper CPU core, or a throughput-oriented core, that has to implement an ISA with wide vectors would certainly do exactly as you say: choose an execution-pipeline width determined by the desired cost, which could be as low as 64 bits or even 32 bits, and then implement operations of any vector width specified in the ISA by serial execution. This minimizes both area and power consumption.
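To make the serial model concrete, here is a minimal sketch in C (purely illustrative, not AMD's actual microarchitecture; the names vec256, exec_256bit_add, etc. are my own): a 512-bit lane-wise add is cracked into two 256-bit "micro-ops" that reuse the same narrow datapath on consecutive cycles.

```c
#include <stdint.h>

/* Hypothetical sketch of serial half-at-a-time execution.
 * A 512-bit register is modeled as two 256-bit halves, each
 * holding 4 x 64-bit lanes. */

typedef struct { uint64_t lane[4]; } vec256;   /* 4 x 64-bit lanes */
typedef struct { vec256 half[2]; } vec512;     /* low half, high half */

/* The one physical 256-bit execution pipe. */
static vec256 exec_256bit_add(vec256 a, vec256 b) {
    vec256 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Serial variant: the same 256-bit pipe is used twice,
 * one half per cycle. */
static vec512 add512_serial(vec512 a, vec512 b) {
    vec512 r;
    r.half[0] = exec_256bit_add(a.half[0], b.half[0]); /* cycle n   */
    r.half[1] = exec_256bit_add(a.half[1], b.half[1]); /* cycle n+1 */
    return r;
}
```

The point of the model is that the datapath cost is fixed by the half width, and wider ISA vectors only cost extra cycles, not extra area.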
However, in Zen 4 AMD already had pairs of identical 256-bit pipelines on which the 512-bit operations had to be implemented. In this special case, where the aggregate width of the existing pipelines matches the width of the new operations, it is likely that simultaneous execution is actually cheaper to implement than serial execution, and simultaneous execution might also improve the performance of a small number of instructions.
We cannot know for sure which variant is cheaper, because that depends on the physical layout of the execution pipelines. The simultaneous variant might need slightly less logic, but it may turn out to need a larger area for routing.
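The "few instructions that need to access both halves" mentioned above are the lane-crossing ones. A sketch of why they are awkward for serial execution (again illustrative C, not AMD's design; permute512 and its index scheme are assumptions modeled loosely on a VPERMQ-style shuffle): every output lane may select from any of the 8 input lanes, so even the low-half result can depend on the high input half, and both halves must be available at once.

```c
#include <stdint.h>

/* Illustrative 512-bit lane-crossing permute over 8 x 64-bit lanes.
 * Lanes 0-3 form the low 256-bit half, lanes 4-7 the high half. */

typedef struct { uint64_t lane[8]; } vec512q;

static vec512q permute512(vec512q src, const int idx[8]) {
    vec512q r;
    for (int i = 0; i < 8; i++)
        r.lane[i] = src.lane[idx[i] & 7]; /* may cross the 256-bit halves */
    return r;
}
```

With simultaneous execution on two 256-bit pipes, both input halves are in flight in the same cycle and such cross-half reads are straightforward; with serial execution, the hardware needs extra buffering or cross-cycle forwarding to get the not-yet-issued half to the half being executed.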