By: Adrian (a.delete@this.acm.org), November 7, 2022 9:07 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on November 7, 2022 3:19 am wrote:
> hobold (hobold.delete@this.vectorizer.org) on November 6, 2022 3:48 am wrote:
> > Adrian (a.delete@this.acm.org) on November 5, 2022 2:00 am wrote:
> >
> > [...]
> > > This method would be optimal, by requiring the least hardware, for code that only uses 512-bit instructions
> > > without interleaving them with 256-bit operations (which would be a very stupid programming style).
> >
> > SMT might interleave instruction streams that use vectors of different width, even
> > when each individual program is using a single fixed vector width exclusively.
>
>
> Good point.
>
>
> I am still not convinced that improving the throughput for this unusual workload is worth extra hardware.
>
I want to add that on Zen 4 (as on Intel's non-server CPUs, such as Rocket Lake, Tiger Lake or Ice Lake Client), a single thread that uses AVX-512 heavily should be able to saturate all the FP/SIMD execution pipelines. A second SMT thread that also uses AVX or AVX-512 would then not be able to increase the throughput.
So it is likely that, on Zen 4, when one thread of a core makes heavy use of AVX-512, the best throughput is obtained when the second thread is either halted or runs a program that does not use any FP/SSE/AVX/AVX-512 instructions.
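
To make the saturation argument concrete, here is a toy model, not a measurement. It assumes the core has a fixed number of FP/SIMD issue slots per cycle and that competing threads share them proportionally to their demand; the function name, the capacity of 4.0 and the per-thread demands are all made-up illustrative values, not Zen 4 data.

```python
def core_simd_throughput(capacity, demands):
    """Toy model: per-thread SIMD throughput on one core.

    capacity -- hypothetical FP/SIMD slots the core can issue per cycle
    demands  -- slots per cycle each SMT thread would use if it ran alone
    Assumption: when total demand exceeds capacity, the slots are shared
    proportionally to demand. Real hardware arbitration may differ.
    """
    total = sum(demands)
    if total <= capacity:
        return list(demands)
    scale = capacity / total
    return [d * scale for d in demands]


CAPACITY = 4.0  # hypothetical value, chosen only for illustration

alone = core_simd_throughput(CAPACITY, [4.0])        # one saturating thread
shared = core_simd_throughput(CAPACITY, [4.0, 4.0])  # two SIMD-heavy threads
mixed = core_simd_throughput(CAPACITY, [4.0, 0.0])   # second thread uses no SIMD

print("one thread:      ", alone, "total", sum(alone))
print("two SIMD threads:", shared, "total", sum(shared))
print("SIMD + non-SIMD: ", mixed, "total", sum(mixed))
```

In this model the core total is the same in all three cases; the second SIMD-heavy thread only halves the share of the first. The model ignores everything outside the FP/SIMD pipelines (front end, caches, memory), which is where a non-SIMD second thread could still add useful work.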