By: Adrian (a.delete@this.acm.org), September 26, 2022 11:10 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on September 26, 2022 12:00 pm wrote:
> Jörn Engel (joern.delete@this.purestorage.com) on September 26, 2022 10:56 am wrote:
> > hobold (hobold.delete@this.vectorizer.org) on September 26, 2022 10:33 am wrote:
> > >
> > > Single cycle throughput, byte granularity 512bits wide general permute?
> > > And it is the full one, able to mix bytes from two sources.
> > >
> > > That's a game changer. Dang it!
> >
> > But vpcompress is microcoded and awful. vpexpand probably
> > as well. That's a game changer I wasn't hoping for.
>
>
> It is said that vpexpand is fast, even with a memory operand.
>
> Only vpcompress is slow and only when the destination is in memory.
>
I want to add a puzzling fact: the microcoded vpcompress with a memory destination is very slow, both in comparison with the alternative instruction sequences that emulate it using vpcompress with a register destination, and in comparison with the other variants of vpcompress and vpexpand. This makes me think that the microcoded execution was not intentional.
Maybe the initial intention was to have fast vpcompress and vpexpand with both register and memory operands, but then a bug was discovered in vpcompress with a memory operand, and the instruction was patched with a microcode sequence that is suboptimal due to some unknown constraints on the patch.
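
To make the emulation mentioned above concrete, here is a minimal scalar sketch in C of what such a sequence might look like: compress into a register, then do an ordinary masked store of the low popcount(mask) lanes. This is only a model of the semantics, not measured code. All function names are made up for illustration, and the 16 x 32-bit lane layout is an assumption. On real hardware the three steps would roughly correspond to the intrinsics _mm512_mask_compressstoreu_epi32, _mm512_maskz_compress_epi32 and _mm512_mask_storeu_epi32.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 16 /* assumed layout: 16 x 32-bit lanes of a 512-bit register */

/* Model of vpcompress with a memory destination: the selected lanes are
   written contiguously; memory past the last written lane is untouched. */
void compress_store_model(uint32_t *dst, uint16_t mask, const uint32_t *src)
{
    int n = 0;
    for (int i = 0; i < LANES; i++)
        if (mask & (1u << i))
            dst[n++] = src[i];
}

/* Model of vpcompress with a register destination (zeroing form).
   Returns the number of selected lanes, i.e. popcount(mask). */
int compress_reg_model(uint32_t *out, uint16_t mask, const uint32_t *src)
{
    int n = 0;
    memset(out, 0, LANES * sizeof out[0]);
    for (int i = 0; i < LANES; i++)
        if (mask & (1u << i))
            out[n++] = src[i];
    return n;
}

/* Model of an ordinary masked store: lane i is written only if bit i is set. */
void masked_store_model(uint32_t *dst, uint16_t mask, const uint32_t *v)
{
    for (int i = 0; i < LANES; i++)
        if (mask & (1u << i))
            dst[i] = v[i];
}

/* Hypothetical emulation: compress to a register, then store only the
   low popcount(mask) lanes with a plain masked store. */
void compress_store_emulated(uint32_t *dst, uint16_t mask, const uint32_t *src)
{
    uint32_t tmp[LANES];
    int n = compress_reg_model(tmp, mask, src);
    uint16_t low = (uint16_t)((1u << n) - 1u); /* n low bits set, n in 0..16 */
    masked_store_model(dst, low, tmp);
}

/* Returns 1 if both paths leave identical memory contents for this mask. */
int emulation_matches(uint16_t mask)
{
    uint32_t src[LANES], a[LANES], b[LANES];
    for (int i = 0; i < LANES; i++) {
        src[i] = 100u + (uint32_t)i;
        a[i] = b[i] = 0xDEADBEEFu; /* sentinel to detect stray writes */
    }
    compress_store_model(a, mask, src);
    compress_store_emulated(b, mask, src);
    return memcmp(a, b, sizeof a) == 0;
}
```

The sentinel fill in emulation_matches is there to check that neither path writes beyond the compressed elements, which is the property a masked store has to preserve for the emulation to be a valid replacement.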