anon on January 22, 2020 7:06 pm wrote:
> Travis Downs on January 22, 2020 2:28 pm wrote:
> > anon on January 17, 2020 8:12 pm wrote:
> > > Travis Downs on January 17, 2020 10:55 am wrote:
> > > > should say: (new LINK)
> > >
> > > Have you tested if using new registers (xmm16-31) is any different from old xmms?
> >
> > The upper 16 registers are different in the sense that dirtying
> > them doesn't cause you to suffer from the implicit widening effect.
> >
> > That is, if you dirty the upper bits of zmm0 to zmm15, all future SIMD and FP instructions will
> > be widened to 512 bits (yes, this means that 128-bit SIMD FP instructions will cause you to use
> > the L2 license: they are just as heavy as 512-bit instructions despite calculating only 128 bits of result).
> >
> > However, if you dirty the uppers of zmm16 to zmm31, this effect doesn't happen: there is no
> > implicit widening. This is probably because legacy instructions only access 0-15, and the
> > whole vzeroupper mechanism, with its associated tracking and merging scenarios, applies only to 0-15.
> >
> > I believe the same is true for ymm16-31 too: if you dirty those, there is no implicit
> > widening to 256 bits for subsequent instructions. I haven't tested it though.
> >
> > Note that this applies to the dirtying instructions, not
> > the subsequent "widened" instructions. If you dirty the
> > uppers of 0-15, then use xmm16+ or ymm16+, implicit widening still occurs, since it is a CPU-wide state.
> >
> > Does it answer your question?
> >
> >
> Yes. I was just curious if there is any downside to using
> extra registers, but it seems that this is a strict win.

Well, one possible downside is that the EVEX-encoded instructions (needed to access xmm16+) are sometimes an extra byte or so longer than their VEX equivalents (though sometimes they are shorter).
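
To make the size tradeoff concrete, here is a rough Python sketch of the arithmetic. It is not an encoder: the helper names are made up for illustration, and the model only counts prefix, opcode, ModRM and displacement bytes (no SIB byte, immediates or segment prefixes). It assumes a full-vector memory operand without broadcast, so that the EVEX compressed displacement (disp8*N) scales by the vector size in bytes. The VEX prefix is 2 or 3 bytes and the EVEX prefix is always 4, which is where the extra byte or two comes from; the scaled one-byte displacement is where EVEX can win it back.

```python
def disp_bytes(disp, scale=1):
    """Bytes used to encode a memory displacement.

    scale=1 models the plain disp8 used by VEX; scale=N models EVEX disp8*N,
    where a one-byte displacement is implicitly multiplied by N.
    """
    if disp == 0:
        return 0  # assumes the base register is not rbp/r13
    if disp % scale == 0 and -128 <= disp // scale <= 127:
        return 1  # fits in a single (possibly scaled) byte
    return 4      # otherwise a full 32-bit displacement is needed


def vex_length(disp=None, three_byte_prefix=False):
    """Length of a VEX instruction: prefix + opcode + ModRM + displacement."""
    prefix = 3 if three_byte_prefix else 2
    return prefix + 2 + (0 if disp is None else disp_bytes(disp))


def evex_length(disp=None, vector_bytes=16):
    """Length of an EVEX instruction; the prefix is always 4 bytes."""
    return 4 + 2 + (0 if disp is None else disp_bytes(disp, vector_bytes))


# Register-only form, e.g. vpaddd xmm, xmm, xmm
print(vex_length(), evex_length())  # 4 6

# 256-bit memory operand at [base + 256]: VEX needs disp32, EVEX fits 256/32 in disp8
print(vex_length(disp=256), evex_length(disp=256, vector_bytes=32))  # 8 7
```

So under these assumptions the register-only form costs two extra bytes with EVEX (one if the VEX version already needs the 3-byte prefix), while a load at a large, vector-aligned offset comes out a byte shorter.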

Also, not all AVX/AVX2 instructions are available for use with the new registers. E.g., in my test for this stuff I had vpcmpeqd xmm0, xmm0, xmm0 to set the register to all ones, but this instruction is not available for xmm16+ as there is no EVEX-encoded version with a vector destination (all the EVEX comparisons write their result into a mask register). So if you are primarily writing AVX/AVX2 code, that might be annoying.

Finally, I assume that if the compressed xsave/xrstor variants are being used, they only write registers that have been dirtied, so using more registers could lead to, e.g., slightly slower context switches.
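
A back-of-the-envelope sketch of that last point, in Python. The component sizes are the architectural ones from the Intel SDM; everything else is an assumption for illustration: the function name is made up, and the model simply assumes that a component is written only when it is in use, and that touching any of registers 16-31 marks the whole Hi16_ZMM component as in use. It says nothing about what a particular OS actually does.

```python
# Sizes in bytes of the AVX/AVX-512 XSAVE state components (Intel SDM values).
COMPONENT_BYTES = {
    "AVX": 256,        # upper 128 bits of ymm0-15
    "opmask": 64,      # k0-k7
    "ZMM_Hi256": 512,  # upper 256 bits of zmm0-15
    "Hi16_ZMM": 1024,  # all 512 bits of zmm16-31
}


def extended_bytes_saved(in_use):
    """Bytes of extended state written, assuming (hypothetically) that only
    in-use components are saved. The legacy x87/SSE area is not counted."""
    return sum(COMPONENT_BYTES[name] for name in in_use)


# Plain AVX2 code using only ymm0-15
print(extended_bytes_saved({"AVX"}))  # 256

# Same code, but also touching a register in the 16-31 range
print(extended_bytes_saved({"AVX", "Hi16_ZMM"}))  # 1280
```

Under this model, the first use of any register in the upper bank adds 1 KiB to what has to be saved and restored, regardless of how many of those registers are used or how wide the accesses are.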
