Real World Technologies - Forums - Thread: TIL: something weird about vzeroall

By: Travis (travis.downs.delete@this.gmail.com), 2018-08-22 04:46 UTC

Messing around on Skylake W, I found something weird with vzeroall - but it turns out it applies to the Skylake client I've had for ages as well (but not Haswell).

After issuing a vzeroall, the zeroed registers (i.e., all of them except zmm{16-31}) appear to stay in some kind of unusual state where accessing them sometimes takes an extra cycle. As soon as the register is used as a destination in any instruction, it exits the weird mode (this includes even zeroing idioms like vpxor of a register with itself).

For example, we expect a simple string of AVX2 additions like this:



vpaddq ymm0, ymm1, ymm0

to take exactly 1 cycle per addition, and indeed it almost always does.

On Skylake, however, if vzeroall appears before this sequence and nothing writes to ymm1, it takes 1.67 cycles per instruction, i.e., 3 instructions every 5 cycles. The same holds for 128-bit adds on xmm registers. 512-bit additions (zmm registers) are slightly faster, at a consistent 1.5 cycles per instruction.
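One way to read these numbers against the "sometimes takes an extra cycle" description is to ask what share of the dependent adds would have to be slow. The sketch below assumes every add in the chain costs either the normal 1 cycle or 2 cycles (1 plus a single extra cycle). That two-level model, the helper name slow_fraction, and its parameters are illustrative assumptions, not something measured in the post.

```python
from fractions import Fraction


def slow_fraction(instructions, cycles, fast=1, slow=2):
    """Share of adds that must take `slow` cycles (the rest taking `fast`)
    for a dependent chain to average cycles/instructions per add.

    Assumes only two possible latencies; this is a model, not a measurement.
    """
    avg = Fraction(cycles, instructions)
    return (avg - fast) / (slow - fast)


# xmm/ymm case from the post: 3 instructions every 5 cycles.
ymm_avg = Fraction(5, 3)
ymm_slow = slow_fraction(3, 5)

# zmm case from the post: 1.5 cycles per add, i.e. 2 instructions every 3 cycles.
zmm_avg = Fraction(3, 2)
zmm_slow = slow_fraction(2, 3)

print(f"ymm: {float(ymm_avg):.2f} cycles/add, slow share {ymm_slow}")
print(f"zmm: {float(zmm_avg):.2f} cycles/add, slow share {zmm_slow}")
```

Under that assumption, the xmm/ymm figure corresponds to two out of every three adds paying the extra cycle, and the zmm figure to every other add paying it. Other mixes of latencies could produce the same averages.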

The effect happens only with vzeroall, not with vzeroupper.

The effect happens only with xmm, ymm, and zmm{0-15}, not with zmm{16-31}, which aren't modified by vzeroall.

Huh.

It doesn't matter much in practice, since using a register zeroed by vzeroall is not very common: you almost always need to put something in a register before you use it as a source, and if you want zero you'll use vpxor, xorps or whatever. Although these are zero-latency zeroing idioms handled at rename, they don't produce the same effect.

So it's more of a curiosity - but what would produce that effect? Maybe it's something to do with the "in use" tracking used by XSAVEOPT (but Haswell had that and I don't see it on Haswell)? Maybe registers zeroed in this way are represented in some special way in the RAT (register alias table), which is slower in some cases.

