By: Travis (travis.downs.delete@this.gmail.com),
Room: Moderated Discussions
Messing around on Skylake W, I found something weird with vzeroall - but it turns out it applies to the Skylake client I've had for ages as well (but not Haswell).
After issuing a vzeroall, the zeroed registers (i.e., all of them except zmm{16-31}) appear to stay in some kind of unusual state where accessing them sometimes takes an extra cycle. As soon as the register is used as a destination in any instruction, it exits the weird mode (this includes even zeroing idioms like vpxor with itself).
For example, we expect a simple string of AVX2 additions like this:
vpaddq ymm0, ymm1, ymm0
to take exactly 1 cycle per addition, and indeed it almost always does.
On Skylake, however, if vzeroall appears before this sequence, and nothing writes to ymm1, it will take 1.67 cycles per instruction, i.e., 3 instructions every 5 cycles. Same for 128-bit adds on xmm registers. 512-bit additions (zmm registers) are slightly faster, at a consistent 1.5 cycles.
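To make the arithmetic behind those figures explicit, here is a small sketch that treats the quoted throughputs as exact ratios (an assumption: 1.67 is read as 5/3, as the "3 instructions every 5 cycles" wording suggests). The helper name `extra_cycles_per_instruction` is made up for illustration.

```python
from fractions import Fraction

# Throughputs quoted in the post, in cycles per vpaddq.
baseline = Fraction(1, 1)   # normal case: 1 cycle per addition
slow_ymm = Fraction(5, 3)   # xmm/ymm after vzeroall: 3 instructions every 5 cycles
slow_zmm = Fraction(3, 2)   # zmm after vzeroall: 1.5 cycles per instruction

def extra_cycles_per_instruction(observed, expected=baseline):
    """Average penalty per instruction relative to the expected throughput."""
    return observed - expected

print(float(slow_ymm))                          # 1.6666666666666667
print(extra_cycles_per_instruction(slow_ymm))   # 2/3
print(extra_cycles_per_instruction(slow_zmm))   # 1/2
```

If the penalty is always exactly one cycle (again an assumption, not something measured here), an average of 2/3 would correspond to two out of every three additions paying it in the xmm/ymm case, and 1/2 to every other addition in the zmm case.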
The effect happens only with vzeroall, not vzeroupper.
The effect happens only with xmm, ymm, and zmm{0-15}, but not with zmm{16-31} - which aren't modified by vzeroall.
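The variants described above can be written out with a short generator. This is only a sketch for enumerating the cases, not the harness used for the measurements: the function name `make_kernel`, its parameters, and the choice of zmm16/zmm17 for the unaffected case are hypothetical, and the output is just assembly text that would still need to be placed in a timing loop and assembled.

```python
def make_kernel(zeroing="vzeroall", reg="ymm", src=1, dst=0, count=4):
    """Return assembly lines for one variant: optional zeroing instruction,
    then a dependent chain of vpaddq that reads `src` and never writes it."""
    lines = []
    if zeroing:  # None means "no zeroing instruction" (baseline)
        lines.append(zeroing)
    add = f"vpaddq {reg}{dst}, {reg}{src}, {reg}{dst}"
    lines.extend([add] * count)
    return lines

variants = {
    "baseline, no zeroing": make_kernel(zeroing=None),
    "vzeroall, ymm": make_kernel(),
    "vzeroall, xmm": make_kernel(reg="xmm"),
    "vzeroall, zmm0-15": make_kernel(reg="zmm"),
    "vzeroall, zmm16-31": make_kernel(reg="zmm", src=17, dst=16),
    "vzeroupper, ymm": make_kernel(zeroing="vzeroupper"),
}

for name, body in variants.items():
    print(f"; {name}")
    print("\n".join(body))
```

Going by the observations in the post, only the three middle variants (vzeroall with xmm, ymm, or zmm0-15 sources) would be expected to show the slowdown on Skylake.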
Huh.
It doesn't matter much in practice since using a register zeroed by vzeroall is not very common: you almost always need to put something in a register before you use it as a source, and if you want zero you'll use vpxor or xorps or whatever. Although these are zero-latency zeroing idioms handled at rename, they don't produce the same effect.
So it's more of a curiosity - but what would produce that effect? Maybe it's something to do with the "in use" tracking used by XSAVEOPT (but Haswell had that and I don't see it on Haswell)? Maybe registers zeroed in this way are represented in some special way in the RAT, which is slower in some cases?
Thread (1 posts)
| Topic | Posted By | Posted |
|---|---|---|
| TIL: something weird about vzeroall | Travis | |


