By: Travis (travis.downs.delete@this.gmail.com), April 26, 2017 12:50 pm
Room: Moderated Discussions
Travis (travis.downs.delete@this.gmail.com) on April 26, 2017 1:00 pm wrote:
> anon (spam.delete.delete@this.this.spam.com) on April 26, 2017 11:35 am wrote:
> > Can anyone verify this and/or test SKL?
>
> I'll run Skylake.
For Skylake I get the following.
Case 18: Zeroing AVX XOR (i.e., vxorps ymm1, ymm1, ymm1)
Peaks at exactly 224 instructions, so the ROB size is found exactly by this test.
Between 216 and 224 the performance is already starting to degrade somewhat, but the spike is still quite sharp (for example, at 220 instructions, you are still much closer to the "fast" line than the slow), so the hardware is able to mostly use the ROB even at the limit in this test.
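For anyone who hasn't looked at how this kind of test is built, here is a rough sketch of generating the loop body for one data point. It assumes the usual approach of putting N filler instructions between two long-latency loads, so the second load can only overlap with the first if everything in between fits in the window. The helper name `make_test_body` and the two load instructions are hypothetical, not taken from the actual test harness; only the vxorps filler comes from Case 18.

```python
def make_test_body(filler, count):
    """Build the assembly text for one data point: load, N fillers, load.

    Assumption: both loads are pointer-chasing loads that miss in cache,
    on two independent chains (rax and rbx). These are illustrative only.
    """
    first_miss = "mov rax, [rax]"    # hypothetical long-latency load
    second_miss = "mov rbx, [rbx]"   # hypothetical second, independent load
    return [first_miss] + [filler] * count + [second_miss]


# Case 18 filler from the post, at the 224-instruction point.
body = make_test_body("vxorps ymm1, ymm1, ymm1", 224)
print(len(body), "lines;", "first filler:", body[1])
```

Sweeping `count` and timing each variant is what produces the fast and slow lines in the graphs.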
Case 19: Non-zeroing AVX XOR
Peaks at 150 instructions, which is quite consistent with a register file of 168 AVX regs (150 speculative + 16 committed = 166). This seems much better than the IVB and HSW results, so perhaps they are no longer using some of the SIMD PRF to store the top parts of regs in dirty mode. Indeed, the behavior of mixed VEX and non-VEX code has totally changed in Skylake: now you get merging ops when you are running non-VEX code with dirty upper state (see for example this question), so there is no longer a need to stash away dirty upper halves.
This peak is quite sharp here too - perf is still mostly bad at 148, but 143 is very fast. So there is a small region, perhaps, where PRF or ROB allocation isn't perfect near the limit.
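One way to pull the transition point out of a sweep is to look for the largest step between neighbouring samples. This is only a sketch: `find_knee` is a hypothetical helper, and the sample values below are made up to show the shape of the input, they are not the measured Skylake numbers (those are in the spreadsheet).

```python
def find_knee(samples):
    """Return the filler count just before the largest jump in time.

    samples: iterable of (filler_count, time_per_iteration) pairs.
    """
    ordered = sorted(samples)
    if len(ordered) < 2:
        raise ValueError("need at least two samples")
    best_count, best_jump = None, float("-inf")
    for (n0, t0), (n1, t1) in zip(ordered, ordered[1:]):
        jump = t1 - t0
        if jump > best_jump:
            best_count, best_jump = n0, jump
    return best_count


# Synthetic illustration only, not measured data.
synthetic = [(140, 100.0), (143, 101.0), (146, 120.0),
             (148, 180.0), (150, 200.0), (152, 201.0)]
print(find_knee(synthetic))
```

A soft transition like the one described above shows up as several medium-sized steps rather than a single large one, so the step sizes themselves are worth eyeballing too.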
Case 21: Mixed AVX and Integer
The peak is at 212, quite close to the ROB size of 224. Still, we don't get all the way to 224, so, as Henry noted, there is some other limit beyond the ROB itself.
Final interesting note: all of the zeroing-idiom, nop, or move-eliminated cases get pretty much to the ROB size (> 210 ops in the window). *Any* other test, even independent integer ops or possibly-eliminated movs, only gets to almost exactly 150 ops (just like the non-zeroing AVX). So there is some limit around 150 for both AVX and integer code. Given that the integer register file is 180 on Skylake, it seems that the integer limit should be higher, but perhaps they have more non-speculative state.
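To put numbers on the register-file argument, here is the arithmetic as a small sketch. It uses only the figures quoted in this post (168 vector and 180 integer physical registers, a window of about 150 in both cases); the count of 16 architectural registers per file and the helper name are my assumptions for illustration.

```python
def implied_non_speculative(prf_entries, observed_window):
    """Physical registers NOT available to in-flight instructions."""
    return prf_entries - observed_window


ARCH_REGS = 16  # assumed: 16 architectural ymm regs, 16 architectural GPRs

for name, prf in (("vector", 168), ("integer", 180)):
    held_back = implied_non_speculative(prf, 150)
    unexplained = held_back - ARCH_REGS
    print(f"{name}: {held_back} held back, {unexplained} beyond the architectural regs")
```

On these assumptions the vector file has only 2 entries unaccounted for, while the integer file has 14, which is the gap the "more non-speculative state" guess is trying to explain.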
You can see all the results and graphs here (for cases 18, 19, 21):
https://docs.google.com/spreadsheets/d/1rGT4sbf-szDMmoisMvJXl5h94yRhnyVWHyFgbRQvRWA/edit?usp=sharing