By: Travis Downs (travis.downs.delete@this.gmail.com), February 26, 2019 3:23 pm
Room: Moderated Discussions
So after discussing a tangentially related matter, I went back to this old issue and ran the reproduction case again, on the same box (Skylake client i7-6700HQ) where I collected all the original results. Since then I've updated the microcode several times as new versions have been released (thanks, Spectre).
To my surprise, I cannot reproduce the original "bimodality" at all any more. The tests are always consistent, and they are always slow. Before, I would often get the slow result of 7-10 cycles per cache line, like the yellow and red values here, and also sometimes the much slower result of 16-20 cycles per line (the scattered purple values).
Lo and behold, now I only ever get the really slow values around 18 cycles - my offset vs timing chart looks like this rather than this.
I went and tested Skylake-X and CNL, and they both more or less had only the faster (still slow) timings around 6-8 cycles, like so for CNL.
If anyone has a Skylake client box kicking around, I'd be really interested in your results and your microcode version. Basically:
git clone https://github.com/travisdowns/bimodal-performance
cd bimodal-performance
git checkout rwt
make
./offset-test.sh
and share the result along with the result of:
grep -E -m2 'model name|micro' /proc/cpuinfo
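If you are collecting this from several machines, the command above could be wrapped up roughly as follows. This is only a sketch: `cpu_summary` is a made-up helper name and the `cpu:` / `microcode:` output labels are my own choice, not something the repository provides. It assumes the usual x86 Linux /proc/cpuinfo layout with `model name` and `microcode` fields.

```shell
#!/bin/sh
# Hypothetical helper (not part of the bimodal-performance repo):
# print the first "model name" and "microcode" values found in a
# cpuinfo-style file. Defaults to /proc/cpuinfo when no file is given.
cpu_summary() {
    awk -F': *' '
        # Remember only the first occurrence of each field
        # (the file repeats them once per logical CPU).
        /^model name/ && !got_model { model = $2; got_model = 1 }
        /^microcode/  && !got_ucode { ucode = $2; got_ucode = 1 }
        END { printf "cpu: %s\nmicrocode: %s\n", model, ucode }
    ' "${1:-/proc/cpuinfo}"
}
```

Calling `cpu_summary` with no argument reads the local /proc/cpuinfo; passing a file name lets you run it on a saved copy from another box.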
Linux-only, sorry to any Windows or Mac users who wanted to get in on the fun (in principle the code should run fine there though, minus the optional page-info stuff; ports welcome).
Maybe there was some microcode change which reduced performance for this type of load? Or maybe my box is just weird somehow?