By: Travis (travis.downs.delete@this.gmail.com), December 20, 2017 7:52 pm
Room: Moderated Discussions
Travis (travis.downs.delete@this.gmail.com) on December 20, 2017 1:44 pm wrote:
Test code is available. Anyone with a recent x86 box, I'm interested in your results.
I built it on Linux only, but it should be easy to port to Windows. You can just replace the huge_alloc call with malloc and delete a few other POSIXisms.
git clone https://github.com/travisdowns/bimodal-performance -b rwt
cd bimodal-performance
make
./offset-test.sh asm
./offset-test.sh asm_pf
./offset-test.sh read1
./offset-test.sh read2
This tests the two-read scenario with (asm_pf) and without (asm) prefetching, for a large variety of offsets for the first and second read. It also includes similar 64-byte strided read tests with 1 (read1) and 2 (read2) reads in the loop. The former shows that you can get more than 32B/cycle of bandwidth out of L2 (close to 64B/cycle if you disable the L2 stream prefetcher), meaning that the conventional wisdom that the L1 can only either accept a line from L2 or satisfy reads from the core, but not both at once, isn't correct, at least on Skylake.
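To make the bandwidth arithmetic concrete, here is a minimal sketch of how a cycles-per-iteration figure maps to bytes per cycle. The helper name bytes_per_cycle is hypothetical and not part of the test repo, and it assumes every counted read touches a fresh 64-byte line, as in a 64-byte strided loop. The inputs in the usage note are illustrative, not measurements.

```shell
#!/bin/sh
# Hypothetical helper (not part of bimodal-performance).
# $1 = distinct 64-byte lines touched per loop iteration (assumption)
# $2 = measured cycles per loop iteration
bytes_per_cycle() {
    LC_ALL=C awk -v n="$1" -v c="$2" 'BEGIN {
        if (c <= 0) { print "cycles must be positive" > "/dev/stderr"; exit 1 }
        # Each line is 64 bytes, so traffic per cycle is 64 * lines / cycles.
        printf "%.1f\n", 64 * n / c
    }'
}
```

For example, "bytes_per_cycle 1 2" prints 32.0 (one line every two cycles, the conventional limit), and "bytes_per_cycle 1 1" prints 64.0 (one line per cycle).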
It takes about 2 minutes per test, and you can use your machine while it's running (it's not really sensitive to load, unless there are more runnable processes than you have cores!).
If you want, also test with prefetching disabled. This works on Nehalem and newer:
sudo wrmsr -a 0x1a4 "$((2#1111))"
Undo it with:
sudo wrmsr -a 0x1a4 0
You can try disabling different prefetchers. The only one that helped for me was the L2 streamer:
sudo wrmsr -a 0x1a4 "$((2#0001))"
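If it helps to see which prefetchers a given value turns off, here is a small sketch that decodes the low four bits of the value written to MSR 0x1a4. The function name describe_prefetch_mask is hypothetical. Bit 0 being the L2 streamer matches the command above. The names for bits 1 to 3 follow my understanding of Intel's published description of this MSR, so treat them as an assumption and check the documentation for your CPU.

```shell
#!/bin/sh
# Hypothetical helper: list the prefetchers disabled by a 0x1a4 value.
# A set bit means "disabled". The body runs in a subshell so its
# variables do not leak into the caller.
describe_prefetch_mask() (
    mask=$(( $1 ))
    out=""
    bit=0
    # Names in bit order 0..3 (bits 1-3 are an assumption, see above).
    for name in "L2 streamer" "L2 adjacent line" "L1D (DCU) streamer" "L1D (DCU) IP"; do
        if [ $(( ($mask >> $bit) & 1 )) -eq 1 ]; then
            out="${out:+$out, }$name"
        fi
        bit=$(( bit + 1 ))
    done
    echo "${out:-none}"
)
```

In bash, describe_prefetch_mask "$((2#0001))" prints "L2 streamer", and passing 0 prints "none". It only decodes a number and never touches the MSR itself.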
My results are here.