L2 bandwidth on Skylake

By: Travis Downs (travis.downs.delete@this.gmail.com), November 19, 2018 4:24 pm
Room: Moderated Discussions
The findings about L2 latency discussed in an earlier thread led me to guess that there is a mechanism to bypass data arriving from L2 (as a result of a demand miss in L1) directly to the receiving operation, in the same cycle that it is being inserted in the L1, perhaps on the bypass network. Of course, this only generally works for one load per cycle, since at most one line arrives from L2. If a second load tries to get in on the action in the same cycle, it is rejected and tries again (the 5ish cycle penalty described in that other thread).

This leads to the idea that the fastest way to access a block of data in the L2 using 32-byte reads isn't the usual linear sequence, but rather a pattern of 2 reads that miss in L1 (1 cycle each), followed by 2 reads of lines accessed earlier, which therefore hit in L1 (0.5 cycles each), for 3 cycles total. As it turns out, this works: you can read 128 bytes in 3 cycles on Skylake, sustained (about 42.7 bytes/cycle), which is faster than even Intel claims as their sustained figure (29 bytes/cycle).
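To make the access order concrete, here is a rough Python sketch that only generates the offsets and applies the simple cost model from the paragraph above (1 cycle per L1 miss, 0.5 cycles per L1 hit). It measures nothing, and it is not the actual test kernel. The function names and the `lag` parameter (how many lines back the "earlier" reads reach) are made up for illustration; the distance that works best on real hardware is an assumption here.

```python
LINE = 64   # cache line size in bytes
LOAD = 32   # width of each load in bytes

def interleaved_offsets(num_lines, lag=2):
    """Yield (byte_offset, expected_l1_hit) for the 2-miss / 2-hit pattern.

    Each group reads the low half of two new lines (expected L1 misses),
    then the high half of two lines first touched `lag` lines earlier
    (expected L1 hits, assuming those lines have arrived in L1 by then).
    `lag` is a hypothetical tuning knob, not a value from the post.
    """
    for base in range(0, num_lines + lag, 2):
        for line in (base, base + 1):
            if line < num_lines:
                yield line * LINE, False
        for line in (base - lag, base - lag + 1):
            if 0 <= line < num_lines:
                yield line * LINE + LOAD, True

def model_cycles(accesses, miss_cost=1.0, hit_cost=0.5):
    """Sum the per-load cost under the simple model (not a measurement)."""
    return sum(hit_cost if hit else miss_cost for _, hit in accesses)

if __name__ == "__main__":
    accesses = list(interleaved_offsets(8))
    total_bytes = len(accesses) * LOAD
    cycles = model_cycles(accesses)
    print([off for off, _ in accesses])
    print(total_bytes, "bytes in", cycles, "modeled cycles")
```

For 8 lines this touches every 32-byte half exactly once and the model charges 12 cycles for 512 bytes, i.e. the same 128 bytes per 3 cycles as above. A real kernel would issue the loads in this order with 32-byte vector loads, unrolled.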

You can find the details here.

It might be useful information for some cache-blocked computation kernels that choose L2 as their target cache (this is the most common choice, apparently). You could also use it to write faster versions of any linear-scan read-only algorithms that target L2ish sizes. Unfortunately, it's not that widely applicable: subsequent testing reveals that the behavior applies only to Skylake (and later uarches). Haswell and earlier don't seem to support more than 32 bytes per cycle from the L2 in any scenario. The implication is that something changed in SKL: the L2 interface was upgraded to sustain 64 bytes per cycle, no doubt in preparation for AVX-512 in SKX.

In SKX, this isn't all that interesting if you are targeting AVX-512 since you can use 64-byte loads, and approach the maximum 64-byte L2 bandwidth. However, it would still help in the scenario that you are sharing code between SKX systems and others that don't support AVX-512, without an AVX-512 specific compile.

Comments appreciated and questions accepted!