Geekbench 4

By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), July 29, 2016 2:28 pm
Room: Moderated Discussions
John Poole (john.delete@this.primatelabs.com) on July 29, 2016 1:51 pm wrote:
>
> We create one large allocation (200MB) then subdivide that allocation into 1MB blocks. Each
> linked-list node in each 1MB block refers to another node in the block (save for the last node
> which points to the next block). This is how we attempt to minimize the TLB misses.
>
> We randomize the order of the linked-list nodes within each block using a shuffle which in turn
> uses the Mersenne Twister as the PRNG. This should produce a pattern that, overall, will confuse
> the prefetcher. Each node is 64 bytes so two nodes should not occupy the same cache line. Each
> node in the 200MB allocation is accessed once before we start over at the beginning of the allocation
> so everything should fall out of cache even on systems with a large L4$.
>

Ok, that on the whole sounds very good. I think it would be better still if you at least made the nodes 128 bytes in size, just to make sure that a simple, stupid linear prefetcher doesn't actually end up helping.
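
Something like this (untested sketch; the struct name and the static assert are mine, not anything in Geekbench):

    /* Pad the node from 64 to 128 bytes, so a next-line prefetch of
     * line N+1 never pulls in a node the chain will actually visit. */
    struct node {
        struct node *next;
        char pad[128 - sizeof(struct node *)];
    };

    _Static_assert(sizeof(struct node) == 128, "node must span two cachelines");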

With the 1MB "sub-blocking", most modern caches will easily hold the whole sub-block, and while cachelines may be 64 bytes, a mindless prefetch of two lines (or more) will still hide a lot of the latency in your test despite the randomization (not immediately, but soon after you enter each sub-block).

Which would give you falsely low latencies for a noticeable portion of the accesses.
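
To be explicit about what gets timed: the inner loop in this kind of test is just a serially dependent pointer chase, roughly (using the node type above):

    /* Every load address depends on the previous load's result, so the
     * time per iteration is the memory latency - unless a prefetcher
     * guesses the next line early and hides part of it. */
    static struct node *chase(struct node *p, long iters)
    {
        while (iters--)
            p = p->next;
        return p;    /* returned so the compiler can't throw the loop away */
    }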

If you want to keep the memory use dense, then instead of just making the node size bigger, you could try to be clever and do something like 5MB sub-blocking, but then split each 5MB sub-block into 5 parts (just alternating 64-byte cachelines between the parts). So you'd get five separate 1MB lists that you do your randomized linked-list ordering within. And then just make sure that those lists are far enough apart from each other temporally in the bigger chain that there is very little chance of cache re-use.

(The odd "split 5MB into 5 sub-blocks" is just to avoid having the cachelines always be four apart when you create the sub-block: that could potentially be very bad for caches with limited ways, and could otherwise force every line into the same L1 way or something like that).

But just making the nodes 128 bytes would also keep simple prefetchers from showing up as artificially low latency, and is obviously much simpler (likely just a matter of changing a constant in your code).

Finally, it may be worth moving the actual list entry around within the node, just in case some hardware does critical-word-first fills and there's a timing difference between "beginning of cacheline" and "end of cacheline".
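
One way to do that without the walker having to carry extra state is to derive the link's slot within the line from the node's own address. Untested sketch, and the hash constant is arbitrary:

    #include <stdint.h>

    union cline { char bytes[64]; };    /* one raw cacheline */

    /* Pick one of the eight pointer slots in the line from the line's
     * own address, so the link sometimes sits at the start of the line
     * and sometimes at the end. */
    static unsigned link_off(const union cline *n)
    {
        uintptr_t a = (uintptr_t)n >> 6;    /* cacheline number */
        return (unsigned)((a * 0x9E3779B9u >> 13) & 7) * 8;
    }

    static union cline *get_next(union cline *n)
    {
        return *(union cline **)(n->bytes + link_off(n));
    }

    static void set_next(union cline *n, union cline *next)
    {
        *(union cline **)(n->bytes + link_off(n)) = next;
    }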

In fact, with variations of the "split a larger area into many sub-blocks" approach, you could get both the prefetch avoidance and avoid having the list entry always in the same position in the cacheline. You could make the node size just 16 bytes, for example, and just make sure that a chain never uses two nodes that are close to each other.

So you could do that 5MB allocation, split it into 16-byte nodes, then stripe those by some prime number (say, 11), and now you have 11 independently randomized lists with no cacheline overlap within any one list, because that list's nodes are all 176 bytes away from each other.

Then, with 40 of those 5MB allocations, keep each of the lists that share an allocation 40 chains apart in the overall traversal order (and pick a random order of the 11 chains within each allocation when you connect them to the other allocations).
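
The striping part of that would look roughly like so (untested, every constant tweakable, my names; linking the 440 resulting chains across the 40 allocations in that "40 apart" order is left to the setup code):

    #include <stdlib.h>

    #define NODE   16
    #define ARENA  (5u << 20)         /* one 5MB allocation       */
    #define NNODES (ARENA / NODE)
    #define NLISTS 11                 /* prime number of chains   */

    struct tiny {
        struct tiny *next;
        char pad[NODE - sizeof(struct tiny *)];
    };

    /* Chain c owns nodes c, c+11, c+22, ...: within any one chain the
     * nodes are 176 bytes apart, so no two share a 64-byte line, and
     * the link lands in all four 16-byte slots of a line over the
     * course of the chain. */
    static void build_striped(struct tiny *base, struct tiny *heads[NLISTS])
    {
        size_t per = NNODES / NLISTS;
        size_t *idx = malloc(per * sizeof *idx);

        for (size_t c = 0; c < NLISTS; c++) {
            for (size_t k = 0; k < per; k++)
                idx[k] = k * NLISTS + c;
            /* Fisher-Yates shuffle, as in the earlier sketch */
            for (size_t i = per - 1; i > 0; i--) {
                size_t j = (size_t)rand() % (i + 1);
                size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
            }
            for (size_t k = 0; k + 1 < per; k++)
                base[idx[k]].next = &base[idx[k + 1]];
            base[idx[per - 1]].next = NULL;
            heads[c] = &base[idx[0]];
        }
        free(idx);
    }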

I'm not sure I explained that sufficiently, but it should get you fairly dense list use for the TLB (it does assume that the TLB can cover 5MB of memory, which may or may not be true, so you may want to tweak the numbers a bit), while not allowing the end result to be captured by caches or prefetching.

The problem, of course, is that you'll never see the prefetch or critical-word-first issues on most of your test platforms, and then some platforms will just look artificially good. So trying to avoid that kind of problem by being careful from the start would be a good thing.

There are probably many other things I've missed. There's a reason why memory latency testing is hard.

Linus