Cannonlake notes

By: Travis Downs (travis.downs.delete@this.gmail.com), December 31, 2018 12:47 am
Room: Moderated Discussions
Some miscellaneous notes from recent testing on CNL that may be of interest:

The popcnt false dependency is finally gone. Previously, popcnt falsely depended on the prior value of its write-only destination register. Compilers unaware of this generated slow code, and those aware of it had to jump through hoops to break the dependency, such as zeroing the output register before the popcnt. This seems to have been introduced in Sandy Bridge, and it subsequently also appeared for lzcnt and tzcnt (but those were fixed in Skylake), so it's nice to get a fix even if it took the better part of a decade. More background at [1].
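For anyone who hasn't seen the workaround, here is a rough sketch of the "zero the output first" trick in C with GCC/Clang inline assembly. The helper names popcnt_nodep and count_bits are made up for illustration, x86-64 with POPCNT is assumed, and other targets just fall back to the compiler builtin. Compilers may already emit something like this on their own, and on CNL it should no longer matter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: popcnt with the destination register zeroed first,
   so the result does not wait on whatever last wrote that register. */
static inline uint64_t popcnt_nodep(uint64_t x)
{
#if defined(__x86_64__)
    uint64_t r;
    /* "=&r" is an early-clobber output: r never shares a register with x,
       so the xor cannot destroy the input. */
    __asm__("xorl %k0, %k0\n\t"
            "popcntq %1, %0"
            : "=&r"(r)
            : "r"(x)
            : "cc");
    return r;
#else
    /* Assumption: GCC/Clang builtin is available on non-x86-64 targets. */
    return (uint64_t)__builtin_popcountll(x);
#endif
}

/* Total number of set bits in buf[0..n-1]. */
uint64_t count_bits(const uint64_t *buf, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += popcnt_nodep(buf[i]);
    return total;
}
```

The xor is a recognized zeroing idiom, so it costs little in the back end; the point is only that the popcnt then depends on a register that is known to be ready.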

The weirdness described at [2], where using registers zeroed with vzeroall sometimes resulted in unexpectedly low throughput/high latency, seems to have disappeared: such loops now run at the expected 1 cycle per dependent addition. I guess this implies some uarch change somewhere, but I never had a solid explanation for that effect in the first place, so it's hard to say more.

The LSD is back. This speeds up some loops, especially those with taken branches inside the loop like so:


.top:
xor eax, eax
jz .l1
nop
.l1:
dec rdi
jne .top


That runs in 3 cycles/iteration out of the DSB but 2 cycles/iteration with the LSD. Other loops, on the other hand, may be slightly slowed down (more details at [3]). Presumably power is being saved!

The divider unit is dramatically better, as reported elsewhere. Unlike previous versions of div, the timings aren't data dependent: the performance is the same regardless of the inputs. For both div and idiv I measure 10 cycles reciprocal throughput, 15 cycles latency for the quotient (in rax) and 18 cycles latency for the remainder (in rdx), and only 4 uops: 3 on p1 and 1 on p056 (compared to 25-40 uops on earlier uarches). Perhaps integer division is now sharing the FP divider hardware, since DP division has similar performance characteristics (partial pipelining, overall latency) and is only somewhat easier (53-bit mantissa).
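To make the latency vs. throughput distinction concrete, here is a back-of-the-envelope model in C using the figures above. The function est_div_cycles is hypothetical, and the model is a deliberate simplification: it ignores every other instruction in the loop and assumes the divider is the only bottleneck.

```c
#include <stdint.h>

/* Figures measured above for div/idiv on CNL (quotient in rax). */
enum { DIV_LATENCY = 15, DIV_RECIP_TPUT = 10 };

/* Rough cycle estimate for n divisions (hypothetical model).
   dependent != 0: each div consumes the previous quotient, so the
                   full latency is paid every time.
   dependent == 0: the divs are independent, so a new one can start
                   every DIV_RECIP_TPUT cycles and only the last one
                   exposes its full latency. */
uint64_t est_div_cycles(uint64_t n, int dependent)
{
    if (n == 0)
        return 0;
    if (dependent)
        return n * DIV_LATENCY;
    return (n - 1) * DIV_RECIP_TPUT + DIV_LATENCY;
}
```

So 100 chained divisions come out at about 1500 cycles and 100 independent ones at about 1005, which is the sense in which the unit is only partially pipelined.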

The available per-thread MLP (outstanding L1 misses) has apparently roughly doubled to 20+ misses.
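One common way to probe a number like this is to run several independent pointer chases at once and see where adding more chains stops helping. The sketch below shows the idea only; it is not the exact test used for the figure above. build_cycle, chase and MAX_CHAINS are names invented here, and rand() is assumed to be good enough for shuffling (its range is small on some platforms).

```c
#include <stddef.h>
#include <stdlib.h>

/* Fill next[0..n-1] with one random cycle (Sattolo's algorithm), so a
   chase started anywhere visits every element before it repeats. */
void build_cycle(size_t *next, size_t n, unsigned seed)
{
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    if (n < 2)
        return;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* 0 <= j < i, never j == i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Advance `chains` independent chases by `steps` loads each. Loads within
   a chain are serialized by the data dependency; loads from different
   chains are independent and can miss in parallel. */
size_t chase(const size_t *next, size_t n, size_t chains, size_t steps)
{
    enum { MAX_CHAINS = 64 };
    size_t pos[MAX_CHAINS];
    size_t sum = 0;

    if (n == 0 || chains == 0)
        return 0;
    if (chains > MAX_CHAINS)
        chains = MAX_CHAINS;
    for (size_t c = 0; c < chains; c++)
        pos[c] = (c * (n / chains)) % n;   /* spread the starting indices */
    for (size_t s = 0; s < steps; s++)
        for (size_t c = 0; c < chains; c++)
            pos[c] = next[pos[c]];
    for (size_t c = 0; c < chains; c++)
        sum += pos[c];
    return sum;   /* returned so the loads cannot be optimized away */
}
```

Timing chase() over a buffer that doesn't fit in L1, for chains = 1, 2, 4, ..., and dividing by the total number of loads should show the time per load falling until the chain count reaches the available MLP. Page-walk costs and the spacing of elements relative to cache lines will affect the result, so treat any number it produces as approximate.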




[1] https://stackoverflow.com/q/21390165/149138
[2] https://github.com/travisdowns/uarch-bench/wiki/Intel-Performance-Quirks#registers-zeroed-via-vzeroall-are-sometimes-slower-to-use-as-source-operands
[3] https://stackoverflow.com/q/39311872/149138