Zen+ compared to Skylake

By: Travis (travis.downs.delete@this.gmail.com), May 20, 2018 1:38 pm
Room: Moderated Discussions
bakaneko (nyan.delete@this.hyan.wan) on May 20, 2018 7:25 am wrote:

> Anyway, here the full thing if you want to look through it:
> https://pastebin.com/CM6puF0Q

Thanks for these results. Can you tell me what Zen+ chip you have?

After a brief scan comparing the results to Skylake, here are my notes (these are largely similar to the original Zen):

  • Zen seems to only be able to do a 64-bit multiply every 2 clocks, compared to Skylake (and many earlier Intel generations), which do them at 1 per clock. I don't remember that from the earlier Ryzen results and haven't seen it mentioned before, so it might be a test artifact somehow.

  • Zen has quite a bit more of a glass jaw when it comes to unaligned loads and especially stores. On recent Intel, there is no penalty for any load or store, no matter the alignment as long as the access doesn't cross a cache line (64-byte) boundary. If it does, it simply costs about the same as 2 accesses.

    Zen, on the other hand, also has a penalty for any load that crosses a 32-byte boundary, and for any store that crosses a 16, 32, 48 or 64-byte boundary. The penalty is also much larger for stores: only one such store per 5 cycles (except in the special case where an 8-byte store exactly straddles a boundary with 4 bytes on each side: such a store takes only 2 cycles). I don't know how the penalized and non-penalized stores overlap if they occur in sequence, but the average cost of a 64-bit store over all alignments works out to 1.1 cycles for Intel and 2.6 cycles for AMD. The average penalty is smaller for smaller stores and larger for larger (SIMD) stores.

  • Store forwarding latency seems to be 7 cycles generally on Zen+, with some cases around 6 cycles if the load happens to occur at just the right time (5 or 6 cycles after the store). The situation on Skylake is more complex, but it's generally faster at around 4.5 cycles, or as low as 3 cycles if the timing is right (if the load issues 3 cycles after the store).

  • Zen seems to handle software prefetching differently than Intel. On recent Intel, software prefetches are treated as a "must complete" and behave in some ways like loads without a destination register: i.e., they cannot retire until the data has been fetched. So a dense stream of prefetches behaves much like a dense stream of loads from the same cache level. On Zen, prefetches seem to execute 2 per cycle, even when striding over large regions where 2 cache lines per cycle throughput cannot be sustained. My theory would be that if all the fill buffers are full, Zen drops prefetches rather than waiting for a free buffer as Intel does. Both strategies are reasonable and allowed by the ISA documentation. The Intel approach is probably a bit better for carefully tuned kernels where the prefetch instructions are nearly all ultimately useful and issued at the right time, whereas the (alleged) AMD approach favors "real" loads over prefetches and is probably useful in cases where a developer has put in some best-effort prefetches but where the data isn't always subsequently used or where there may be less tuning.

  • Zen has 12-cycle L2 latency and throughput of 1 cache line per 2 cycles. Intel is similar, but has a bit better throughput at about 1 line per 1.66 cycles. Given that AMD's L2 is twice as big, you'd probably have to give AMD an overall edge in L2 performance (of course, Skylake-X now has a 1 MB L2).

  • Zen L3 performance has a high apparent throughput of 2 cycles per cache line, similar to L2. This is for a stride over a 2 MB region. This is much better than Intel's line per ~4.8 cycles, but I'm not sure we can conclude much here since the core is downclocked to 1.5 GHz, and the L3 is an off-core thing that probably runs on its own clock. So the performance may only look good because we are measuring, in 1.5 GHz cycles, something that largely hasn't been downclocked. A similar comment applies to latency.

  • Ryzen has much higher throughput for popcnt, tzcnt, lzcnt and some related instructions than Skylake, and has no false-dependency issue with these instructions.

  • The latency of cache-line-crossing misaligned loads (as opposed to their throughput, discussed above) is considerably better on Zen than Skylake, with only a 1-cycle penalty compared to 6 cycles for Skylake.

One thing that strikes me based on the above is yet another kind of built-in advantage an incumbent like Intel has. People (including compiler writers) make optimization decisions based on the existing deployed hardware (and sometimes just the hardware they are using), which tends to bias the current crop of software (and benchmarks) toward the performance profile of the dominant uarch: even if another uarch picks a different set of trade-offs that might actually be better overall, it won't pan out that way on today's software, which is optimized for the other behavior.

For example, for a long time unaligned memory accesses really sucked and people avoided them. The last few Intel uarchs, however, have reduced the cost of most types of unaligned accesses almost to zero, so there has been a shift in hand-optimized code and (more slowly) compiler heuristics towards more unaligned accesses and fewer big-and-slow prolog and epilog code sections to align accesses, etc. Since AMD has chosen a different tradeoff there, they will probably be hurt somewhat by that change. On the other hand, people generally aren't going to do something that is slow on Intel even if it's fast on AMD, people aren't going to target a 512 KiB L2 versus 256 KiB except in very specialized cases, etc.

Now Ryzen is actually quite close to the performance profile of Skylake, both in absolute terms and, more importantly, in the types of things that are fast and slow, so this probably won't hurt them too much: but one has to wonder if they feel somewhat restricted in their design, in the sense that they are compelled to be fast on code that is written to be fast on Intel.

To some extent this even constrains Intel: it would be hard for them to make a radical change that makes some things much faster and others much slower, even one they believe to be better overall (once software has adjusted), because it would test poorly at release, since existing software and benchmarks are written for the chips of yesterday. So they are compelled to at least hold the line more or less across the board performance-wise (and indeed, the last many generations of chips have largely seen the removal of glass-jaw type slowdowns, rather than radical speedups in any area). It's not as much of a constraint for them, though, since they can still release something that holds the line but is faster in some mostly-unused-but-key area, and then expect people to take advantage of it over time. AMD, not so much (at current levels of market share, anyways).
