By: Travis Downs, June 19, 2019 1:15 pm
Why is Intel gather latency so damn high on x86?

On Intel chips the latency depends only on the number of loaded elements, not on their width or the address width[1]. On SKL, gathers of 2, 4 or 8 elements have latencies of 18, 20 and 22 cycles respectively.

On SKX, the latency dropped by a cycle and you also get 16-element gathers: the latency is 17, 19, 21 and 25 cycles for 2, 4, 8 and 16 elements respectively.

All of these take 1 load uop per element, plus ~3 other uops in total on p015, probably to merge the results.

It looks like the latency differences between the various element counts are mostly explained by the additional load uops: at two loads per cycle, starting 8 extra loads takes 4 cycles, so 21 + 4 = 25. The same holds for 4 vs 8 elements: 4 extra loads take 2 cycles, so 19 + 2 = 21. (2 vs 4 is off by a cycle though, which is not all that weird.)
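To make that arithmetic concrete, here is a rough sketch in Python. It is my own illustration, not a measurement: it assumes two load uops can start per cycle, compares the latency steps that assumption predicts with the SKX numbers quoted above, and uses names (LOADS_PER_CYCLE, predicted_step, compare) that are made up for the example.

```python
# Measured SKX gather latencies quoted in the post, keyed by element count.
SKX_LATENCY = {2: 17, 4: 19, 8: 21, 16: 25}

# Assumption for this sketch: two load uops can start per cycle.
LOADS_PER_CYCLE = 2


def predicted_step(n_from, n_to):
    """Extra cycles needed just to start the additional load uops."""
    extra_loads = n_to - n_from
    return -(-extra_loads // LOADS_PER_CYCLE)  # ceiling division


def compare(latencies):
    """Return (n_from, n_to, measured_step, predicted_step) for adjacent sizes."""
    sizes = sorted(latencies)
    return [
        (a, b, latencies[b] - latencies[a], predicted_step(a, b))
        for a, b in zip(sizes, sizes[1:])
    ]


for a, b, measured, predicted in compare(SKX_LATENCY):
    print(f"{a:2d} -> {b:2d} elements: measured +{measured}, predicted +{predicted}")
```

Under that assumption the 4 -> 8 and 8 -> 16 steps match the measurements, and the 2 -> 4 step comes out one cycle short.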

So there is some kind of baseline latency of 17-18 cycles for any gather, in addition to the time to issue all the load uops. What could cause that?

On Zen, the latency is similar (actually a couple of cycles better in many cases), but the throughput is terrible: a gather's reciprocal throughput is about the same as its latency. Gathers use up to 65 uops there, so it's a crummy implementation, but at least everything makes sense: the latency is that long because the total amount of work is huge and you are limited by execution throughput when chewing through dozens of uops.


[1] That is, the QQ, DQ and QD forms for a given vector size load the same number of elements and have the same latency. Similarly, the QQ form for ymm registers loads 4 elements just like the DD form for xmm regs, and they have the same latency.
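The element counts in the footnote can be tabulated with a small Python sketch. It assumes the usual naming of the gather forms (first letter is the index width, second is the data width; D = 32 bits, Q = 64 bits) and takes the "vector size" to be the widest register involved. The helper gather_elements is a name invented for this illustration.

```python
def gather_elements(vector_bits, index_bits, data_bits):
    """Number of elements a gather loads.

    vector_bits: width of the widest register involved (128 for xmm, 256 for ymm)
    index_bits:  width of each index (32 for D, 64 for Q)
    data_bits:   width of each loaded element (32 for D, 64 for Q)
    """
    # The wider of index and data limits how many elements fit in the vector.
    return vector_bits // max(index_bits, data_bits)


D, Q = 32, 64
# Form name -> (index width, data width)
FORMS = {"DD": (D, D), "DQ": (D, Q), "QD": (Q, D), "QQ": (Q, Q)}

for bits, reg in ((128, "xmm"), (256, "ymm")):
    for name, (idx, data) in FORMS.items():
        print(f"{reg} {name}: {gather_elements(bits, idx, data)} elements")
```

This gives 4 elements for both the ymm QQ form and the xmm DD form, and the same count for the QQ, DQ and QD forms at a given vector size.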
