You can do two 4-cycle loads per cycle

By: Travis Downs (travis.downs.delete@this.gmail.com), September 20, 2018 3:10 pm
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on September 20, 2018 1:34 am wrote:
> > The way I understood it is that if you mix 4 and 5 cycle loads, for example, in a "throughput" scenario,
> > your 4 cycle loads will often end up taking 5 cycles because
> > they are out of alignment with the 5 cycle loads
> > and use the same pipeline stages. In the example, the 4 cycle load can't start in the cycle after a 5 cycle
> > load because it wants the second part of the load pipeline which is what the 5 cycle load is using.
> >
> > So it turns into a 5 cycle load. It maybe gets even messier
> > if the skipped pipeline stages are somewhere in the middle.
> >
> > We do know that 4 cycle loads do play nice in a throughput scenario if there are only
> > 4 cycle loads around since 8 concurrent 4-cycle pointer chases do execute at 2 loads
> > per cycle. Maybe I could add some 5 cycle loads in there and see what happens.
>
> Yeah but even in the nice throughput scenario 4 cycle loads didn't happen with the address
> coming from an ALU, right? So the different latency doesn't seem to be the problem.

Yes, in that scenario all the loads were pure pointer chasing, load-feeds-load. There were no different latencies there, though: everything was 4-cycle, so I don't understand what you are saying about different latencies.

I have now tried more tests in the same vein with 8 to 10 parallel pointer chases, this time mixing 4-cycle and 5-cycle loads. The scenarios run at or close to 2 loads/cycle, so they need high throughput, and they are also close to latency limited (i.e., if the latency is longer than expected, it will also slow down the test).

Overall, everything was very close to as fast as possible. For example, 10 parallel pointer chases, with every chain having a 9:1 ratio of 5-cycle to 4-cycle loads, still achieved the maximum throughput of 2 loads per cycle. Doing 8 parallel chains (barely latency limited) with a similar 7:1 ratio of 5-cycle to 4-cycle loads showed a result of less than 5 cycles, i.e., the occasional 4-cycle load actually executes in 4 cycles even when it is heavily surrounded by 5-cycle loads, although the speedup was about half of what is expected.
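To make the "expected" numbers above concrete, here is a rough back-of-the-envelope model, not something taken from uarch-bench. It assumes each chain is purely serial, that the only limits are the 2 loads/cycle peak and the average load latency along a chain, and that nothing else interferes. The function names are made up for illustration.

```python
from fractions import Fraction

# Assumption: the core can start at most 2 loads per cycle.
PEAK_LOADS_PER_CYCLE = Fraction(2)


def avg_latency(n5, n4):
    """Average latency of a chain that repeats n5 5-cycle and n4 4-cycle loads."""
    return Fraction(5 * n5 + 4 * n4, n5 + n4)


def loads_per_cycle(chains, n5, n4):
    """Ideal throughput: the smaller of the peak rate and the latency bound."""
    latency_bound = Fraction(chains) / avg_latency(n5, n4)
    return min(PEAK_LOADS_PER_CYCLE, latency_bound)


def cycles_per_link(chains, n5, n4):
    """Ideal cycles between consecutive loads of one chain."""
    return Fraction(chains) / loads_per_cycle(chains, n5, n4)


# 10 chains, 9:1 mix: the latency bound (100/49) is above 2, so the peak rate wins.
print(float(loads_per_cycle(10, 9, 1)))  # 2.0

# 8 chains, 7:1 mix versus 8 chains of only 5-cycle loads.
print(float(cycles_per_link(8, 7, 1)))   # 4.875
print(float(cycles_per_link(8, 8, 0)))   # 5.0
```

Under these assumptions the 10-chain case is throughput-bound whether or not the 4-cycle loads really take 4 cycles, so only the 8-chain case can show a difference: the ideal gain is 0.125 cycles (5.0 down to 4.875), and the measured gain was about half of that.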

Basically, it seems like mixing 4 and 5 cycle loads in heavy throughput scenarios doesn't really cause any problem: I never saw a case where a 4 cycle load was worse than forcing the same load to be 5 cycles, and in cases where it could be better it usually was.

All tests on SKL. You can run them in uarch-bench using ./uarch-bench.sh --test-name=*pointer-chase*.