By: Travis Downs (travis.downs.delete@this.gmail.com), September 20, 2018 3:10 pm

Room: Moderated Discussions

anon (spam.delete.delete@this.this.spam.com) on September 20, 2018 1:34 am wrote:
> > The way I understood it is that if you mix 4 and 5 cycle loads, for example, in a "throughput" scenario,
> > your 4 cycle loads will often end up taking 5 cycles because
> > they are out of alignment with the 5 cycle loads
> > and use the same pipeline stages. In the example, the 4 cycle load can't start in the cycle after a 5 cycle
> > load because it wants the second part of the load pipeline, which is what the 5 cycle load is using.
> >
> > So it turns into a 5 cycle load. It maybe gets even messier
> > if the skipped pipeline stages are somewhere in the middle.
> >
> > We do know that 4 cycle loads do play nice in a throughput scenario if there are
> > *only* 4 cycle loads around, since 8 concurrent 4-cycle pointer chases do execute at 2 loads
> > per cycle. Maybe I could add some 5 cycle loads in there and see what happens.
>
> Yeah but even in the nice throughput scenario 4 cycle loads didn't happen with the address
> coming from an ALU, right? So the different latency doesn't seem to be the problem.

Yes, in that scenario all the loads were pure pointer chasing, load-feeds-load. There were no different latencies there, though: everything was 4-cycle, so I'm not sure what you mean about different latencies.

I have now tried more tests in the same vein, with 8 to 10 parallel pointer chases, this time mixing 4-cycle and 5-cycle loads. The scenarios run at or close to 2 loads/cycle, so they need high throughput, and they are also close to latency limited (i.e., if the latency is longer than expected, that will also slow down the test).

Overall, everything was very close to as fast as possible. For example, 10 parallel pointer chases, with every chain having a 9:1 ratio of 5-cycle to 4-cycle loads, still achieved the max throughput of 2 loads per cycle. Doing 8 parallel chains (barely latency limited), with a similar 7:1 ratio of 5-cycle to 4-cycle loads, showed a result of less than 5 cycles per load, i.e., even the occasional 4-cycle load actually executes in 4 cycles when heavily surrounded by 5-cycle loads, although the speedup was about half of what was expected.

Basically, it seems like mixing 4 and 5 cycle loads in heavy throughput scenarios doesn't really cause any problems: I never saw a case where a 4-cycle load was worse than forcing the same load to take 5 cycles, and in cases where it could be faster, it usually was.

All tests on SKL. You can run them in uarch-bench using


`./uarch-bench.sh --test-name=*pointer-chase*`
