By: Travis Downs (travis.downs.delete@this.gmail.com), September 20, 2018 3:33 pm
Room: Moderated Discussions
Wilco (Wilco.dijkstra.delete@this.ntlworld.com) on September 20, 2018 2:32 am wrote:
> The 4-cycle path might only work within the load/store unit because of timing.
> Forwarding within a unit is faster than from a different unit.
>
> However it's not clear this is the case, you need sequences like alu->load5->load->alu
> to check whether the load->alu latency can ever be 4 cycles.
I wrote two tests: one with a 5-cycle load feeding a 4-cycle load and then an ALU op, and one with the two loads swapped (both are looped, so the ALU op feeds back into the first load).
The results, in cycles per iteration:
load5 -> load4 -> alu 10.00
load4 -> load5 -> alu 11.00
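The ALU op has a latency of one cycle, so the first chain is 5 + 4 + 1 = 10, while in the second chain the would-be 4-cycle load is fed by the ALU op and takes 5 cycles, giving 5 + 5 + 1 = 11. So it definitely seems that what matters for the fast load is that its input comes from another load, not where its output goes.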
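
To make the arithmetic explicit, here is a minimal sketch of the latency model those two numbers suggest. It is not the benchmark itself. The op names `load5`, `load4`, and `alu` and the function `loop_latency` are made up for illustration, and the rule "a `load4` only gets the 4-cycle path when its producer is another load" is the hypothesis being checked, not a documented fact.

```python
# Hypothetical latency model for illustration only; not the actual test code.
ALU_LATENCY = 1   # measured: the ALU op adds one cycle
LOAD_FAST = 4     # assumed: eligible load whose input comes from another load
LOAD_SLOW = 5     # assumed: every other load


def loop_latency(chain):
    """Return modeled cycles per iteration for a looped dependency chain.

    chain is a list of op names ("load5", "load4", "alu"). The chain is
    circular: the last op feeds the first one on the next iteration.
    """
    total = 0
    for i, op in enumerate(chain):
        producer = chain[i - 1]  # index -1 wraps around to the last op
        if op == "alu":
            total += ALU_LATENCY
        elif op == "load4" and producer.startswith("load"):
            total += LOAD_FAST   # fast path: address produced by a load
        else:
            total += LOAD_SLOW   # load5, or load4 fed by the ALU op
    return total


print(loop_latency(["load5", "load4", "alu"]))  # 10
print(loop_latency(["load4", "load5", "alu"]))  # 11
```

Under this model both measured values are reproduced. A model in which the fast path depended on the consumer of the load would instead predict the same latency for both orderings, since each chain has exactly one load feeding a load.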