By: ll (ll.delete.delete.delete@this.this.this.gmail.com), August 5, 2019 6:05 am
Travis Downs (travis.downs.delete@this.gmail.com) on August 4, 2019 3:00 pm wrote:
> ll (ll.delete.delete@this.this.gmail.com) on August 3, 2019 8:37 pm wrote:
> > the latency for skylake is strange
> > for mov r64,[m64] + mov [m64], r64, the uop should be this:
> > load r64, [m64]
> > sta
> > std
> > Since sta and std can issue in the same cycle and the load has 4 cycles of latency, assume sta and
> > std also take four cycles. After sta and std, the load of the next iteration can get its data through
> > store-to-load forwarding, so one iteration should take at most 8 cycles, but the result is 19 cycles.
> Yes, I don't know where the 19 cycles come from, or even the 10-11 cycles seen in other
> dumps. Store forwarding latency is generally between 3-6 cycles on modern Intel and the
> straightforward loop should achieve that. Maybe the loaded value is somehow used in the
> store addressing calculation which slows things down. That is, a loop like:
> mov eax, [rdi + rbx]
> mov [rdi + rbx], eax

> is quite different than:
> mov eax, [rdi + rbx]
> mov [rdi + rax], eax

> because in the second case the store address can't be calculated until eax is available,
> which itself comes from the store-load chain, and that slows things down a lot.

The latency depends on which part of the store depends on the load: the sta or the std. But in either case, 19 cycles is not a reasonable result. Zen has a latency of about 10 cycles, which seems more reasonable.
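
To make the 8-cycle estimate concrete, here is a rough sketch of the model I have in mind, in Python just for the arithmetic. The function name is made up, the 4-cycle latencies are the assumed values from my earlier post and not measurements, and the model ignores memory disambiguation stalls and replays, so treat it as a back-of-envelope estimate rather than a description of what the hardware actually does.

```python
# Back-of-envelope model of the load -> store -> load loop discussed above.
# All latencies are assumed values from the post, not measurements.

def cycles_per_iteration(load_lat, sta_lat, std_lat, addr_depends_on_load):
    """Estimate the loop-carried latency of one iteration, in cycles.

    Hypothetical model: sta and std can issue in the same cycle, and the
    next load is forwarded from the store once both have completed.
    Memory disambiguation stalls and replays are NOT modelled.
    """
    # The store data (std) always waits for the load result.
    std_done = load_lat + std_lat
    # The store address (sta) waits for the load only when the address
    # is computed from the loaded value; otherwise it does not.
    sta_done = (load_lat if addr_depends_on_load else 0) + sta_lat
    return max(sta_done, std_done)

# Case 1: mov eax,[rdi+rbx] / mov [rdi+rbx],eax  (only std depends on the load)
independent = cycles_per_iteration(4, 4, 4, addr_depends_on_load=False)
# Case 2: mov eax,[rdi+rbx] / mov [rdi+rax],eax  (sta depends on the load too)
dependent = cycles_per_iteration(4, 4, 4, addr_depends_on_load=True)

print(independent, dependent)  # prints: 8 8
```

Under these assumptions both variants come out at 8 cycles, which is why the measured 19 cycles looks wrong to me whichever part of the store depends on the load.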
