By: --- (---.delete@this.redheron.com), August 11, 2022 4:43 pm
Room: Moderated Discussions
Anon (no.delete@this.spam.com) on August 11, 2022 11:53 am wrote:
> Simon Farnsworth (simon.delete@this.farnz.org.uk) on August 11, 2022 5:26 am wrote:
> > You misunderstand me - the measured IPC will be much lower than one, because
> > there is no ILP (hence it is impossible to get an IPC greater than 1).
> >
> > I deliberately said the "maximum possible IPC", not the "measured IPC".
>
> Still wrong, instruction latency may be lower than 1, but I understand, you are talking about 0 ILP.
>
> Even that is not tied to frequency because of instruction latency (including load latency, which is quite high
> on M1 because of large L1 but could be much lower if they decided to use a smaller L1 cache), to be honest Apple
> does not attempt to decrease instruction latency and rely on ILP, but that does not prove at all that high frequency
> helps for this particular case, in fact, if someone decides to maximize performance of 0 ILP workloads a valid
> strategy would be to aim a very low frequency to reduce pipelining overhead and trying to execute multiple dependent
> instructions per clock, pipelining relies on ILP to compensate for the added overhead.
I've no interest in the bizarre contortions of this argument, but you are objectively wrong here.
M1 integer load latency is 4 cycles, pointer chasing latency is 3 cycles. There's basically zero scope to shave this if you want both a TLB and a write queue.
Apple has implemented a large number of techniques to reduce instruction latency, from aggressive fusion to zero-cycle moves and immediates to a variety of zero-cycle loads.
The only place where you can reasonably say that they have not tried to reduce latency is on the FP/SIMD side, and I've already explained that multiple times.
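For anyone who wants to check load-to-use numbers like the ones above on their own machine, here is a minimal sketch of a pointer-chasing microbenchmark in C. It is not how the figures in this post were obtained; the names (node, build_cycle, chase, ns_per_load) and the xorshift generator are illustrative choices of mine. Each load's address depends on the previous load's result, so there is no ILP to hide the latency, and the time per step approximates the load-to-load latency as long as the chain fits in L1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* One link in the chain: the loaded value is the next address to load. */
typedef struct node { struct node *next; } node;

/* Small xorshift PRNG so the layout is reproducible (illustrative choice). */
uint64_t rng_next(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Link nodes[0..n-1] into ONE random cycle of length n using Sattolo's
   algorithm, so a chase visits every node before it repeats.
   Returns 0 on success, -1 if the temporary index array cannot be allocated. */
int build_cycle(node *nodes, size_t n, uint64_t seed) {
    if (n == 0) return 0;
    size_t *perm = malloc(n * sizeof *perm);
    if (perm == NULL) return -1;
    uint64_t state = seed ? seed : 1;   /* xorshift state must be nonzero */
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(rng_next(&state) % i);   /* j in [0, i-1] */
        size_t tmp = perm[i];
        perm[i] = perm[j];
        perm[j] = tmp;
    }
    for (size_t i = 0; i < n; i++) nodes[i].next = &nodes[perm[i]];
    free(perm);
    return 0;
}

/* Follow the chain for `steps` dependent loads and return where we end up. */
const node *chase(const node *p, size_t steps) {
    while (steps--) p = p->next;
    return p;
}

/* Average nanoseconds per dependent load. clock() is coarse, so use a large
   step count (hundreds of millions) for a meaningful number. */
double ns_per_load(const node *start, size_t steps) {
    assert(steps > 0);
    clock_t t0 = clock();
    const node *volatile sink = chase(start, steps);  /* keep the loop alive */
    clock_t t1 = clock();
    (void)sink;
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)steps;
}
```

To use it, allocate an array of nodes small enough to sit in L1, call build_cycle on it, then call ns_per_load with a large step count. Multiplying the result by the core clock in GHz gives cycles per load. Treat the result as approximate: frequency scaling, loop overhead, and the coarse timer all add noise.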