By: anon2 (anon.delete@this.anon.com), August 11, 2022 6:25 pm
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on August 11, 2022 5:43 pm wrote:
> Anon (no.delete@this.spam.com) on August 11, 2022 11:53 am wrote:
> > Simon Farnsworth (simon.delete@this.farnz.org.uk) on August 11, 2022 5:26 am wrote:
> > > You misunderstand me - the measured IPC will be much lower than one, because
> > > there is no ILP (hence it is impossible to get an IPC greater than 1).
> > >
> > > I deliberately said the "maximum possible IPC", not the "measured IPC".
> >
> > Still wrong: instruction latency may be lower than 1 cycle, but I understand that you are talking about 0 ILP.
> >
> > Even that is not tied to frequency, because of instruction
> > latency (including load latency, which is quite high
> > on M1 because of its large L1, but could be much lower if they
> > decided to use a smaller L1 cache). To be honest, Apple
> > does not attempt to decrease instruction latency and relies on
> > ILP, but that does not prove at all that high frequency
> > helps in this particular case. In fact, if someone decided
> > to maximize performance of 0 ILP workloads, a valid
> > strategy would be to aim for a very low frequency to reduce pipelining
> > overhead and try to execute multiple dependent
> > instructions per clock. Pipelining relies on ILP to compensate for the added overhead.
>
> I've no interest in the bizarre contortions of this argument, but you are objectively wrong here.
> M1 integer load latency is 4 cycles, pointer chasing latency is 3 cycles. There's
> basically zero scope to shave this if you want both a TLB and a write queue.
Wrong. POWER7 had a 2-cycle load-to-use latency at 4.31 GHz on a 45 nm process.
The answer is simply that it has not proven worthwhile for perf/watt to drive this latency down that far.
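
To put the cycle counts above on a common footing, here is a rough sketch that converts load-to-use cycles into wall-clock time for a fully dependent (0 ILP) chain of loads. The POWER7 figures and the M1 cycle counts are the ones quoted in this thread; the 3.2 GHz M1 clock is my own assumption and not something stated above, and the function names are just made up for illustration.

```python
def load_to_use_ns(cycles: int, freq_ghz: float) -> float:
    """Wall-clock latency of one dependent load, in nanoseconds.

    One cycle lasts 1 / freq_ghz nanoseconds, so latency = cycles / freq_ghz.
    """
    return cycles / freq_ghz


def chain_time_ns(n_loads: int, cycles: int, freq_ghz: float) -> float:
    """Time for n_loads back-to-back dependent loads (no overlap possible)."""
    return n_loads * load_to_use_ns(cycles, freq_ghz)


# (load-to-use cycles, clock in GHz)
POWER7 = (2, 4.31)            # figures as quoted in this thread
M1_POINTER_CHASE = (3, 3.2)   # 3 cycles from the thread; 3.2 GHz is an assumption
M1_INTEGER_LOAD = (4, 3.2)    # 4 cycles from the thread; 3.2 GHz is an assumption

if __name__ == "__main__":
    for name, (cycles, ghz) in [
        ("POWER7", POWER7),
        ("M1 pointer chase", M1_POINTER_CHASE),
        ("M1 integer load", M1_INTEGER_LOAD),
    ]:
        per_load = load_to_use_ns(cycles, ghz)
        total = chain_time_ns(1000, cycles, ghz)
        print(f"{name}: {per_load:.3f} ns per load, {total:.0f} ns per 1000 dependent loads")
```

This only models the latency-bound case under discussion; it says nothing about perf/watt, which is the reason given above for not chasing a lower load-to-use latency.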