By: --- (---.delete@this.redheron.com), August 11, 2022 7:07 pm
Room: Moderated Discussions
Anon (no.delete@this.spam.com) on August 11, 2022 6:20 pm wrote:
> --- (---.delete@this.redheron.com) on August 11, 2022 5:43 pm wrote:
> > M1 integer load latency is 4 cycles, pointer chasing latency is 3 cycles.
>
> 4 cycles at 3GHz is high latency in my book.
>
> > There's
> > basically zero scope to shave this if you want both a TLB and a write queue.
>
> Already done.
>
> > Apple has implemented a large number of techniques to reduce instruction latency, from aggressive
> > fusion to zero cycle moves and immediates to a variety of zero cycle loads.
>
> Good, but everybody does instruction fusion and move elimination these days. Apple does not
> implement aggressive forms of latency reduction like 0.5-cycle ALU instructions.
>
Apple APPEARS to be planning to fuse arithmetic+logic operations. The infrastructure is present in the A14 LLVM check-ins, but does not appear to be implemented as of M1.
This would give you effectively 0.5-cycle ALU operations, since a dependent arithmetic+logic pair would complete in a single cycle...
Of course, who ever knows the long-term plan, but I see this as preparing the compiler in advance, so that when this is added (A15/M2? A16/M3?) there's already a large body of binaries optimized to take advantage of it.
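To make the arithmetic concrete, here is a rough sketch of why fusing a dependent arithmetic+logic pair works out to 0.5 cycles per operation on a serial dependency chain. It is a toy model, not a description of Apple's hardware: the function names and the FUSIBLE_PAIRS set are made up for illustration, and every ALU op is assumed to take exactly one cycle.

```python
# Toy model: latency of a serial dependency chain of single-cycle ALU ops.
# FUSIBLE_PAIRS is a hypothetical list, NOT Apple's actual fusion rules.
FUSIBLE_PAIRS = {("add", "and"), ("add", "eor"), ("sub", "and")}


def chain_latency_cycles(ops, fused=False):
    """Return total cycles for a chain where each op depends on the previous one.

    With fused=True, an adjacent (arithmetic, logic) pair listed in
    FUSIBLE_PAIRS is assumed to execute as one single-cycle operation.
    """
    cycles = 0
    i = 0
    while i < len(ops):
        if fused and i + 1 < len(ops) and (ops[i], ops[i + 1]) in FUSIBLE_PAIRS:
            i += 2  # the pair is consumed together
        else:
            i += 1
        cycles += 1  # each issued (possibly fused) op costs one cycle
    return cycles


def cycles_to_ns(cycles, freq_ghz):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles / freq_ghz


chain = ["add", "and"] * 4  # 8 dependent operations
unfused = chain_latency_cycles(chain)
fused = chain_latency_cycles(chain, fused=True)
per_op_fused = fused / len(chain)

print(f"unfused: {unfused} cycles, fused: {fused} cycles, "
      f"{per_op_fused} cycles per op when fused")
print(f"4-cycle load at 3 GHz: {cycles_to_ns(4, 3.0):.2f} ns")
```

The same helper puts the load latency from earlier in the thread in wall-clock terms: 4 cycles at 3 GHz is about 1.33 ns. Chains that do not alternate fusible pairs get less benefit, which is one reason having the compiler schedule for fusion ahead of time would matter.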