By: noko (noko.delete@this.noko.com), August 15, 2022 3:37 pm
Room: Moderated Discussions
Wilco (wilco.dijkstra.delete@this.ntlworld.com) on August 15, 2022 2:00 pm wrote:
> Etienne (etienne_lorrain.delete@this.yahoo.fr) on August 15, 2022 7:21 am wrote:
> > Wilco (wilco.dijkstra.delete@this.ntlworld.com) on August 15, 2022 5:21 am wrote:
> > > Let me give you an example to show how wrong this is. Math functions require many double precision
> > > constants for the polynomial approximation. On AArch64 creating these constants in the instruction
> > > stream would require 4 MOV/MOVKs and 1 FMOV per constant - throughput is 1 per cycle on a wide
> > > core like Neoverse V1. Loading the immediates from memory is done at 6 per cycle.
> >
> > I think nobody really claimed that loading immediates from the code cache is faster,
> > but most people claim that pre-loading the right data cacheline is very difficult.
>
> Why would someone claim it is better to place large immediates in
> the instruction stream if they actually believe that is slower?!?
>
> Anyway there is nothing hard about 'preloading' immediates - they will be in the cache
> after the code has been executed once. While any cache could miss at any time, they
> work extremely well most of the time (an L1 miss/L2 hit is just 5-6 extra cycles).
> Branch prediction is much harder, and yet modern CPUs are pretty good at it.
>
> > Code cachelines are loaded ahead of need-time by a different system, not only when the instruction
> > decoder sees a jump at some address, but also by "load the next cacheline" just in case.
>
> Fetching more code still means more fetch cycles - there is no free lunch.
5-6 cycles is 20-48 instructions on Neoverse V1 at 4-8 instructions per cycle, which is more than enough room to fit several 64-bit immediates in the instruction stream.
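To make the comparison concrete, here's a minimal sketch of the two ways to get a double constant into a register on AArch64 (the pi constant and the .Lpi label are just examples I made up, not from anyone's actual code):

    // Materializing pi (bit pattern 0x400921FB54442D18) in the instruction
    // stream, as Wilco describes: 4 MOV/MOVKs plus an FMOV.
    mov   x0, #0x2d18
    movk  x0, #0x5444, lsl #16
    movk  x0, #0x21fb, lsl #32
    movk  x0, #0x4009, lsl #48
    fmov  d0, x0              // 5 instructions, no data-cache access

    // Versus a PC-relative load from a literal pool:
    ldr   d0, .Lpi            // 1 instruction, 1 data-cache access
    ...
.Lpi:
    .double 3.141592653589793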
"But reordering" - it's *much* easier for a CPU to reorder immediate movs to issue immediately after rename; they obviously have no dependency on previous instructions. Whereas determining that a load can be reordered before a previous store is harder, to say the least. So for a lot of code, immediate movs can have effectively zero latency, where loading a constant might have enough latency to actually matter.
Yes, most CPUs can sustain a higher throughput of 64-bit loads than of immediate MOV sequences. But how often is code throughput actually limited by fetch/decode before it has been hand-optimized to that point?