By: Maynard Handley (name99.delete@this.name99.org), October 2, 2015 1:25 pm
Room: Moderated Discussions
Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on October 2, 2015 1:48 pm wrote:
> David Kanter (dkanter.delete@this.realworldtech.com) on October 1, 2015 4:49 pm wrote:
> > Maynard Handley (name99.delete@this.name99.org) on October 1, 2015 1:57 pm wrote:
>
> > > Of course Apple [and ARM in general] have the advantage of load-store pair which isn't
> > > perfect (eg it's not going to help your vector throughput) but certainly helps in a large
> > > set of common cases, and so reduces the pressure to amp up to Intel's 2+1 load/store.)
> >
> > Uh, LDP and STP mean it's more likely that you want multiple load/store
> > units. I'm not 100% sure of the semantics and benefits.
>
> No - it means you can achieve twice the bandwidth from a single load/store unit, and thus reduce
> the need to add another. Many of the workloads in SPEC2006 do simple stride-1 accesses, so
> LDP/STP is extremely effective there - besides making function entry/exit and all the memcpy
> and string functions efficient of course. There is also the advantage of fewer instructions
> to decode, rename and execute, so even a design with 2 load/store units benefits.
>
> > For example, what happens if the pair loads target different pages?
> > You'd need to do two separate translations through the TLB.
>
> Then the same thing happens as for any other load or store that crosses a page
> or cacheline boundary. If you used 2 separate loads, you now have 3 accesses
> for the split case rather than 2, so a wider load is always cheaper.
>
> Wilco
>
Just to add one interesting fact. I said above that load-pair wouldn't help your vector throughput, but it turns out I was wrong. Pair loads apply not just to integer registers but also to FP and even 128-bit SIMD registers.
They can also be used with the address-update (writeback) modes, which presumably generate THREE outputs: the two loaded registers plus the updated base register. Jesus, ARM!
I assume those variants ARE cracked, and the assumption/hope is that the extra address register update can slide in somewhere given the width of the CPU execution engines.
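To make the "three outputs" point concrete, here's a rough C model of what a post-indexed pair load of two 64-bit registers has to produce architecturally. The struct and function names are made up for illustration, and this says nothing about how any real core cracks the instruction; it only counts the results.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for everything one post-indexed pair load writes. */
struct ldp_result {
    uint64_t rt1;        /* first destination register        */
    uint64_t rt2;        /* second destination register       */
    const uint8_t *base; /* base register after the writeback */
};

/* Models a post-indexed pair load: read 8 bytes at base and 8 bytes at
 * base + 8, then advance the base register by imm. Three register results. */
static struct ldp_result ldp_post_index(const uint8_t *base, int64_t imm)
{
    struct ldp_result r;
    assert(base != NULL);
    memcpy(&r.rt1, base, sizeof r.rt1);     /* output 1: first load     */
    memcpy(&r.rt2, base + 8, sizeof r.rt2); /* output 2: second load    */
    r.base = base + imm;                    /* output 3: address update */
    return r;
}
```

A plain (non-writeback) pair load would just drop the third field, which is why the update variants look like the natural candidates for an extra micro-op.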
There are also non-temporal variants (LDNP/STNP), relevant to those simple stride-1 SPEC2006 accesses, which I assume will at some point skip allocating in the L1 and maybe also in the L2. No idea if any compilers use those yet, or if any CPUs handle them specially.
And all of the above holds likewise for stores.
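Going back to Wilco's point about split accesses: a quick sketch that counts how many line-sized accesses a 16-byte pair load needs versus two separate 8-byte loads over the same bytes. The 64-byte line size and the helper names are my assumptions, and real hardware may merge or split accesses differently.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache-line size; 64 bytes is common but not universal. */
#define LINE_BYTES 64u

/* Number of distinct lines touched by one access of `size` bytes at `addr`. */
static unsigned lines_touched(uint64_t addr, unsigned size)
{
    assert(size > 0);
    uint64_t first = addr / LINE_BYTES;
    uint64_t last  = (addr + size - 1) / LINE_BYTES;
    return (unsigned)(last - first + 1);
}

/* One pair load of two 8-byte registers: a single 16-byte access. */
static unsigned pair_load_accesses(uint64_t addr)
{
    return lines_touched(addr, 16);
}

/* Two separate 8-byte loads covering the same 16 bytes. */
static unsigned two_load_accesses(uint64_t addr)
{
    return lines_touched(addr, 8) + lines_touched(addr + 8, 8);
}
```

In this model an address of 60 gives 2 accesses for the pair and 3 for the two separate loads, which is the split case Wilco describes; an aligned address gives 1 versus 2. Setting LINE_BYTES to 4096 gives the same counts for page crossings.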