By: noko (noko.delete@this.noko.com), October 2, 2015 5:45 pm
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on October 2, 2015 2:25 pm wrote:
> Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on October 2, 2015 1:48 pm wrote:
> > David Kanter (dkanter.delete@this.realworldtech.com) on October 1, 2015 4:49 pm wrote:
> > > Maynard Handley (name99.delete@this.name99.org) on October 1, 2015 1:57 pm wrote:
> >
> > > > Of course Apple [and ARM in general] have the advantage of load-store pair which isn't
> > > > perfect (eg it's not going to help your vector throughput) but certainly helps in a large
> > > > set of common cases, and so reduces the pressure to amp up to Intel's 2+1 load/store.)
> > >
> > > Uh, LDP and STP mean it's more likely that you want multiple load/store
> > > units. I'm not 100% sure of the semantics and benefits.
> >
> > No - it means you can achieve twice the bandwidth from a single load/store unit, and thus reduce
> > the need to add another. Many of the workloads in SPEC2006 do simple stride-1 accesses, so
> > LDP/STP is extremely effective there - besides making function entry/exit and all the memcpy
> > and string functions efficient of course. There is also the advantage of fewer instructions
> > to decode, rename and execute, so even a design with 2 load/store units benefits.
> >
> > > For example, what happens if the pair loads target different pages?
> > > You'd need to do two separate translations through the TLB.
> >
> > Then the same thing happens as for any other load or store that crosses a page
> > or cacheline boundary. If you used 2 separate loads, you now have 3 accesses
> > for the split case rather than 2, so a wider load is always cheaper.
> >
> > Wilco
> >
>
> Just to add one interesting fact. I said above that load-pair wouldn't help your
> vector throughput, but turns out I'm wrong. Pair loads can be applied not just
> to integer registers but also to FP and even 128-bit SIMD registers.
> They can also be used with the address-update modes, which presumably generate THREE outputs. Jesus, ARM!
Vector load/store instructions in arm64 can read or write up to 4 vector registers in a single instruction, and can generate a surprising number of uops.
For Cortex-A57, the worst case is probably the Q-form 4-register ST4 variant with writeback and 8- to 32-bit elements. It generates 8 store uops (the store pipeline appears to be 64 bits wide), 8 vector permute uops (probably; I'm assuming it decodes to the equivalent of 8x zip1/zip2), plus the writeback uop, for 17 uops from a single instruction!
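To make the arithmetic explicit, here is a rough back-of-the-envelope sketch in Python. The function name and parameters are made up for illustration; the 64-bit store pipeline width and the 8 permute uops are the guesses stated above, not documented figures.

```python
def st4_uop_estimate(num_regs=4, reg_bits=128, store_pipe_bits=64,
                     permute_uops=8, writeback=True):
    """Rough uop count for a Q-form ST4 with writeback.

    Assumptions (guesses from the post, not vendor-confirmed):
      - the store pipeline moves store_pipe_bits per uop
      - the interleave costs permute_uops (the 8x zip1/zip2 guess)
      - address writeback costs one extra uop
    """
    total_bits = num_regs * reg_bits                 # 4 x 128 = 512 bits stored
    store_uops = -(-total_bits // store_pipe_bits)   # ceiling division
    writeback_uops = 1 if writeback else 0
    return {
        "store": store_uops,
        "permute": permute_uops,
        "writeback": writeback_uops,
        "total": store_uops + permute_uops + writeback_uops,
    }


if __name__ == "__main__":
    print(st4_uop_estimate())
```

With the defaults this gives 8 store uops plus 8 permutes plus 1 writeback, i.e. the 17 quoted above. If the store pipeline were actually 128 bits wide, the same model would give 4 store uops instead.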