By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), October 2, 2015 4:06 pm
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on October 2, 2015 4:15 pm wrote:
> Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on October 2, 2015 3:59 pm wrote:
> > David Kanter (dkanter.delete@this.realworldtech.com) on October 2, 2015 2:59 pm wrote:
> > > Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on October 2, 2015 1:48 pm wrote:
> > > > David Kanter (dkanter.delete@this.realworldtech.com) on October 1, 2015 4:49 pm wrote:
> > > > > Maynard Handley (name99.delete@this.name99.org) on October 1, 2015 1:57 pm wrote:
> > > >
> > > > > > Of course Apple [and ARM in general] have the advantage of load-store pair which isn't
> > > > > > perfect (eg it's not going to help your vector throughput) but certainly helps in a large
> > > > > > set of common cases, and so reduces the pressure to amp up to Intel's 2+1 load/store.)
> > > > >
> > > > > Uh, LDP and STP mean it's more likely that you want multiple load/store
> > > > > units. I'm not 100% sure of the semantics and benefits.
> > > >
> > > > No - it means you can achieve twice the bandwidth from a single load/store unit, and thus reduce
> > > > the need to add another.
> > >
> > > Can LDP/STP only target integer registers? The load/store units are
> > > only 128b wide, so you can't really do a double vector load.
> >
> > No, LDP/STP support integer, FP and vector registers. You can use LDP and STP even
> > if a CPU can't do it in a single cycle. This means it will run faster on some implementations
> > while having no negative effect on implementations that can't.
>
> Thanks for clearing that up!
>
> > > > Many of the workloads in SPEC2006 do simple stride-1 accesses, so
> > > > LDP/STP is extremely effective there -
> > >
> > > What % of loads and stores are pairs?
> >
> > I don't have detailed figures, but it's surprisingly common in hot loops in many benchmarks.
>
> I'm more curious about something like GCC that at least resembles normal
> code. Using LDP in Geekbench isn't actually helpful for real code.
GCC does do a lot of function calls. I'm not sure whether there are performance counters that can distinguish LDP from ordinary loads, but since GCC is not loop-heavy, a static count should give a reasonable idea anyway.
> > > >besides making function entry/exit and all the memcpy
> > > > and string functions efficient of course.
> > >
> > > Memory and string functions should be using 128b vectors anyway. I see the
> > > biggest use case being for function calls and exits. But I could be convinced
> > > otherwise with data (or at least a better explanation of how it works).
> >
> > Vector instructions have much higher latencies than integer ones, and most strings
> > are fairly short, so they are not always the best option for string/mem functions.
>
> That's fair. But for a memcopy, I'd expect that vector ops are the way to go if you need to clear a page.
For clearing there is a dedicated instruction (DC ZVA) - current cores clear 64-128 bytes per instruction, as fast as the L1 cache can write back into L2.
> > > > There is also the advantage of fewer instructions
> > > > to decode, rename and execute, so even a design with 2 load/store units benefits.
> > >
> > > Does it truly decode into 1 uop on modern uarchs? I am a bit skeptical, since
> > > it targets two registers, and there is the potential for line/page crossing.
> >
> > Yes the most common cases do indeed. Eg. Cortex-A57 can execute a long sequence of LDP's
> > of 2 64-bit registers at 1 cycle per instruction as long as it hits L1 and doesn't cross
> > a cacheline boundary. There aren't even any penalties for unaligned accesses.
>
> Wait. How does the decoder detect common cases? That doesn't make sense.
The decoder can decide whether to split based on writeback and/or the width and type of the access. An LDP of two 32/64-bit registers is the most common case, so Cortex-A57 executes 32/64-bit LDPs in 1 cycle. A 128-bit LDP takes 2 cycles (the optimization guide doesn't say whether that is 1 or 2 micro-ops).
> It sounds like there is some logic in the pipeline for replays.
Every OoO core must support replays to deal with L1 and TLB misses. Doing a pipeline flush would be insane.
> > > > > For example, what happens if the pair loads target different pages?
> > > > > You'd need to do two separate translations through the TLB.
> > > >
> > > > Then the same thing happens as for any other load or store that crosses a page
> > > > or cacheline boundary. If you used 2 separate loads, you now have 3 accesses
> > > > for the split case rather than 2, so a wider load is always cheaper.
> > >
> > > My point is that handling this as two uops instead of one
> > > is much easier. I suspect the semantics for LDP/STP
> > > specifically do not guarantee atomic behavior, which makes me suspect it decodes into two uops.
> >
> > An implementation is allowed to decode into 2 uops indeed, but it would be a bad idea
> > to split 32-bit or 64-bit LDP/STP if you can already deal with 128-bit Q registers.
>
> Not all implementations have full width 128b datapath from the L1D.
Low-end implementations won't indeed, but they are less likely to be OoO.
Wilco