By: David Kanter (dkanter.delete@this.realworldtech.com), October 2, 2015 1:59 pm
Room: Moderated Discussions
Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on October 2, 2015 1:48 pm wrote:
> David Kanter (dkanter.delete@this.realworldtech.com) on October 1, 2015 4:49 pm wrote:
> > Maynard Handley (name99.delete@this.name99.org) on October 1, 2015 1:57 pm wrote:
>
> > > Of course Apple [and ARM in general] have the advantage of load-store pair which isn't
> > > perfect (eg it's not going to help your vector throughput) but certainly helps in a large
> > > set of common cases, and so reduces the pressure to amp up to Intel's 2+1 load/store.)
> >
> > Uh, LDP and STP mean it's more likely that you want multiple load/store
> > units. I'm not 100% sure of the semantics and benefits.
>
> No - it means you can achieve twice the bandwidth from a single load/store unit, and thus reduce
> the need to add another.
Can LDP/STP only target integer registers? The load/store units are only 128b wide, so you can't really do a double vector load.
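For concreteness, these are the forms I'm thinking about (just a sketch to show the register widths involved; I haven't checked exactly which pair forms the ISA provides):

    ldp x0, x1, [x2]        // pair of 64-bit integer registers: 16 bytes total
    stp x3, x4, [x2, #16]   // same width on the store side
    ldp q0, q1, [x2]        // a pair of 128-bit SIMD registers would be 32 bytes,
                            // more than a 128b-wide load/store unit can move per pass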
> Many of the workloads in SPEC2006 do simple stride-1 accesses, so
> LDP/STP is extremely effective there -
What % of loads and stores are pairs?
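To be clear about what I'm asking: for a stride-1 loop I assume the pairing looks something like this (my sketch, not measured compiler output):

    loop:
        ldp x3, x4, [x1], #16   // load a[i] and a[i+1], advance the pointer
        add x5, x5, x3          // consume both elements
        add x5, x5, x4
        subs x2, x2, #2         // two elements per iteration
        b.ne loop

The question is how often compilers actually emit that pattern.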
> besides making function entry/exit and all the memcpy
> and string functions efficient of course.
Memory and string functions should be using 128b vectors anyway. I see the biggest use case as function entry/exit. But I could be convinced otherwise with data (or at least a better explanation of how it works).
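For reference, the entry/exit pattern I have in mind, plus what I'd expect a memcpy-style inner loop to look like if it uses 128b vectors (sketches, not actual compiler or library output):

    // typical function prologue/epilogue using register pairs:
    stp x29, x30, [sp, #-32]!   // save frame pointer and link register
    stp x19, x20, [sp, #16]     // save a pair of callee-saved registers
    // ... function body ...
    ldp x19, x20, [sp, #16]
    ldp x29, x30, [sp], #32
    ret

    // memcpy-style inner loop built on 128b vector accesses
    // (assumes the byte count in x2 is a multiple of 16):
    copy16:
        ldr q0, [x1], #16       // load 16 bytes, advance source
        str q0, [x0], #16       // store 16 bytes, advance destination
        subs x2, x2, #16
        b.ne copy16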
> There is also the advantage of fewer instructions
> to decode, rename and execute, so even a design with 2 load/store units benefits.
Does it truly decode into 1 uop on modern uarchs? I am a bit skeptical, since it targets two registers, and there is the potential for line/page crossing.
> > For example, what happens if the pair loads target different pages?
> > You'd need to do two separate translations through the TLB.
>
> Then the same thing happens as for any other load or store that crosses a page
> or cacheline boundary. If you used 2 separate loads, you now have 3 accesses
> for the split case rather than 2, so a wider load is always cheaper.
My point is that handling this as two uops instead of one is much easier. I suspect the LDP/STP semantics specifically do not guarantee that the pair is accessed atomically, which suggests it decodes into two uops.
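In other words, I'd expect something like this internally (pure speculation on my part):

    ldp x0, x1, [x2]    // architectural instruction
    // plausibly cracked at decode into two independent load uops:
    //   uop 1:  x0 <- mem[x2]
    //   uop 2:  x1 <- mem[x2 + 8]
    // each uop does its own address generation, TLB lookup and cache access,
    // so a page-crossing pair just looks like two ordinary loads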
David