By: David Kanter (dkanter.delete@this.realworldtech.com), October 1, 2015 3:49 pm
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on October 1, 2015 1:57 pm wrote:
> David Kanter (dkanter.delete@this.realworldtech.com) on October 1, 2015 10:01 am wrote:
> > Wouter Tinus (wouter.tinus.delete@this.gmail.com) on September 30, 2015 3:14 pm wrote:
> > > It seems easy to argue that Skylake is a 5-wide or even 6-wide machine.
> > >
> > > - 5 wide decode
> > > - 6 wide allocation/decoder queue
> > > - 6 wide ROB
> > > - 8 wide issue
> > > - 8 wide retire (4/thread)
> > >
> > > Though Haswell already added two extra issue ports, this is the first real increase in width
> > > since the introduction of Merom back in 2006. Yet they didn't even bother to mention it at IDF :(
> >
> > Actually, I think Sandy Bridge and Haswell were more significant.
> >
> > It's nice to have more ALUs, but what really matters are the load/store units. Having 10 ALUs with 1 LD/ST
> > unit is really pointless, except on code with insanely high compute:memory ratios (which isn't most code).
> >
> > For a general purpose CPU, I'd focus on getting the load/store right first, then focus on the ALUs.
>
> I'm missing your point here, David. Is this sarcasm, or a dig at another CPU?
> Hasn't Intel had 2 load/1 store per cycle since, what, Sandy Bridge?
My statement above was serious. It is certainly negative commentary on the many CPUs that have crappy load/store units, but it wasn't a dig at any one in particular. I could point to a number of CPUs with crappy memory hierarchies, starting with the P4 and Bulldozer.
> (FWIW I agree with you that load/store matters. I suspect that's a bottleneck Apple will tackle in the future
> moving from their current 2 loads or 1 load/1 store [I don't think they support 2 store/cycle];
2 stores/cycle is pretty expensive, and I suspect there is lower-hanging fruit for Apple. Also, stores are very expensive in terms of coherency/consistency/ordering.
> but since they're more concerned with power than Intel, getting to that point may require their
> swapping out the traditional style load-store queues (associative and so expensive) with the
> sort of "indexed" queues that have been suggested as one component of kilo-instruction class
> machines. This would be a venture into somewhat uncharted territory [I don't think anyone has
> commercialized these ideas yet] so I suspect they won't go there until they have to.
What kind of indexed queues? Honestly, getting away from CAMs in the load/store unit seems damn hard when you want low latency. If you want store forwarding, you simply have to do something like a CAM check.
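To make the CAM point concrete, here is a rough software sketch of the lookup a load has to do for store forwarding. It is only an illustration: `StoreEntry`, `forward_from_store_queue` and the "forward"/"stall"/"miss" results are names made up for this example, and the loop stands in for what hardware would do as a parallel address compare across every queue entry followed by a youngest-match select.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StoreEntry:
    """One in-flight store (hypothetical layout, for illustration only)."""
    addr: int  # byte address of the store
    size: int  # number of bytes written
    data: int  # value to be written, as a little-endian integer


def forward_from_store_queue(
    store_queue: List[StoreEntry], load_addr: int, load_size: int
) -> Tuple[str, Optional[int]]:
    """Search the stores older than a load, youngest first.

    store_queue is ordered oldest to youngest and is assumed to hold only
    stores that precede the load in program order.
    """
    load_end = load_addr + load_size
    for entry in reversed(store_queue):
        entry_end = entry.addr + entry.size
        # This address-range compare is the part a CAM does for all entries at once.
        if not (entry.addr < load_end and load_addr < entry_end):
            continue
        if entry.addr <= load_addr and load_end <= entry_end:
            # The youngest matching store fully covers the load: forward its bytes.
            shift = 8 * (load_addr - entry.addr)
            mask = (1 << (8 * load_size)) - 1
            return ("forward", (entry.data >> shift) & mask)
        # Partial overlap: this sketch just gives up and reports a stall.
        return ("stall", None)
    # No older store touches these bytes, so the load reads the cache.
    return ("miss", None)
```

The match is on address ranges rather than on a queue position, which is why a purely indexed structure is hard to substitute here without some other way of predicting which store a load depends on.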
> Of course Apple [and ARM in general] have the advantage of load-store pair which isn't
> perfect (eg it's not going to help your vector throughput) but certainly helps in a large
> set of common cases, and so reduces the pressure to amp up to Intel's 2+1 load/store.)
Uh, LDP and STP mean it's more likely that you want multiple load/store units. I'm not 100% sure of the semantics and benefits.
For example, what happens if the pair loads target different pages? You'd need to do two separate translations through the TLB.
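As a small illustration of the page-crossing case, the sketch below counts how many pages a pair access touches. It assumes 4 KiB pages and treats the pair as one contiguous access of twice the register size (16 bytes for two 64-bit registers); `pages_touched` and `translations_needed` are hypothetical helper names, not anything from a real toolchain.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pages_touched(addr: int, total_bytes: int, page_size: int = PAGE_SIZE) -> list:
    """Return the page numbers covered by an access of total_bytes starting at addr."""
    first_page = addr // page_size
    last_page = (addr + total_bytes - 1) // page_size
    return list(range(first_page, last_page + 1))


def translations_needed(addr: int, reg_bytes: int = 8, page_size: int = PAGE_SIZE) -> int:
    """Number of distinct page translations a pair load or store would need.

    The pair is modeled as a single contiguous access of 2 * reg_bytes.
    """
    return len(pages_touched(addr, 2 * reg_bytes, page_size))
```

For example, `translations_needed(0x1FF8)` is 2 because the 16 bytes run from 0x1FF8 to 0x2007, while `translations_needed(0x1FF0)` is 1. A pair access aligned to its full 16-byte size can never straddle a page, so the two-translation case only shows up for less-aligned addresses.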
David