By: Patrick Chase (patrickjchase.delete@this.gmail.com), October 1, 2015 4:38 pm
Room: Moderated Discussions
juanrga (nospam.delete@this.juanrga.com) on October 1, 2015 11:50 am wrote:
> David Kanter (dkanter.delete@this.realworldtech.com) on October 1, 2015 10:01 am wrote:
> > It's nice to have more ALUs, but what really matters are the load/store units. Having 10
> > ALUs with 1 LD/ST unit is really pointless, except on code with insanely high
> > compute:memory ratios (which isn't most code).
> >
> > For a general purpose CPU, I'd focus on getting the load/store right first, then focus
> > on the ALUs.
>
> The number and class of optimal units is a very interesting problem, but in
> general ALUs outperform number of L/S units, because it is more probable your code
> contains sections of intensive compute than otherwise.
This is utter nonsense. Most "compute intensive" sections actually contain a LOT of loads/stores. In a previous job I implemented workloads that were about as "compute intensive" as they come on a 4-wide machine that could do 1 L/S per clock. It had 64 architectural regs (renaming doesn't help you here), which is about as good as it gets in terms of avoiding excess memory ops due to spills/fills. Even so, almost every nontrivial workload ended up constrained by L/S bandwidth, and I ended up spending a lot of time implementing in-register blocking schemes etc.
Gabriele has similar horror stories from working with the same architecture.
IMO even a 3:1 ALU:mem ratio is uncomfortable, and 2:1 is the practical minimum of L/S provisioning to avoid being memory-op-bound.
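
To put rough numbers on the in-register blocking point above, here is a minimal sketch. The workload (a register-blocked matrix-multiply inner loop) and the split of the 4 issue slots into 3 ALU + 1 L/S are my assumptions for illustration; only "4-wide, 1 L/S per clock, 64 architectural regs" comes from the description above. The function names are hypothetical.

```python
def block_ops(m, n):
    """Per-iteration op counts for an m x n register-blocked matmul inner loop.

    Assumed kernel: each iteration loads m values from A and n values from B,
    then does one multiply-accumulate into each of the m*n accumulators.
    """
    loads = m + n           # operand loads per iteration
    alu = m * n             # multiply-accumulates per iteration
    regs = m * n + m + n    # accumulators plus operand registers
    return loads, alu, regs


def cycles_per_step(loads, alu, alu_slots=3, ls_slots=1):
    """Lower bound on cycles per iteration and which resource sets it.

    Assumes 3 ALU slots and 1 L/S slot per clock (an assumed split of a
    4-wide machine) and ignores latency, stores, and loop overhead.
    """
    ls_cycles = -(-loads // ls_slots)    # ceiling division
    alu_cycles = -(-alu // alu_slots)
    if ls_cycles > alu_cycles:
        bound = "L/S"
    elif alu_cycles > ls_cycles:
        bound = "ALU"
    else:
        bound = "balanced"
    return max(ls_cycles, alu_cycles), bound


if __name__ == "__main__":
    for m, n in [(1, 1), (2, 2), (4, 4), (6, 6), (7, 7)]:
        loads, alu, regs = block_ops(m, n)
        cycles, bound = cycles_per_step(loads, alu)
        print(f"{m}x{n}: {alu} ALU ops, {loads} loads, "
              f"{regs} regs, >= {cycles} cycles, {bound}-bound")
```

Under these assumptions the unblocked 1x1 loop and even a 4x4 block stay L/S-bound, and the blocking only balances out at around 6x6, which already needs 48 of the 64 registers. That is before counting stores, address registers, or any spills.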