By: juanrga (nospam.delete@this.juanrga.com), October 2, 2015 5:14 am
Room: Moderated Discussions
Patrick Chase (patrickjchase.delete@this.gmail.com) on October 1, 2015 5:38 pm wrote:
> juanrga (nospam.delete@this.juanrga.com) on October 1, 2015 11:50 am wrote:
> > David Kanter (dkanter.delete@this.realworldtech.com) on October 1, 2015 10:01 am wrote:
> > > It's nice to have more ALUs, but what really matters are the load/store units. Having 10
> > > ALUs with 1 LD/ST unit is really pointless, except on code with insanely high
> > > compute:memory ratios (which isn't most code).
> > >
> > > For a general purpose CPU, I'd focus on getting the load/store right first, then focus
> > > on the ALUs.
> >
> > The number and class of optimal units is a very interesting problem, but in
> > general ALUs outperform number of L/S units, because it is more probable your code
> > contains sections of intensive compute than otherwise.
>
> This is utter nonsense. Most "compute intensive" sections actually contain a LOT of loads/stores.
> In a previous job I implemented loads that were about as "compute intensive" as they come on
> a 4-wide machine that could do 1 L/S per clock. It has 64 architectural regs (renaming doesn't
> help you here), which is about as good as it gets in terms of avoiding excess memory ops due
> to spills/fills. Even so almost every nontrivial workload ended up constrained by L/S bandwidth,
> and I ended up spending a lot of time implementing in-register blocking schemes etc.
>
> Gabriele has similar horror stories from working with the same architecture.
>
> IMO even a 3:1 alu:mem ratio is uncomfortable, and 2:1
> is a practical minimum to avoid being memory-op-bound.
>
LOL. I didn't mean that there aren't lots of loads/stores; I meant that a given stretch of code is more likely to contain more compute operations than load/store operations, and indeed every design I know of reflects this asymmetry in its execution units.
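To give a concrete (toy) illustration of what I mean, and assuming the compiler keeps the eight coefficients and the accumulator in registers, a kernel like the one below issues far more arithmetic than memory operations per iteration:

/* Toy kernel: evaluate a degree-7 polynomial per array element (Horner).
 * Assuming the coefficients c[0..7] and the accumulator stay in registers,
 * each outer iteration costs roughly:
 *   memory:  1 load (x[i]) + 1 store (y[i])              = 2 ops
 *   compute: 7 multiply-adds + a couple of loop ALU ops  = ~9 ops
 * i.e. a compute:memory ratio of about 4:1 for this section of code. */
void poly_eval(const double *x, double *y, int n, const double c[8])
{
    for (int i = 0; i < n; i++) {
        double t = c[7];
        for (int k = 6; k >= 0; k--)  /* Horner's rule over the coefficients */
            t = t * x[i] + c[k];
        y[i] = t;
    }
}

Other sections (pointer chasing, copies) look nothing like this, of course; the point is only that compute-heavy sections like that one are common enough that designers provision for them.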
Some designs have a 1:1 ratio, e.g. some 4-wide VLIW processors with (1 ALU + 1 mem) [1].
But most designs have more ALUs than mem units. Some keep the mem units simple and add extra ALUs, for instance (4 ALU + 2 mem) or (3 ALU + 1 mem).
Other designs start with a 1:1 ratio and add the ability to execute simple integer operations on the mem units, which breaks the 1:1 ratio in favor of compute over load/store. For instance, some (2 ALU + 2 mem) designs can perform up to four integer instructions per cycle when there are no load/store operations in flight.
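A back-of-envelope port model makes the effect visible (my own sketch, not a description of any particular core): sustained throughput is capped by whichever port class saturates first, and letting the mem ports also run simple integer ops raises the pure-integer peak of a (2 ALU + 2 mem) machine to four per cycle.

#include <math.h>
#include <stdio.h>

/* Rough port-bound throughput model (my own sketch). alu_frac/mem_frac are
 * the fractions of ALU and load/store ops in the dynamic instruction stream. */
static double bound_ipc(double alu_frac, double mem_frac,
                        int alu_ports, int mem_ports,
                        int mem_ports_run_simple_alu)
{
    double ipc = 1e9;  /* start "unbounded", then apply each port constraint */

    /* Load/store ops can only issue to the memory ports. */
    if (mem_frac > 0.0)
        ipc = fmin(ipc, mem_ports / mem_frac);

    if (mem_ports_run_simple_alu) {
        /* An ALU op may borrow a memory port only in a slot not taken by a
         * load/store, so the binding constraint is on the combined total. */
        ipc = fmin(ipc, (alu_ports + mem_ports) / (alu_frac + mem_frac));
    } else if (alu_frac > 0.0) {
        ipc = fmin(ipc, alu_ports / alu_frac);
    }
    return ipc;
}

int main(void)
{
    /* Pure integer stream on (2 ALU + 2 mem) with shared ports: 4.0 per cycle. */
    printf("%.1f\n", bound_ipc(1.0, 0.0, 2, 2, 1));
    /* A 2:1 compute:memory mix on the same machine: still bound at 4.0 by the
     * total issue ports, while the memory ports alone would allow up to 6.0. */
    printf("%.1f\n", bound_ipc(2.0/3.0, 1.0/3.0, 2, 2, 1));
    return 0;
}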
I have yet to find a commercial or academic design that does the opposite, that is, one with more mem units than ALUs (or FP units), for instance (1 ALU + 1 FP + 4 mem). I am not saying such a design doesn't exist, just that I don't know of any. Do you?
[1] Before someone comes back with another idiotic remark: I am not listing above all the units of that 4-wide core, which has (BR + ALU + mem + FP).