By: Brett (ggtgp.delete@this.yahoo.com), April 27, 2017 1:11 am
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on April 26, 2017 2:16 pm wrote:
> Brett (ggtgp.delete@this.yahoo.com) on April 26, 2017 1:27 pm wrote:
> > anon (spam.delete.delete@this.this.spam.com) on April 26, 2017 5:08 am wrote:
> > > RichardC (tich.delete@this.pobox.com) on April 25, 2017 6:41 pm wrote:
> > > > anon (spam.delete.delete@this.this.spam.com) on April 23, 2017 6:35 am wrote:
> > > >
> > > > > For pure unrolling it's just a matter of predication. Losing 2 ALU cycles of execution because
> > > > > you used a 4 wide loop and needed 4n+2 is negligible. It's not done any different with vectors.
> > > >
> > > > It's a huge deal if n=0 or n=1, which is very common in some quite interesting cases
> > > > (as mentioned before, looping over colliding entries in a hashtable bucket).
> > > >
> > >
> > > You get the same problem with vectorization though. You'd do it
> > > anyway because 1 or 2 cycles is better than 2 or 6 cycles.
> > >
> > > > And the word "just" is almost always a sign of a weak argument.
> > >
> > > As are generalizations.
> > >
> > > >
> > > > > If you need an actual prologue and epilogue it's still not wasted. Unless you've found
> > > > > some magic that enables you to execute instructions before their operands are ready
> > > > > you have to wait for the preceding instructions to finish on an OoO machine too.
> > > >
> > > > Not necessarily. The OoO may be finishing up some work unrelated to the loop.
> > > > The OoO can *sometimes* use those execution units in the warmup and cooldown of the loop;
> > > > the in-order VLIW can *never* use them.
> > > >
> > >
> > > Sometimes.
> > > I don't think they'll ever have the problem of running out of ALUs.
> > > Using the Mill Gold as an example I'd say 8 ALUs are enough that you won't run
> > > out most of the time and if you do waste a cycle it's still wide enough that
> > > the benefit of running the loop at "full" speed outweighs the downside.
> > > What I'd expect to be a problem is utilization. I mean, with 4 ALUs vs 8 ALUs at 1/3 of
> > > the clock rate, you have to keep them busy way better than OoO does to be faster.
> >
> > Gold has 37 pipelines and 16 retire stations, so your numbers are a bit off. ;)
>
> Last I checked retire stations aren't ALUs.
>
> Only exu are ALUs.
You are correct, I got scammed: only 8 ALUs.
At least I am mostly right about the register file not being the performance-killing limit that was claimed.
You win some, you lose some.
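
For anyone who wants to check the arithmetic being argued over in the quoted posts, here is a rough sketch. It is only a model of the numbers in the thread, not a simulation of the Mill or of any real OoO core. The function names are made up for illustration, and the 4-wide unroll, the 4 vs 8 ALU counts, and the 1/3 clock ratio are all taken from the posts above as assumptions.

```python
import math
from fractions import Fraction

def predicated_unroll_cost(trip_count, width=4):
    """Return (cycles, wasted_slots) for a loop unrolled `width` wide,
    with lanes past the trip count predicated off (hypothetical model)."""
    cycles = math.ceil(trip_count / width)
    wasted_slots = cycles * width - trip_count
    return cycles, wasted_slots

def break_even_utilization_ratio(narrow_alus=4, wide_alus=8,
                                 clock_ratio=Fraction(1, 3)):
    """How many times better the wide, slow machine must keep its ALUs
    busy to match the narrow, fast machine's peak ALU throughput."""
    return narrow_alus / (wide_alus * clock_ratio)

# Trip counts of the form 4n+2, including the n=0 and n=1 cases from the thread.
for n in (0, 1, 25):
    trip = 4 * n + 2
    cycles, wasted = predicated_unroll_cost(trip)
    print(f"trip={trip}: {cycles} cycles unrolled vs {trip} scalar, {wasted} slots wasted")

print("break-even utilization ratio:", break_even_utilization_ratio())
```

Under this model the wasted slots stay at 2 no matter how large n gets, which is the "negligible" case, while for n=0 half the slots are wasted but the loop still finishes in 1 cycle instead of 2. The last line comes out to 3/2, i.e. the 8-ALU machine at 1/3 the clock would need to keep its ALUs 1.5 times as busy just to tie on peak ALU throughput.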