# loops

By: anon (spam.delete.delete@this.this.spam.com), April 26, 2017 4:08 am
RichardC (tich.delete@this.pobox.com) on April 25, 2017 6:41 pm wrote:
> anon (spam.delete.delete@this.this.spam.com) on April 23, 2017 6:35 am wrote:
>
> > For pure unrolling it's just a matter of predication. Losing 2 ALU cycles of execution because
> > you used a 4 wide loop and needed 4n+2 is negligible. It's not done any differently with vectors.
>
> It's a huge deal if n=0 or n=1, which is very common in some quite interesting cases
> (as mentioned before, looping over colliding entries in a hashtable bucket).
>

You get the same problem with vectorization, though. You'd do it anyway because 1 or 2 cycles is better than 2 or 6 cycles.
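
To put rough numbers on the 4n+2 case, here is a minimal sketch that models a predicated, 4-wide unrolled loop and counts the lane slots that get masked off. The function name `predicated_unrolled_sum` and the idea of counting "wasted slots" are assumptions made for illustration; this is a toy model, not how any particular compiler or the Mill actually schedules the loop.

```python
def predicated_unrolled_sum(values, width=4):
    """Sum `values` in groups of `width` lanes, masking off lanes past the end.

    Returns (total, wasted_slots), where wasted_slots counts the lanes that
    were issued but predicated off in the final, partial group.
    """
    total = 0
    wasted_slots = 0
    for base in range(0, len(values), width):
        for lane in range(width):
            idx = base + lane
            active = idx < len(values)  # the predicate for this lane
            if active:
                total += values[idx]
            else:
                wasted_slots += 1  # slot issued, result discarded
    return total, wasted_slots


# Trip count 4n+2 with n=1: six elements, two lanes masked off in the last group.
print(predicated_unrolled_sum([0, 1, 2, 3, 4, 5]))
# Trip count 4n+2 with n=0: two elements, so half of the only group is masked off.
print(predicated_unrolled_sum([10, 20]))
```

In this model the absolute waste is the same 2 slots in both cases, but with n=0 it is half of all issued slots, which is the short-trip-count case raised in the quote above.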

> And the word "just" is almost always a sign of a weak argument.

As are generalizations.

>
> > If you need an actual prologue and epilogue it's still not wasted. Unless you've found
> > some magic that enables you to execute instructions before their operands are ready
> > you have to wait for the preceding instructions to finish on an OoO machine too.
>
> Not necessarily. The OoO may be finishing up some work unrelated to the loop.
> The OoO can *sometimes* use those execution units in the warmup and cooldown of the loop;
> the in-order VLIW can *never* use them.
>

Sometimes.
I don't think they'll ever have the problem of running out of ALUs.
Using the Mill Gold as an example, I'd say 8 ALUs are enough that you won't run out most of the time, and if you do waste a cycle, it's still wide enough that the benefit of running the loop at "full" speed outweighs the downside.
What I'd expect to be a problem is utilization. With 8 ALUs at 1/3 of the clock rate against 4 ALUs at full clock, you have to keep them busy far better than the OoO does to be faster.
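
A back-of-the-envelope sketch of that utilization argument, using only the numbers from the paragraph above (4 ALUs versus 8 ALUs at 1/3 of the clock rate). The function name `required_wide_utilization` and its parameters are hypothetical, and the model assumes ALU throughput is the only thing that matters, which is a deliberate simplification.

```python
from fractions import Fraction


def required_wide_utilization(ooo_utilization,
                              ooo_alus=4,
                              wide_alus=8,
                              wide_clock_ratio=Fraction(1, 3)):
    """Utilization the wide, slower-clocked machine needs to match the OoO.

    Throughput is measured in ALU operations per OoO clock cycle:
      OoO:  ooo_alus * ooo_utilization
      wide: wide_alus * wide_clock_ratio * wide_utilization
    A result above 1 means the wide machine cannot catch up in this model.
    """
    ooo_throughput = ooo_alus * Fraction(ooo_utilization)
    wide_peak = wide_alus * wide_clock_ratio
    return ooo_throughput / wide_peak


# If the OoO keeps its 4 ALUs half busy, the wide machine needs 3/4 utilization.
print(required_wide_utilization(Fraction(1, 2)))
# If the OoO sustains 2/3 utilization, the wide machine needs all 8 ALUs busy every cycle.
print(required_wide_utilization(Fraction(2, 3)))
```

Under these assumptions the wide machine always needs 1.5x the OoO's utilization just to break even.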
