Compilers – Code Optimization


High-level optimizations IV

Loop unrolling

The motivation for this optimization is the same as for invariant hoisting: code in the middle of a loop is executed many times, so more resources should be spent optimizing it. Unrolling means copying out the body of the loop several times and changing the loop index correspondingly. Unrolling once means that the time spent executing the branch instruction at the end of the loop and updating the index is halved. This may appear a trivial gain, but if the loop body is small it can be very significant indeed. For instance, the following assembly code adds the value in floating point register F2 to every element of an array:

Loop: LD   F0, 0(R1)   ; load the array element R1 points to into F0
      ADDD F4, F0, F2  ; perform the add
      SD   0(R1), F4   ; store the result back where it came from
      SUBI R1, R1, #8  ; R1 = R1 - 8; 8 bytes = 1 double precision floating point number
      BNEZ R1, Loop    ; branch back if R1 is not zero

This example is drawn from [1]. Using the very crude assumption that the processor executes one instruction per clock cycle, two of the five cycles per iteration of the loop above are spent performing the subtract and the branch instructions. Unrolling the body of the loop once gives:

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      SUBI R1, R1, #16
      BNEZ R1, Loop

Under the crude one-instruction-per-cycle assumption I gave before, this loop will execute in 80% of the time of the original version. This may be taken further: the loop could be unrolled three or more times. Each unrolling further reduces the cost of the SUBI and BNEZ instructions, the “loop overhead”. There is of course a limit beyond which further unrolling gains little benefit, but it varies from machine to machine and from loop to loop.

Modern microprocessors gain much more from loop unrolling than indicated above, for several reasons. Most importantly, they are heavily pipelined, which makes branches expensive operations. Processors have branch target buffers (BTBs) and branch predictors; once the loop has iterated once, the target address of the branch will be stored in the BTB. However, in some processors the BTB and branch predictor take considerable time to produce addresses, slowing the loop down considerably. More importantly, on in-order processors loop unrolling gives a later optimization, namely scheduling, the ability to produce more parallel code. Scheduling reorders the instructions to reveal parallelism. The example above will be extended in the discussion of scheduling later.

Loop unrolling has a number of problems. First, the compiler has no real way of knowing which loops in a program do the work, so it must guess; this is quite tricky, although estimates can be made from the number of iterations to be performed and the depth of loop nesting. It is only really worth unrolling loops that have a small amount of code in the body, since in larger ones the loop overhead is already a small fraction of the total. After all unlikely candidates are removed, two things should have happened: first, the loops that do the work should be unrolled; second, few loops should be unnecessarily unrolled. When the wrong loops are unrolled, the code grows considerably in size and sometimes runs slower.

Sometimes the range over which a loop is executed is fixed, e.g. 0->100. More often it is variable; if so, another loop must follow the unrolled loop to take account of the possibility that the range ends on an odd number (or, more generally, a number not divisible by the unroll factor).
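A sketch in C of what this looks like (illustrative names, unroll factor four): the main loop handles the bulk of the iterations, and a cleanup loop deals with the 0–3 iterations left over when the trip count is not a multiple of four.

```c
#include <stddef.h>

/* The loop "for (i = 0; i < n; i++) a[i] += x", unrolled by four.
   n need not be divisible by 4; the cleanup loop below handles the rest. */
void add_unrolled(double *a, size_t n, double x)
{
    size_t i = 0;
    size_t limit = n - (n % 4);   /* largest multiple of 4 <= n */

    for (; i < limit; i += 4) {   /* main loop: 4 elements per index update + branch */
        a[i]     += x;
        a[i + 1] += x;
        a[i + 2] += x;
        a[i + 3] += x;
    }
    for (; i < n; i++)            /* cleanup loop: the leftover 0-3 iterations */
        a[i] += x;
}
```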

Loop fusion and Loop fission

Programmers don’t always split work across loops in an optimal way. For instance, programs sometimes contain sets of loops that iterate across the same range of the same data; in this case the loops can be fused into one, removing some loop overhead and improving cache behavior.

/* Before */

   for (i = 0; i < M; i = i + 1) a[i] = b[i] / c[i];
   for (i = 0; i < M; i = i + 1) d[i] = a[i] + c[i];

/* After */

   for (i = 0; i < M; i = i + 1) {
      a[i] = b[i] / c[i];
      d[i] = a[i] + c[i];
   }

More obscure is loop fission. Most loop optimizations rely on the code having no dependences. A dependence (a “loop-carried” dependence) occurs when one iteration of the loop requires the previous iteration to have occurred before it. This could happen if, for instance, a statement refers to an array element calculated in the previous iteration. This kind of thing kills vectorization, can prevent scheduling from working well after loop unrolling, and sometimes prevents loop unrolling altogether. Loop fission is a partial solution: where possible, the loop is split into two loops, one containing the dependence and the other containing the rest of the work; this second loop can then be optimized well.
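As a sketch (in C, with illustrative names and an assumed loop-carried dependence through b), fission might look like this:

```c
/* Before fission: the dependence on b[i - 1] holds the whole loop back. */
void before(double *a, double *b, const double *c, int n)
{
    for (int i = 1; i < n; i++) {
        b[i] = b[i - 1] + c[i];   /* loop-carried dependence */
        a[i] = a[i] * c[i];       /* independent work */
    }
}

/* After fission: the dependence is isolated in its own loop, leaving the
   second loop free to be unrolled, scheduled or vectorized. */
void after(double *a, double *b, const double *c, int n)
{
    for (int i = 1; i < n; i++)
        b[i] = b[i - 1] + c[i];   /* still serial */
    for (int i = 1; i < n; i++)
        a[i] = a[i] * c[i];       /* now easy to optimize */
}
```

Fission is only legal here because neither statement reads what the other writes; a compiler must prove that before splitting.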

Induction variable elimination

Sometimes the loop variable isn’t needed; for instance, in the following loop:

   for(i = 0; i < M; i = i + 1) a[i] = 0;

i isn’t actually needed; the loop can be implemented as:

Loop: SW   0(R1), R0   ; store zero into the array element
      SUBI R1, R1, #4  ; move back one element; 4 bytes = 1 integer
      BNEZ R1, Loop    ; branch back if R1 is not zero
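At the source level, the same transformation can be sketched in C (illustrative names): the counter i disappears and the loop walks a pointer down through the array, mirroring the assembly.

```c
/* Zero an array of m ints without an index variable: the pointer p is the
   only thing updated and tested, counting down from the end to the start. */
void zero_array(int *a, int m)
{
    int *p = a + m;
    while (p != a)
        *--p = 0;   /* decrement pointer, store zero, test against a */
}
```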

Notice that this loop also goes backwards, bringing me to:

Iteration order reversal

Doing a loop in the order the program indicates isn’t always a good idea. In particular, iterating from a low number to a high one requires a comparison against the upper bound at the end of each iteration. If the loop ends on zero things are easier: no comparison is needed, since most machines have a “branch if not zero” instruction.
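A small C sketch of the idea (illustrative names): the reversed loop tests its counter against zero rather than against the upper bound, which on most machines compiles down to a single branch-if-not-zero.

```c
#include <stddef.h>

/* Sum an array by counting the index down to zero instead of up to n.
   The loop condition "i != 0" needs no explicit comparison against n. */
double sum_down(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = n; i != 0; i--)
        s += a[i - 1];   /* visit elements from last to first */
    return s;
}
```

The reversal is only legal when the iterations are independent, so the order of execution cannot be observed (floating point rounding aside).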

