High-level optimizations V
Vectorization
In the seventies Seymour Cray started the supercomputer business around vector processors, processors that can perform floating point operations on large vectors. At that time the work of Kuck on automatically vectorizing compilers was key. Today x86 processors from AMD and Intel have vector instruction set extensions such as SSE. Essentially the same analysis is done as for unrolling: the scalar instructions are replaced with vector instructions acting on multiple elements simultaneously. In essence, if there are no loop-carried dependences the loop can be vectorized, just as it can be unrolled. The stumbling block is the placement of data in memory: vector load operations must be able to assemble the vector register contents quickly enough to make the transformation worthwhile, which is tricky. The situation in vector supercomputers is more complex and rather obscure, so I won't bother with it.
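To make the transformation concrete, here is a minimal sketch in C using SSE intrinsics. The function names and alignment assumptions are mine, not anything a particular compiler emits; the scalar loop has no loop-carried dependences, so four floats can legally be processed per iteration:

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Scalar version: no iteration depends on a previous one. */
    void add_scalar(float *a, const float *b, const float *c, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }

    /* Roughly what vectorization produces: four elements at a time.
       Assumes n is a multiple of 4 and the arrays are 16-byte
       aligned; a real compiler adds fix-up code for the leftovers. */
    void add_vector(float *a, const float *b, const float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 vb = _mm_load_ps(&b[i]);
            __m128 vc = _mm_load_ps(&c[i]);
            _mm_store_ps(&a[i], _mm_add_ps(vb, vc));
        }
    }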
Memory access optimizations
There are actually several dozen memory access optimizations; most are relevant primarily to clusters. I'll describe a few of the general ones. See "Memory hierarchy design" in [1] for more on these.
Loop interchange
This is primarily a way of optimizing idiotic code. As there are relatively few idiotic programmers out there, it is quite uncommon. Most programming languages have a rule that specifies how multi-dimensional arrays are laid out in memory. Programmers who do numerical work in C generally know that a multi-dimensional array is an array of arrays, so the elements accessed by the rightmost array index are adjacent in memory; this is "row-major" form. Fortran applies the reverse rule: the leftmost index refers to adjacent elements, which is "column-major" form. When applying an operation to an entire multi-dimensional array it makes sense to iterate through adjacent elements, and to do this you must know which rule the language uses. Doing it wrong tends to fill the cache with junk on modern processors. Loop interchange can reorder loops to make up for programmer ignorance.
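As a minimal illustration in C (the array name and size are arbitrary), consider zeroing a two-dimensional array. The first version walks down columns and wastes nearly every cache line it touches; the second is what interchange turns it into:

    #define N 1024
    float a[N][N];

    /* Column order: consecutive accesses are N floats apart, so
       nearly every access touches a different cache line. */
    void zero_bad(void)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = 0.0f;
    }

    /* After interchange: consecutive accesses are adjacent in
       memory, so each cache line is used in full. */
    void zero_good(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 0.0f;
    }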
Blocking
Blocking is another multi-dimensional array optimization. The compiler chooses a block size and, instead of sweeping across each whole array, splits the arrays into small blocks that fit in the cache. Sometimes it is even possible to fit a block into registers, for instance on IA64, which has lots of registers.
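Here is a sketch of the idea in C, using a matrix transpose. The 32-element block size is an arbitrary stand-in for whatever the compiler derives from the cache size, and N is assumed to be a multiple of it:

    #define N 1024
    #define B 32    /* block size chosen to keep a tile cache-resident */

    float src[N][N], dst[N][N];

    /* Instead of streaming across whole rows of src while striding
       down whole columns of dst, work on B-by-B tiles so that both
       the source and destination blocks stay in the cache. */
    void transpose_blocked(void)
    {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        dst[j][i] = src[i][j];
    }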
Often I hear it said that an optimizing compiler or an optimization switch can "buy you 5%". This is totally incorrect: in some circumstances optimizations produce very little, while in others they can occasionally produce improvements of more than an order of magnitude. This is especially true of blocking and of the other loop optimizations mentioned here.
Memory bank optimizations
Some programmers believe that since computers are binary animals, using powers of two for everything is intrinsically A Good Thing. Sometimes, for instance in fast Fourier transforms and radix sorts, they are right. This instinct often extends to preferring array sizes that are powers of two. That is often a very bad idea.
Modern memory systems are often capable of servicing multiple requests at once. Memory is divided into banks, and the address space is constructed by interleaving the banks together. This means that when adjacent addresses are accessed, as when iterating through a one-dimensional array, the requests go to different banks and each bank services its share, producing good utilization. The number of banks is almost always a power of two. This means that if a multi-dimensional array has rows that are a power of two in size, the elements of each column all fall in the same bank, and accessing a column hammers that one bank; iterating over every element of the array in the wrong order may do this. Since a single bank has limited bandwidth this causes a large slowdown. It can also happen with arrays of structs when the struct is a power of two in size. Having the compiler make arrays that are powers of two in size just a bit bigger can prevent this. Loop interchange can also sometimes help. It is also possible for the hardware to solve the problem by mixing up data among the banks using a method derived from the Chinese remainder theorem; this is described in [1].
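The padding fix is simple to show in C. The names and the one-element pad are illustrative; a compiler, or a careful programmer, could do the same thing:

    #define N 1024  /* power of two: column accesses in a[N][N]
                       would all land in the same bank */

    /* Pad each row by one element so that successive rows start
       in different banks; column accesses now spread out. */
    float a[N][N + 1];

    float column_sum(int j)
    {
        float s = 0.0f;
        for (int i = 0; i < N; i++)
            s += a[i][j];  /* stride is N+1 elements, not N */
        return s;
    }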
Prefetching
The purpose of prefetching is to bring data that will be needed by the processor in the near future into the processor beforehand. There are two types of prefetching: software and hardware. Hardware prefetching involves no compiler support; logic in the processor predicts the sequence of memory reads the program will issue and brings data in correspondingly. This is only really of use in loops, where the sequence is predictable by simple means.
Software prefetching is a method of software-hardware cooperation. Recently it has been introduced in Intel and AMD microprocessors (though workstation RISC processors have had it for years). A new instruction, the prefetch, is introduced. In general this instruction takes an address as an argument and fills a cache line with the contents at that address (sometimes it takes a register instead). This makes the data available immediately when it is needed. Like hardware prefetching, compiler-controlled prefetching only really helps in loops. Prefetch instructions are generally "non-faulting": if the address they are given isn't in the current process's address space they do not raise an exception. This is very useful behavior from the compiler's point of view, since it means the compiler can prefetch the next element of an array without having to worry about running past the end. As with everything there are tradeoffs: a prefetch is an instruction and must be fetched and decoded, and the compiler must be careful that this use of decoding resources does not eliminate the benefit.
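As a sketch of what the compiler inserts, here is a loop using GCC's __builtin_prefetch, which compiles to a prefetch instruction on targets that have one. The distance of 64 elements ahead is an arbitrary choice; real compilers tune it to the loop body and the memory latency:

    float sum(const float *a, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            /* Non-faulting: harmless even when i + 64 runs past
               the end of the array. The second argument says the
               data will be read, the third is a locality hint. */
            __builtin_prefetch(&a[i + 64], 0, 3);
            s += a[i];
        }
        return s;
    }

Note the tradeoff mentioned above: this issues one prefetch per element where one per cache line would do; a smarter compiler unrolls the loop so the prefetch executes only once per line.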
References
[1] Hennessy J., Patterson D., "Computer Architecture: A Quantitative Approach", Morgan Kaufmann, 2002.
[2] Krauss, http://www.rebelution.net/upper/archive.shtml