Angels Dancing on the Head of a Pin
Given the number of complex variables that come into play in a modern computer, it is very difficult to predict beforehand how long an individual instruction in the middle of a program will take to execute. In fact, the concept itself is somewhat misleading. A register-to-register addition is often thought of as taking one clock cycle on most modern processors, both RISC and CISC. But that figure only represents the extra time added to program execution when the instruction is inserted into a program. If the ADD is fetched and issued in an instruction slot that would otherwise have been empty or filled with a NOP, and there is no immediate use of the result of the ADD, then its presence might have zero impact on overall program execution time on a superscalar processor. So is the instruction really executing in zero or one clock cycles? No, far from it.
With modern, deeply pipelined, out-of-order execution hardware, a simple register-to-register addition (ADD) might take 7 to 20 or more clock cycles to fetch, decode, dispatch, schedule, issue, execute, write back, and retire. The exact timing of our ADD instruction is also at the mercy of the execution timing of one or two other instructions if their results are needed as inputs. Our ‘one cycle’ ADD instruction might actually linger in the processor for hundreds of cycles waiting to be retired if an earlier instruction accesses the cache, misses, and causes a read from main memory. This is because instructions must generally be retired in program order, even in out-of-order execution machines.
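The effect of in-order retirement can be sketched in a few lines of Python. This is a deliberately simplified model, not a description of any real processor: it assumes unlimited retirement bandwidth, and the function name `retire_cycles` and the cycle counts are made up for illustration.

```python
def retire_cycles(complete_cycles):
    """Return the cycle at which each instruction retires.

    complete_cycles lists, in program order, the cycle at which each
    instruction finishes executing. Under in-order retirement an
    instruction cannot retire before every earlier instruction has.
    Assumption: any number of instructions may retire in one cycle.
    """
    retired = []
    oldest_blocker = 0
    for done in complete_cycles:
        # An instruction retires when it is done AND all older ones are done.
        oldest_blocker = max(oldest_blocker, done)
        retired.append(oldest_blocker)
    return retired


# Hypothetical numbers: a load that misses cache (done at cycle 210)
# sits between two quick ADDs (done at cycles 8 and 12).
example = retire_cycles([8, 210, 12])
```

In this example the second ADD finishes executing at cycle 12 but cannot retire until the load ahead of it completes at cycle 210, which is how a 'one cycle' instruction ends up lingering for hundreds of cycles.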
How can our one cycle ADD instruction really take 7 to 20 cycles to execute? The secret is parallelism. Modern processors work on the execution of many instructions at any given time. It is like asking how long a car factory takes to build a car. If you are a financial analyst, the metric you probably really want to know is the incremental time to build one more car. If a factory turns out 5 vehicles per hour, the incremental time to build another car would be 12 minutes, but that does not mean an individual car is built in 12 minutes. The factory operates using an assembly line, with many vehicles in various stages of completion at any given time. The same principle applies to modern MPUs, with the execution pipeline being equivalent to the factory assembly line. This is how a 1 GHz processor might actually take 10 to 20 ns or more to execute a given instruction, yet still average a throughput of over a billion instructions per second.
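The distinction between latency and throughput in the factory analogy comes down to simple arithmetic, sketched here in Python. The function names, the 15-cycle pipeline depth, and the 1.2 instructions-per-cycle figure are illustrative assumptions, not measurements of any particular processor.

```python
def incremental_minutes(units_per_hour):
    """Time between successive units leaving the line (throughput view)."""
    return 60.0 / units_per_hour


def pipeline_times(clock_hz, pipeline_cycles, instructions_per_cycle):
    """Return (latency in ns for one instruction, instructions per second).

    pipeline_cycles: cycles from fetch to retire for one instruction.
    instructions_per_cycle: average sustained completion rate.
    Both are hypothetical inputs chosen by the caller.
    """
    cycle_ns = 1e9 / clock_hz
    latency_ns = pipeline_cycles * cycle_ns
    throughput = clock_hz * instructions_per_cycle
    return latency_ns, throughput


# Factory from the text: 5 vehicles per hour.
car_gap = incremental_minutes(5)

# Assumed 1 GHz processor, 15 cycles fetch-to-retire, 1.2 instructions/cycle.
latency_ns, per_second = pipeline_times(1e9, 15, 1.2)
```

With these assumed inputs, a single instruction spends 15 ns in flight while the processor as a whole still completes more than a billion instructions per second, just as each car takes far longer than 12 minutes to build even though one rolls off the line every 12 minutes.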
About the best we can do is to describe instruction execution time statistically, in the context of specific programs or groups of programs. Computer architects might say that the average instruction takes 1.1 clock cycles to execute on a Pentium Pro running the SPECint92 benchmark suite, while an Alpha EV6 averages 0.5 clock cycles per instruction running SPECint95. This information can be obtained either with cycle-accurate, computer-based simulation of the processor running the code in question (this is how new CPU designs are optimized before implementation in silicon), or empirically, using the performance-monitoring counters and logic built into modern processors.
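As a rough sketch of how such an average is derived and used, the Python below computes cycles per instruction (CPI) from two counter readings and then estimates run time from it. The counter values, the instruction count, and the 600 MHz clock are invented for illustration; how the counters are actually read is processor and operating system specific and is not shown.

```python
def cycles_per_instruction(cycle_count, instruction_count):
    """Average CPI over a whole run, from two hardware counter totals."""
    return cycle_count / instruction_count


def estimated_seconds(instruction_count, cpi, clock_hz):
    """Predicted run time: instructions x cycles/instruction / cycles/second."""
    return instruction_count * cpi / clock_hz


# Hypothetical counter totals for one benchmark run.
cpi = cycles_per_instruction(cycle_count=1_100_000, instruction_count=1_000_000)

# Hypothetical workload: 3 billion instructions at CPI 0.5 on a 600 MHz clock.
run_time = estimated_seconds(3_000_000_000, 0.5, 600e6)
```

A CPI below 1.0, such as the 0.5 quoted for the Alpha EV6, simply means that on average more than one instruction completes per clock cycle.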
Fortunately, the specific time that an individual instruction takes to execute does not need to be known to determine performance at any level of granularity that is of interest to users. It takes millions of instructions to generate each new image flashed to the screen in a 3D graphics application. Even recalculating and displaying a modest spreadsheet might easily require the execution of millions of instructions. This is why statistically based performance analysis can accurately model and predict computer performance at the level of interest to users.