Goldilocks and the Three CPUs
I am going to put forth a performance optimality hypothesis for modern monolithic VLSI based MPU implementation without attempting rigorous proof but rather using an accessible, and hopefully convincing argument in its support. Our hypothesis is that for any given ISA and semiconductor process, there is an optimal physical size/complexity of CPU that will give a maximum execution speed or single thread performance for general purpose workloads. That is to say that CPU performance given by the equation:
Perf = (IPC * Freq)/Path
attains a maximum value for a design of finite size and complexity. In this equation Perf represents the performance in the form of the reciprocal of the time to complete a given computational task, IPC represents average instructions executed per clock cycle, Freq is the maximum CPU clock rate, and Path represents program path length in instructions to execute the given computational task to completion. For the sake of simplicity I will consider the program path length constant. In reality, one can add new instructions or addressing modes etc. to an ISA to reduce path length but that has similar effect on CPU complexity as an increasing IPC, not to mention it violates the original premise that an optimal CPU size exists for any given ISA.
I will not attempt to describe this optimal CPU in any way; I will simply posit that it exists and that any attempt to design a higher performance CPU by brainiac means (increasing issue width, OOOE capability, branch prediction accuracy etc), or speedracer means (simpler design, deeper pipelining etc) is doomed to failure. Keep in mind that I am considering only the CPU size and complexity, not that of the entire device. For the sake of analysis I will consider first level (L1) caches as an integral part of the CPU as per widely used convention. L2+ caches are usually considered external to the CPU since they do not generally influence processor clock rate and can be altered independently of the CPU. In contrast, altering the L1 cache would likely require modifications to the CPU.
Let’s start with the concept of a processor’s critical loops. A critical loop is an elementary processor function that must complete in a limited number of clock cycles and cannot be further pipelined due to cost or performance constraints . Some examples of critical loops include integer ALU result forwarding, data cache access, next PC generation, register renaming in OOOE processors etc. A basic processor consists of mechanisms to fetch a stream of instructions from an instruction cache, decode them, schedule them for execution, execute them (possibly accessing data cache or determining branch prediction accuracy in the process), and retire them while preserving necessary ISA requirements for serial program execution behavior and exactness of exception handling. Because of the existence of critical processor loops, it is necessary for some internal signals to travel a certain significant portion of the total CPU physical size within a single clock period. Some special purpose computational devices such as systolic arrays and graphics processors have far less demanding critical loops but these are not designed to handle general purpose applications which are often intensely control and data dependent in nature.
Since top level CPU organization and topology change little with CPU complexity, we will assume that global signal wire lengths for critical loops will increase proportionately to the square root of CPU complexity measured in area or transistor count:
Critical Signal Length = Comp1/2
RC wire delay increases quadratically with length, hence global signal wire delay increases linearly with CPU complexity (area or transistors). The maximum operating frequency of a processor is defined by the reciprocal of the critical path timing plus extra time related to flip-flop setup and clock skew and jitter. Skew is timing variation in the clock distribution system that occurs due to geographical location (i.e. point A is 5 mm away from point B). Jitter is period to period variation in the timing of the clock edges, and can sometimes be modeled as a random variable. Let’s define maximum clock frequency as:
Freq = 1/(Tclk + Ttran + Twire) = 1/(Tclk + Ttran + K1*Comp)
where Tclk is the sum of flip-flop setup time and clock skew and jitter, Ttran is the transistor portion of critical path delay, and Twire is the global wire portion of critical path delay which can expressed as the product of a constant K1 and CPU complexity Comp. Observation of the history of microprocessor development into the superscalar age suggests that IPC improves at best as the square root of CPU complexity . So we can write IPC and Performance, Perf, as:
IPC = K2 * Comp1/2 Perf = (IPC * Freq)/Path = (K2/Path)*Comp1/2*(Tclk+ Ttran + K1Comp)
With this formulation maximum performance is achieved for a design whose complexity is such that the global wire delay component of critical path timing, K1 Comp, is equal to the sum of the latch overhead and transistor portion of critical path timing, Tclk + Ttran. The performance of CPU designs less complex than this is sub-optimal because IPC falls faster than maximum frequency increases. For more complex designs performance is sub-optimal because IPC increases more slowly than the maximum frequency falls. This concept is shown in Figure 4.
Figure 4 – Optimal Performance CPU Size/Complexity
Although useful to demonstrate the principle of optimal CPU size, the previous derived formulation was greatly simplified from reality. For example, the transistor based portion of critical path timing would also tend to increase as with CPU complexity (although much slower than linearly, perhaps logarithmically). And IPC is not a continuous function of complexity but is rather discontinuous. For example you can build a 2 way or 3 way superscalar processor, but you cannot build a 2.7 way superscalar processor. Nevertheless many aspects of CPU design have much finer granularity than issue width yet also influence performance. For example the number of TLB entries, length of instruction queues, the capacity of branch target buffers and return address shadow stacks and so on.
So where is the state of the art in MPU design today with respect to the hypothetical optimal performance CPU? Indications are that current CPU designs are still much simpler, smaller, and slower than optimal performance CPU designs for their respective ISA and manufacturing processes. Real world commercial MPU designs face constraints like economic and manufacturing brakes on die size growth, practical restraints on power budgets, EDA tool limitations, and time to market concerns that keep them from even approaching the maximum single thread performance potential at the present time. However the BBW trend not only brings current designs closer to the point of optimal performance, but process scaling trends are quickly bringing the optimal performance CPU complexity limit closer to current designs as I will show in the next section.
Discuss (11 comments)