x86 Does It with Memory
Were the architects of Willamette a few transistors short of a flip-flop when they bought their 33% shorter data cache access time at the price of a 2.2 times higher average miss rate relative to the K7? I think they knew exactly what they were doing, and I'll explain why. It is well known that the x86 is quite register poor. Assembly language programmers and compilers generating code for an x86 processor have only eight general purpose registers (GPRs) to work with, and one of them (ESP) is reserved as the stack pointer. This contrasts sharply with RISC processors, which typically have 32 GPRs, as shown in Figure 1.
Figure 1. Register and Memory Usage: RISC versus x86
Typically, one RISC GPR always reads as zero, while several others are dedicated for use as the stack pointer and similar functions. This still leaves far more GPRs that a RISC compiler can allocate among storage for local variables, intermediate results of complex computations, and temporary values. With only seven GPRs to work with, x86 compilers quickly have to turn to memory, storing local variables and compiler temporaries in a structure called a stack frame. It is critical to the performance of an x86 processor that these memory-based values can be quickly made available for computation. That is where data cache latency comes into play. In simplistic terms, a 3-cycle load-use latency means that if a value is fetched from memory by a load instruction in cycle N, a second instruction that uses that value as an operand cannot execute until cycle N+3.
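This timing can be sketched with a toy in-order schedule. The instruction sequence, register names, and single-cycle ALU latency below are illustrative assumptions, not the actual Willamette pipeline; only the 3-cycle load-use latency comes from the text.

```python
# Toy in-order timeline: a load's result becomes usable 3 cycles after issue.
# The program and the 1-cycle ALU latency are illustrative assumptions.

LOAD_USE_LATENCY = 3  # cycles between a load's issue and a dependent use

def schedule(instructions):
    """Assign an issue cycle to each instruction, one per cycle,
    stalling a dependent use until its source values are ready."""
    ready = {}          # register -> cycle its value becomes available
    cycle = 0
    issue_cycles = []
    for op, dest, srcs in instructions:
        # Stall until every source operand is available.
        cycle = max([cycle] + [ready.get(r, 0) for r in srcs])
        issue_cycles.append(cycle)
        if op == "load":
            ready[dest] = cycle + LOAD_USE_LATENCY
        else:
            ready[dest] = cycle + 1  # simple 1-cycle ALU operation
        cycle += 1
    return issue_cycles

program = [
    ("load", "eax", ["esp"]),  # fetch a spilled local from the stack frame
    ("add",  "ebx", ["eax"]),  # dependent use: must wait for the load
]
print(schedule(program))  # → [0, 3]: the add issues in cycle N+3
```

The two cycles between the load and the add are exactly the issue slots that an out-of-order machine would try to fill with independent work, as discussed next.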
Modern x86 processors employ out-of-order (OOO) execution techniques. Thus, hardware is free to execute instructions in a different order than that in which they appear in the program. This means that if a load instruction executes in cycle N, and a dependent compute instruction is held frozen (stalled) until cycle N+3, the processor might be able to find independent instructions to execute in cycles N+1 and N+2. However, given that modern x86 processors can execute up to three instructions per cycle, the odds of finding up to six independent instructions to hide (or cover) the load-use latency are rather small. Execution opportunities lost in this way are lost forever, and this brings down the average IPC (instructions per cycle) and overall processor performance. By implementing a 2-cycle data cache in Willamette, Intel is greatly increasing the odds it can cover load-use latencies and reduce the impact of x86's register-poor architecture on performance. That is, if the data needed is in the cache.
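The arithmetic behind "up to six independent instructions" is simple enough to write down. This is a back-of-the-envelope sketch, assuming the 3-wide issue rate stated above; the helper function name is my own.

```python
# Issue slots exposed by a load-use stall at a given issue width.
# The 3-wide issue rate and both cache latencies come from the article.

ISSUE_WIDTH = 3  # up to three x86 instructions per cycle

def slots_to_cover(load_use_latency):
    """Independent instructions needed to fill every issue slot
    between a load and its first dependent use."""
    gap_cycles = load_use_latency - 1  # cycles the dependent use must wait
    return gap_cycles * ISSUE_WIDTH

print(slots_to_cover(3))  # → 6: K7-style 3-cycle cache
print(slots_to_cover(2))  # → 3: Willamette's 2-cycle cache
```

Halving the gap from two stall cycles to one halves the independent work the scheduler must find, which is the payoff Willamette buys with its faster, higher-miss-rate cache.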