What is a Trace Cache?
The most interesting feature of Willamette is its trace cache. The trace cache differs from a conventional instruction cache in two basic ways. First, the trace cache stores micro-ops (uops) instead of x86 instruction bytes, and second, the trace cache organizes program data by the expected execution flow rather than by memory location. To appreciate the large potential advantage of the Willamette trace cache consider the simplified block diagram of a modern decoupled execution x86 processor like the P6-based Pentium III or the K7 Athlon shown in Figure 1.
Figure 1. Organization of a Modern x86 CPU with a Conventional I-cache
Notice that in the conventional processor the critical execution path includes branch prediction, fetch of raw program bytes from the I-cache and the recognition, parsing, extraction, and decode of x86 instructions to obtain a stream of uops to drive the out-of-order execution back end. It takes quite a bit of logic and wide signal paths to be able to fetch, align, and decode multiple x86 instructions per clock cycle. Contrast that with the Willamette block diagram shown in Figure 2.
Figure 2. Organization of Willamette With Trace Cache
Notice that in Willamette the branch prediction and x86 decoding steps are no longer within the processor’s critical execution path. Also notice that program execution parallelism is no longer limited by the number of x86 decoders. In fact the Willamette may only have one or two x86 instruction decoders.
The reason why this is possible is because the trace cache actually operates in two different modes. In “trace segment build mode” the control logic in the trace cache fetches x86 instructions for the predicted program execution path and decodes them into uops in response to a processor trace cache access that misses. These uops are gathered together and stored as a logical group called a trace segment in the cache. The uops can also be bypassed so that they feed the execution engine even while the trace segment is assembled.
Upon completion of a trace segment the trace cache enters “execute mode”. In this mode, uops are fetched in predicted program order from the trace cache and executed. The degree of program parallelism available for the execution engines is limited only by the number of uops that can be fetched from the trace cache per clock cycle, rather than by the number of x86 instruction decoders present. The trick to getting good performance out of the trace cache is the same for all caches, program execution in loops. Willamette gets high performance whenever it can stay in trace cache execute mode (and out of trace segment build mode) for extended periods such as in program loops. An important point to remember is that while a uop coming out of an x86 instruction decoder in a P6 or K7 processor is only ever executed once, a uop coming out of the Willamette’s x86 instruction decoder is retained and may be executed tens, hundred, or even thousands of times.
A second important issue is branches. In some ways, Willamette actually executes branches twice. The first time is when the trace cache is in trace segment build mode. When it encounters a conditional branch, the control unit has to predict which way program flow is likely to go and it continues to assemble the trace segment accordingly. Later on, in execute mode, the branch’s uop is executed and the actual branch condition resolved. If the branch goes against the trace segment’s implicit branch direction then a misprediction has occurred. When this happens the processor immediately stops executing the current trace segment, discards all computation results calculated by instructions following the mispredicted branch, and tries to locate a different trace segment within the trace cache based on the true branch direction. Willamette has about a twenty-cycle branch misprediction penalty even when the alternative path trace segment is present in the trace cache, but most of this is due to the stretched out execution pipeline. The trace cache can actually lookup a new trace address and resume uop streaming within four clock cycles of receiving a misprediction signal. If this lookup fails then the processor drops back to trace segment build mode and the branch misprediction penalty is far greater than twenty cycles.
Be the first to discuss this article!