Shoot the Data out First and Ask Questions Later
During pipe stage 18, the effective address generated by the AGU is used to access the TLB and tag arrays. I am not aware of any technique for implementing a sum-addressed CAM for the TLB, so it is unlikely that the tag arrays employ sum-addressing. During pipe stage 19, the tag comparison results and the predicted way are cross-checked by special logic. If the cache access was invalid (the data wasn’t present, or it was but the wrong way was selected), then Willamette’s pipeline control logic has to squash all uOPs that used the invalid data, or results derived from that data, before the values are committed to the processor’s logical state. A normal data access is shown in Figure 5.
Figure 5. Normal Data Access
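The stage 18/19 interaction can be sketched in software. This is a purely illustrative model, not Willamette's actual logic: the names (`WAYS`, `SETS`, `way_pred`, `lookup`) and the per-address predictor organization are assumptions. Data is read from the predicted way immediately, and the tag compare a stage later either confirms the choice or flags the access as invalid.

```python
# Hypothetical sketch of a way-predicted 4-way set-associative lookup.
# All names and sizes here are illustrative, not Willamette's actual design.
WAYS = 4
SETS = 128

tags = [[None] * WAYS for _ in range(SETS)]  # tag array, one entry per set/way
way_pred = [0] * SETS                        # 2-bit predictor: a way index per set

def lookup(set_idx, tag):
    """Return (way_used_for_data, ok).

    Data is fetched from the predicted way right away (pipe stage 18);
    the tag comparison (stage 19) then cross-checks that choice.
    """
    predicted = way_pred[set_idx]            # way used to ship data out early
    hit_way = next((w for w in range(WAYS) if tags[set_idx][w] == tag), None)
    if hit_way is None:
        return predicted, False              # cache miss: data invalid, replay after fill
    if hit_way != predicted:
        way_pred[set_idx] = hit_way          # train the predictor; the load must replay
        return predicted, False              # data came from the wrong way
    return predicted, True                   # predicted way confirmed; uOPs may commit
```

The key point the sketch captures is that the consumer has already received data from `predicted` before `ok` is known, which is why a `False` result forces a squash rather than a simple stall.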
In this case the data is available with a 2-cycle load-use latency, and sometime after it is used the check logic OKs the access and the dependent uOPs can commit their results. If the cache access misses or the way is mispredicted, then the execution of the affected uOPs must be squashed, the L1 fixed up (the missed data fetched and/or the way predictor state updated), and then the load uOP and all dependent uOPs ‘replayed’ (i.e. rerun) to get the correct result. This process is shown in Figure 6.
Figure 6. Replayed Data Access
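The replay itself amounts to walking the dependence graph rooted at the bad load and re-issuing everything reached. A minimal sketch, with hypothetical names (`transitive_dependents`, `replay`, the `deps` map) that are not drawn from any Intel documentation:

```python
# Minimal sketch of replay: when the check logic flags an invalid access,
# the load uOP and every uOP derived from its result are squashed and rerun.
def transitive_dependents(root, deps):
    """deps maps each uOP to the set of uOPs that consume its result.
    Returns the root plus all uOPs transitively dependent on it."""
    squashed, frontier = set(), [root]
    while frontier:
        uop = frontier.pop()
        if uop not in squashed:
            squashed.add(uop)
            frontier.extend(deps.get(uop, ()))
    return squashed

def replay(load_uop, deps, rerun):
    """Squash the load and its dependents, then re-issue each one."""
    for uop in transitive_dependents(load_uop, deps):
        rerun(uop)  # re-issue down the pipe once the L1 has been fixed up
</imports>```

Independent uOPs that never touched the bad value are untouched, which is what makes replay cheaper than a full pipeline flush.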
The first questions that should come to mind are: how often do these ‘replay traps’ occur, and what is their effect on IPC? Traps due to cache misses will occur at the same rate as for a conventional cache of the same size and organization, and will have a similar effect on processor performance. The only new factor is replay traps that occur as a result of way misprediction. It is likely that a way misprediction incurs at least a 3-cycle penalty. The misprediction will be recognized during pipe stage 19. With a bit of cleverness, the load uOP might restart execution at pipe stage 17 during the next clock period, with the way predictor updated on the fly.
So what is the likely effect of way misprediction? Apparently, very little. I did not factor any effect of way prediction into my Willamette estimations in Table 1, yet Intel claims better latency performance than I predicted. If Willamette does employ way prediction, its predictor will likely include a substantial number of entries, since only two bits are required per entry. Way prediction is a much more tractable problem than, say, branch prediction, because the mapping of cache lines to cache ways changes very little. Once a given line is in the cache, it won’t ever change ways unless it is evicted and reloaded again. That would seem to imply that when cache miss rates are low, way prediction will tend to be very accurate. And when the data cache miss rate is high, way misprediction may be partially hidden by the overhead of cache line fills from the L2 or main memory.
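The argument that accuracy tracks the miss rate can be checked with a toy simulation. Everything here is an assumption for illustration: the sizes, the address-hashed predictor table of 2-bit entries (`ENTRIES`), and the replacement policy are invented, not Willamette's. The point is only that since a resident line never changes ways, a predictor entry trained at fill time stays correct until the line is evicted.

```python
import random

# Toy simulation of the claim that way-prediction accuracy tracks the cache
# miss rate. Cache geometry and predictor organization are assumptions.
WAYS, SETS, ENTRIES = 4, 64, 4096

def simulate(addresses):
    """Return (hit_rate, prediction_accuracy_on_hits) for an address trace."""
    tags = [[None] * WAYS for _ in range(SETS)]
    pred = [0] * ENTRIES                     # 2-bit entries, indexed by address
    hits = correct = 0
    for a in addresses:
        s, t, e = a % SETS, a // SETS, a % ENTRIES
        if t in tags[s]:
            hits += 1
            w = tags[s].index(t)
            correct += (w == pred[e])        # was the predicted way the real one?
            pred[e] = w                      # train on every access
        else:
            # fill an empty way if possible, else evict a random victim
            if None in tags[s]:
                victim = tags[s].index(None)
            else:
                victim = random.randrange(WAYS)
            tags[s][victim] = t
            pred[e] = victim                 # predictor learns the fill way
    return hits / len(addresses), correct / max(hits, 1)
```

With a working set that fits in the cache, no line is ever evicted, so the hit rate approaches 1.0 and every hit finds its line in the way the predictor learned at fill time, matching the article's reasoning.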