Trace Cache Implementation
The Willamette trace cache actually consists of two distinct memory blocks. Uops are stored in the so-called data array while trace segment address and control information is stored in the tag array. This organization is shown in Figure 3.
Figure 3. Trace Cache Organization
The processor initiates a trace cache access by performing a trace segment lookup based on a program address. This is essentially a fully associative CAM (content-addressable memory) search for a matching program (“linear”) address within the tag array. If the lookup succeeds, i.e. a trace cache hit, the trace cache starts fetching trace segment members (groups of uops) from the data array. Fetching continues sequentially through the data array (wrapping at the end of the array if necessary) until either the end of the trace is detected, at which point another trace segment lookup is performed, or a branch mispredict is detected. If the trace segment lookup fails, the processor enters trace segment build mode.
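The lookup-then-fetch sequence can be sketched in a few lines of Python. This is only a behavioral illustration under simplifying assumptions of mine: the tag and data arrays are merged into one flat list of hypothetical `Line` records, the way structure is ignored, and branch mispredicts are not modeled. None of the names come from Intel documentation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """One tag entry plus its data line (hypothetical, simplified layout)."""
    tag_addr: Optional[int]  # linear address if this line starts a trace, else None
    uops: List[str]          # the uops held in this data line
    end_of_trace: bool       # True on the last member of a trace


def lookup(lines: List[Line], addr: int) -> Optional[int]:
    """CAM-style search: compare addr against every tag.

    Hardware would do all comparisons in parallel; the loop is only a model.
    """
    for index, line in enumerate(lines):
        if line.tag_addr == addr:
            return index
    return None


def fetch_trace(lines: List[Line], addr: int) -> Optional[List[str]]:
    """Return the uops of the trace starting at addr, or None on a miss."""
    index = lookup(lines, addr)
    if index is None:
        return None  # miss: the processor would enter trace segment build mode
    fetched: List[str] = []
    for _ in range(len(lines)):  # guard against a trace with no end marker
        line = lines[index]
        fetched.extend(line.uops)
        if line.end_of_trace:
            return fetched
        index = (index + 1) % len(lines)  # sequential fetch, wrapping at the end
    raise ValueError("no end-of-trace marker found")
```

In this model a trace whose head sits in the last line of the array simply continues at index 0, which is the wrap-around behavior described above.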
Some x86 instructions are quite complex and take many (possibly hundreds of) uop-equivalent processing steps to execute. These so-called “CISCy” or complex x86 instructions are cleverly handled in a special way so as not to pollute the trace cache with long “canned” sequences of uops driven by microcode. In trace build mode, when a CISCy instruction is encountered, the control logic adds a special microcode entry point address to the tag entry for that trace segment member and possibly a few prologue uops to the data array. In execute mode, the trace cache detects the presence of the microcode entry point in the tag and transitions to a special operating mode in which uops generated by a microcode sequencer unit are steered into the trace cache output path. When the microsequencer completes the complex x86 instruction, the Willamette trace cache resumes normal trace segment fetching from the next data line. The process of entering and exiting microsequencer mode is carefully integrated into normal trace cache operation and can be accomplished with very little overhead.
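A minimal sketch of this hand-off, assuming a hypothetical `microcode_rom` dictionary that maps an entry point address to the uops the microsequencer would generate (the real sequencer is a state machine, not a lookup table):

```python
from typing import Dict, List, Optional, Tuple

# (uops stored in the data line, microcode entry point from the tag or None)
TraceLine = Tuple[List[str], Optional[int]]


def execute_trace(lines: List[TraceLine],
                  microcode_rom: Dict[int, List[str]]) -> List[str]:
    """Model the uop stream leaving the trace cache output path."""
    stream: List[str] = []
    for uops, ms_entry in lines:
        stream.extend(uops)  # ordinary uops, or the prologue of a complex instruction
        if ms_entry is not None:
            # Microsequencer mode: its uops are steered into the output path,
            # then normal fetching resumes with the next data line.
            stream.extend(microcode_rom[ms_entry])
    return stream
```

The point of the arrangement shows up in the data structure: the long microcode sequence lives once in the ROM, and the trace only carries the entry point.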
The organization of trace segment member uops within the data array is shown in Figure 4. A trace segment consists of one or more trace segment members, each a group of up to six uops stored in one of four “ways” within the trace cache data array. The first segment member in a trace is called the head, the last is called the tail, and every member in between is called a body segment member. Two bits in the tag entry denote whether a trace segment member is a head, body, or tail element. The distinction is important to the operation of the finite state machines that drive the cache control logic. As previously mentioned, the head segment member of a trace is located by performing a lookup with the tag array. Subsequent members are stored in consecutive sets (rows) within the data array. The way in which each member resides is recorded as a couple of bits within the tag array elements, which are effectively chained together like a doubly linked list for each distinct trace.
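The chaining of tag entries can be pictured with a small Python model. The field names (`kind`, `next_way`, `prev_way`) are my own labels for the bits described above, and a trace with a single member (head and tail at once) is left out of this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple

NUM_WAYS = 4  # four ways, as described in the text


class Kind(Enum):
    """The head/body/tail distinction held in two tag bits."""
    HEAD = 0
    BODY = 1
    TAIL = 2


@dataclass
class Tag:
    kind: Kind
    next_way: Optional[int]  # way of the following member (None on the tail)
    prev_way: Optional[int]  # way of the preceding member (None on the head)


def walk_trace(tags: List[List[Optional[Tag]]],
               set_index: int, way: int) -> List[Tuple[int, int]]:
    """Follow a trace from its head, returning the (set, way) of each member."""
    num_sets = len(tags)
    path: List[Tuple[int, int]] = []
    for _ in range(num_sets):  # a trace cannot span more sets than exist
        tag = tags[set_index][way]
        if tag is None:
            raise ValueError("chain points at an empty tag entry")
        path.append((set_index, way))
        if tag.kind is Kind.TAIL:
            return path
        set_index = (set_index + 1) % num_sets  # members occupy consecutive sets
        way = tag.next_way                      # but may sit in any way
    raise ValueError("no tail found")
```

Only the way needs to be stored for each link, because the set of the next member is always the current set plus one.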
Figure 4. Trace Segment Organization Within the Data Array
In trace segment build mode, the trace cache control logic assembles incoming uops from the x86 instruction decoder(s) into groups of up to six uops within the trace cache fill buffer. These groups may be filled with fewer than six uops (and presumably padded with NOPs) for a variety of reasons, including implementation restrictions such as not splitting uops from a single x86 instruction across data lines or hitting a maximum number of branch uops permitted per data line. When a data line is finished (either six uops have been collected or some restriction has come into effect), the contents of the fill buffer are written into the data array. The way is chosen using a least recently used (LRU) strategy to avoid overwriting any portion of a recently active trace segment (which is more likely to be executed again). The sharing of the data array by trace segment members from three different traces (each a distinct color) is shown in Figure 4. The trace cache logic continues to decode x86 instructions and append uops to the current trace until a trace segment termination condition occurs. Trace segments are terminated when 1) an indirect branch, call, or return instruction is encountered, 2) a branch misprediction or exception is raised, or 3) the trace segment length reaches a set limit (64 was suggested in the trace cache patent).
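The line-packing rules lend themselves to a short sketch. The following Python is an illustration of the two restrictions named above, not a description of the actual fill logic: the per-line branch limit `max_branches=2` is a placeholder value of mine, branch uops are recognized by a made-up `"br"` name prefix, and way selection and trace termination are not modeled.

```python
from typing import List

LINE_SIZE = 6  # uops per data line, as described in the text


def pack_lines(instructions: List[List[str]],
               max_branches: int = 2) -> List[List[str]]:
    """Pack the uops of each x86 instruction into NOP-padded data lines.

    Each inner list holds the uops of one instruction. An instruction is
    never split across lines, and a line never exceeds max_branches branches.
    """
    lines: List[List[str]] = []
    current: List[str] = []   # models the fill buffer
    branches = 0

    def flush() -> None:
        nonlocal current, branches
        lines.append(current + ["nop"] * (LINE_SIZE - len(current)))
        current, branches = [], 0

    for uops in instructions:
        n_branches = sum(1 for uop in uops if uop.startswith("br"))
        if len(uops) > LINE_SIZE or n_branches > max_branches:
            # Such instructions would be handled via microcode instead.
            raise ValueError("instruction does not fit in a single data line")
        if (len(current) + len(uops) > LINE_SIZE
                or branches + n_branches > max_branches):
            flush()  # a restriction came into effect: write the line out
        current.extend(uops)
        branches += n_branches
    if current:
        flush()
    return lines
```

Feeding it a four-uop instruction followed by a three-uop instruction yields a first line with two NOPs of padding, since the second instruction cannot be split to fill the remaining slots.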
Notice that in Figure 3 the output path of the trace cache is shown as being three uops wide, yet a trace segment member actually consists of up to six uops. Interestingly, the Intel trace cache patent shows no possible mechanism to separately address the two halves of a segment member. My conclusion is that the Willamette trace cache outputs six uops every clock cycle using two separate transfers of three uops each (i.e. the output path is double pumped). This is also a smart implementation choice, as it eliminates over 350 bused signals from one of the most congested areas of the processor. The ability of the trace cache to provide six uops every cycle also seems to be the most reasonable conclusion in light of the amount of parallel execution resources available within the Willamette.
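The double-pumping idea, and the arithmetic behind the wiring estimate, can be written down briefly. The sketch below assumes that the signals saved correspond exactly to three uops' worth of bus wires; under that assumption a saving of more than 350 signals implies a uop of at least 117 bits. The function names are mine.

```python
from typing import List, Tuple


def double_pump(line: List[str]) -> Tuple[List[str], List[str]]:
    """Split one six-uop data line into the two three-uop transfers of a clock."""
    if len(line) != 6:
        raise ValueError("a data line holds exactly six uop slots")
    return line[:3], line[3:]


def min_uop_bits(signals_saved: int = 350, uops_removed: int = 3) -> int:
    """Smallest whole uop width w such that uops_removed * w > signals_saved."""
    return signals_saved // uops_removed + 1
```

Put differently, a six-wide path of 117-bit uops would need 702 signals, while the three-wide double-pumped path needs 351.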