By: David Kanter (dkanter.delete@this.realworldtech.com), January 18, 2011 12:24 pm
Room: Moderated Discussions
? (0xe2.0x9a.0x9b@gmail.com) on 1/18/11 wrote:
---------------------------
>Matt Waldhauer (M.Waldhauer@gmx.de) on 1/18/11 wrote:
>---------------------------
>>David Kanter (dkanter@realworldtech.com) on 1/17/11 wrote:
>>---------------------------
>>>Matt Waldhauer (M.Waldhauer@gmx.de) on 1/17/11 wrote:
>>>---------------------------
>>>>? (0xe2.0x9a.0x9b@gmail.com) on 1/16/11 wrote:
>>>>---------------------------
>>>>>On the other hand, I predict that AMD will be forced to copy the microOP
>>>>>cache from Intel's Sandy Bridge (unless AMD has already done so in some form in
>>>>>the Bulldozer architecture. It would be nice if Bulldozer featured such a cache, but I doubt Bulldozer has it.)
>>>>
>>>>AMD already patented a branch redirect recovery cache. Very similar to the uOp cache. Does that count as copy? ;)
>>>
>>>So...judging from the name, this sounds very different from a uop cache. It sounds
>>>like they just have patented a cache that holds common targets of mispredicted branches.
>>>I.e. it reduces the penalty of a branch mispredict, rather than accelerating the common case.
>>>
>>>AFAIK, Bulldozer does not have anything like the uop cache at all, although AMD
>>>has always been a little less worried about decode bandwidth.
>>
>>Quotes from the patent:
>>"One or more decode stages (corresponding to the decode unit 18 ) may also be bypassed.
>>In the illustrated embodiment, pipeline stages corresponding to fetch and decode
>>are bypassed, and instructions are inserted into the rename/schedule unit 22."
>>
>>"The operations cached by the redirect recovery cache 20 may be in the form that
>>they exist at the pipeline stage in which they will be inserted (e.g. partially
>>or fully decoded form). Alternatively, the redirect recovery cache 20 may comprise
>>logic to modify the cached operations to produce the form they would have at the
>>inserted stage. This logic may be relatively simple logic that does not add clock
>>cycles to the path of inserting operations into the desired pipeline stage."
>>
>>So several forms incl. the caching of decoded ops are included. Since it's not
>>needed to cover correctly predicted branches, it can be small:
>>"The redirect recovery cache 20 may have any desired number of entries. A relatively
>>small number of entries may be used (e.g. 64 or 128) and may achieve relatively high hit rates for many applications."
>>
>>http://www.freepatentsonline.com/7685410.html
>
>I think David Kanter is wrong here, and Matt Waldhauer is right. The patent essentially
>gives AMD the right to implement any equivalent to Intel's microOP cache. A sample
>crucial paragraph is (column 9 on page 10 of the patent):
>
>"In some embodiments, the redirect recovery cache 20 may implement loop cache functionality.
>If a loop is detected, the loop may continue dispatching out of the redirect recovery
>cache 20 until the loop branch is mispredicted. The mispredict (sequential) path
>for the loop branch may also be stored in the redirect recovery cache 20 , further
>accelerating processing for the loop."
>
>The main point of the patent is the following: The recovery cache can start to
>supply decoded instructions directly to the scheduler after a branch misprediction,
>as indeed stated by the title of the patent. The decisive factor here is the potential
>length of the stream of the decoded instructions. In short, a *single* mispredict
>can (and that *is* the main hidden point of the patent) *trigger* a continuous stream
>of thousands of decoded instructions, a million of them, or 100 millions of them.
>If this is not functionally equivalent to a trace cache, then I don't know what is.
>
>(I wonder if the patent office are only playing dumb not to see that this is basically
>the same thing that Transmeta/Intel/whoever are *already* using in their products,
>or whether the patent office actually are so stupid. Not to mention that there are
>numerous other patents patenting a very similar (essentially identical) idea. Different words, but same idea.)
I hadn't read the patent until now due to connectivity issues...so my analysis was based mostly on the title. However, I believe my reading is still correct.
Also, just because it is patented doesn't mean it will be implemented at all...or in the same form as described in the patent.
The patent describes the same idea as a trace cache, but puts it to a very different use. As the title of the patent indicates, the cache is used when a branch is mispredicted and the front end is redirected.
In the common case of a correctly predicted branch, this approach provides no benefit in performance or power consumption. That is the point I was making. While the structure is similar to a trace cache (decoded uops in dynamic execution order), it is used in different circumstances.
So I'd say that Intel's uop cache differs in two respects:
1. Trace cache (patent) vs. decoded uop cache (SNB)
2. Used for mispredicts (patent) vs. all fetches (SNB)
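
The second distinction can be made concrete with a minimal toy sketch in Python. It is not a model of either real design: the `Fetch` record, the `bypassed_decodes` function, the addresses, and the fetch stream are all hypothetical, every cached address is assumed to hit, and the loop-cache embodiment quoted above is deliberately ignored. It only counts how many fetches could skip the decoders under each policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fetch:
    address: int            # hypothetical fetch address
    after_mispredict: bool  # True if this fetch is the redirect target of a mispredicted branch


def bypassed_decodes(fetches, cached_addresses, mispredict_only):
    """Count fetches served from a decoded-op cache instead of the decoders."""
    count = 0
    for f in fetches:
        if f.address not in cached_addresses:
            continue  # miss: the fetch goes through the normal fetch + decode path
        if mispredict_only and not f.after_mispredict:
            continue  # mispredict-only policy: cache is consulted only on a redirect
        count += 1
    return count


# Assumed stream: a loop body at 0x100 fetched 9 times with correct prediction,
# then one mispredicted loop exit that redirects to 0x200.
stream = [Fetch(0x100, False)] * 9 + [Fetch(0x200, True)]
cached = {0x100, 0x200}

recovery = bypassed_decodes(stream, cached, mispredict_only=True)
uop_cache = bypassed_decodes(stream, cached, mispredict_only=False)
print(recovery, uop_cache)  # prints: 1 10
```

Under these assumptions the mispredict-only policy bypasses decode once, while the all-fetches policy bypasses it on every fetch, which is the gap in the common case described above.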
David