By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), October 14, 2020 12:11 pm
Room: Moderated Discussions
I did learn a few things from reading "Improving the Utilization of Micro-operation
Caches in x86 Processors" (Jagadish B. Kotra and John Kalamatianos, 2020, PDF). I had read before that µop cache entries typically terminated at a naturally aligned boundary in the Icache, either the cache line or a smaller naturally aligned boundary. I was not aware that (large) immediates were placed in a separate section of the µop cache line. (Special handling makes sense since large immediates are used in a minority of µops. I suspect the authors were unaware of Heidi Pan's heads-and-tails proposal, which is highly reminiscent of the fill toward the middle presented for µop cache lines, especially since they do not present this as a compression mechanism — avoiding huge fixed-size µops with 64-bit immediate fields that are rarely used.)
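For concreteness, here is a minimal sketch of that fill-toward-the-middle idea, with all sizes and names being my own illustrative assumptions rather than the paper's encoding: µops pack from the front of a line, immediate chunks pack from the back, and the entry terminates when the two regions would meet.

```python
# Minimal sketch (hypothetical sizes, not the paper's encoding) of filling a
# µop cache line toward the middle: µops grow from the front, large-immediate
# chunks grow from the back, so a fixed 64-bit immediate field per µop is not needed.

LINE_BYTES = 64   # assumed µop cache line size
UOP_BYTES  = 7    # assumed µop size (56 bits), large immediates removed
IMM_BYTES  = 4    # assumed immediate chunk size

def try_append(line, uop, imm_chunks=0):
    """Append a µop at the front and its immediate chunks at the back.
    Returns False (entry terminates) if the two regions would collide."""
    front = line["uop_end"] + UOP_BYTES
    back  = line["imm_start"] - imm_chunks * IMM_BYTES
    if front > back:
        return False                      # line full: front and back regions meet
    line["uops"].append(uop)
    line["uop_end"] = front
    line["imm_start"] = back
    return True

line = {"uops": [], "uop_end": 0, "imm_start": LINE_BYTES}
print(try_append(line, "add r1,r2"))          # no immediate
print(try_append(line, "mov r3,imm64", 2))    # 64-bit immediate stored as two chunks
```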
I was also not aware that entries terminated at a predicted taken branch. I would have guessed an entry would extend at least to the last full instruction in a fetch chunk. (Conceivably one could even pack the raw partial instruction into an entry, but the extra complexity of handling the different storage — undecoded instruction data would have to be sent backward in the pipeline to the decoder — might not be justified by the small benefit of limited prefetching of a sometimes followed branch fall-through path or a function return.)
(I vaguely recall one of David Kanter's articles indicating that multiple ways were filled to supply extra capacity for a µop cache entry. Glancing at Agner Fog's description of Sandy Bridge's µop cache, it does not match well with the paper's design.)
It is not clear if the example sizes of µops (56 bits) and µop cache lines (64 bytes) were typical. With large immediates removed, I would have guessed µops would be smaller (though obviously such depends on how much decoding one allows after the µop cache and other factors). I would also have guessed that line size would not be a power of two bytes. With 7-byte µops and 4-byte immediate chunks, 64-byte lines would always have wasted storage (though this does not mean that such is necessarily less efficient, it does raise the question); AMD or Intel could surely afford to spend a little design effort to save a few percent of µop cache area.
(I would also have guessed that larger immediates would be extended from the µop internal immediate field. 16-bit immediate chunks — one internal to the µop and zero to three external — would not add much complexity and potentially increase effective storage. This might be implemented and the paper does mention that standard code compression techniques could be applied to a µop cache.)
I consider Cache Line boundary AgnoStic design (CLASP) somewhat obvious (and a bad attempt at forming an acronym) when entries are allowed to be larger than a fetch chunk. I.e., if one has a buffer containing an extended basic block larger than a fetch chunk, extending such beyond the cache line boundary (and not just the fetch chunk boundary) seems a modest step. The coherence problem is easily handled by probing two sets, as the paper notes. It does have the advantage of increasing bandwidth (besides increasing storage utilization).
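As a sketch of that two-set probe (the indexing and set count here are assumed, not taken from the paper): an entry is indexed by its starting address, so invalidating an Icache line means also probing the set of the preceding line, where a spanning entry could reside.

```python
# Illustrative sketch (not the paper's implementation) of coherence handling when
# µop cache entries may span an Icache line boundary: an entry is indexed by its
# starting address, so invalidating line X must also probe the set of line X-1.

LINE_BYTES = 64
NUM_SETS   = 64          # assumed number of µop cache sets

def set_index(addr):
    return (addr // LINE_BYTES) % NUM_SETS

def sets_to_probe_on_invalidate(line_addr):
    """Sets whose entries might overlap the invalidated Icache line."""
    return {set_index(line_addr), set_index(line_addr - LINE_BYTES)}

print(sets_to_probe_on_invalidate(0x10040))   # probes two adjacent sets
```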
The concept of compaction is (as noted in the paper) closely related to some cache compression proposals where two compressed cache blocks fit into a single line. (I would actually propose growing inward from both sides of a storage chunk shared between two sets, with the size of the storage chunk probably somewhat larger than twice the average entry size. Such could use the full line for a single entry, allowing greater extension for larger extended basic blocks. One could also support variable line size, especially with high associativity and overlaid skewed associativity — overlaid skewed associativity would increase the opportunities for a particular address to find a better size match. Yes, I have OSA-on-the-brain disease.)
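A rough sketch of the grow-inward sharing I have in mind, with a hypothetical chunk size: two logical entries fill one shared storage chunk from opposite ends, so a lone large entry can take most or all of the chunk.

```python
# Rough sketch (hypothetical size) of two logical entries growing inward from
# opposite ends of one shared storage chunk; a lone entry may use the whole chunk.

CHUNK_BYTES = 96   # assumed: somewhat larger than twice the average entry size

class SharedChunk:
    def __init__(self):
        self.low_end = 0               # bytes used by the entry growing upward
        self.high_start = CHUNK_BYTES  # start of the entry growing downward

    def alloc_low(self, size):
        if self.low_end + size > self.high_start:
            return False               # would collide with the partner entry
        self.low_end += size
        return True

    def alloc_high(self, size):
        if self.high_start - size < self.low_end:
            return False
        self.high_start -= size
        return True

chunk = SharedChunk()
print(chunk.alloc_low(80))    # a large entry can spill past the midpoint...
print(chunk.alloc_high(40))   # ...so its partner fails and must be placed elsewhere
```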
The replacement mechanism modifications were also somewhat straightforward yet creative. "Replacement Aware Compaction" simply prefers placing an entry according to the recency of existing entries' use rather than best fit, which allows tracking recency of use per line with less retention of old entries that use only part of a line. (With grow-inward sharing of storage and two distinct 'lines', having separate replacement tracking would seem less burdensome. One could also consider overwriting the end of an extended basic block to provide more room while still allowing part of an older entry to be retained.)
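To make the preference concrete, a toy comparison with invented structures: among candidate lines with enough free space, best fit picks the tightest fit, while the recency-based preference picks the line whose resident entry has gone longest without use.

```python
# Toy illustration (invented structures) of Replacement-Aware Compaction's
# placement preference: among lines with enough free space, choose the one whose
# resident entry was least recently used, rather than the best (tightest) fit.

def pick_line_best_fit(lines, needed):
    fits = [l for l in lines if l["free"] >= needed]
    return min(fits, key=lambda l: l["free"], default=None)

def pick_line_rac(lines, needed):
    fits = [l for l in lines if l["free"] >= needed]
    return min(fits, key=lambda l: l["last_use"], default=None)

lines = [
    {"id": 0, "free": 20, "last_use": 90},   # tight fit, but recently used
    {"id": 1, "free": 40, "last_use": 10},   # loose fit, long unused
]
print(pick_line_best_fit(lines, 18)["id"])   # -> 0
print(pick_line_rac(lines, 18)["id"])        # -> 1  (prefer the stale line)
```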
For "Prediction Window-Aware Compaction", it was not clear if the 31.6% of Prediction Windows containing two µop cache entries was after the CLASP enhancement or using the baseline design. In their design the µop count limit is set by line size (so no compaction opporunity would be presented). CLASP handles cache line boundaries. Maximum number of immediates could be easily extended by slightly increasing the size of µops, which might be easier/less overhead than the proposed compaction or at least comparable. "maximum number of micro-coded µops" (i.e., instructions decoded to four or more µops) seems unlikely to be a common entry termination condition. PWAC seems underwhelming, but perhaps I am missing something. (I would have thought that the microcoded instructions could be "easily" decode as 'internal function calls' rather than partially filling a entry and terminating the entry early. Supporting such might make supporting returns to the inside of a µop cache line easier, at least for leaf functions; a small buffer could hold the post-return µops whether the return was from a microcode sequence or an ordinary function call. If the called function is a cache miss, additional instructions might be fetched and decoded beyond the fetch chunk containing the call instruction with only modest cost, though it might be easier to just stop decode at the fetch chunk.)
Modeling using true LRU replacement seems unrealistic. (Greater associativity in the Icache than the Dcache also seems odd.)
The use of bad-case benchmarks is normal (at least I assume that is what "These workloads exhibit significant under-utilization of micro-op cache due to fragmentation." means; otherwise something like "these workloads represent common under-utilization" would have been used, I am guessing), but measuring the normal-case benefit (or harm) would have been nice if the simulation time was not too expensive.
With µop cache entries being extended basic blocks, I would have guessed that rename optimization would be applied. (One can determine which sources are actually written by previous instructions in the extended basic block and replace them with destination numbers, removing the need to check dependencies. One could also re-order µops, at least within a basic block, to reduce routing energy or provide some other benefit. If a banked register alias table were used, bank conflict information could be cached [such might also introduce another cause of entry termination — if bandwidth is limited at the RAT, higher µop cache fetch bandwidth would be less useful].)
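A small sketch of that dependence pre-computation, using made-up µop tuples: each source is either left as an architectural register (value produced outside the block) or replaced by the index of the producing µop within the extended basic block, so rename need not rediscover those dependences.

```python
# Small sketch (made-up µop tuples) of pre-computing intra-block dependences:
# each source either keeps its architectural register (produced outside the
# block) or is replaced by the index of the producing µop within the block,
# so those dependences need not be rediscovered at rename time.

uops = [
    ("add", "r1", ["r2", "r3"]),   # (opcode, destination, sources)
    ("sub", "r4", ["r1", "r5"]),   # r1 comes from µop 0
    ("mul", "r1", ["r4", "r1"]),   # r4 from µop 1, r1 from µop 0
]

def precompute_deps(uops):
    last_writer = {}                       # arch reg -> index of last producer
    annotated = []
    for i, (op, dst, srcs) in enumerate(uops):
        resolved = [("uop", last_writer[s]) if s in last_writer else ("reg", s)
                    for s in srcs]
        annotated.append((op, dst, resolved))
        last_writer[dst] = i
    return annotated

for u in precompute_deps(uops):
    print(u)
```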
I do not mean to be unfairly negative (and I hope my negative comments are not cruel or even just insensitive). I know it is easier to find fault than to provide superior alternatives, easy to ignore limited resources available for research, and easy to see creative ideas as less stunningly creative in hindsight. Since I have not written a single research paper (not even a survey), I am likely less sensitive to the difficulties. My expectations may also be unrealistic; not every expert in the field is more competent than I am in every aspect of the field (even though I am just a hobbyist), not every paper can be so clearly written that readers feel like geniuses because it was so easy to understand, not every paper is so thorough and well organized that readers feel the authors are reading their minds when objections or enhancements are addressed just as they come to the reader's mind.
Perhaps something positive will come from this post.