By: anon (anon.delete@this.ymous.org), October 15, 2020 11:56 am
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on October 14, 2020 12:11 pm wrote:
> I did learn a few things from reading "Improving the Utilization of Micro-operation
> Caches in x86 Processors" (Jagadish B. Kotra and John Kalamatianos, 2020, PDF). I had read before that µop
> cache entries typically terminated at a naturally aligned boundary in the Icache, either the cache line or a
> smaller naturally aligned boundary. I was not aware that (large) immediates were placed in a separate section
> of the µop cache line. (Special handling makes sense since large immediates are used in a minority of µops.
> I suspect the authors were unaware of Heidi Pan's heads-and-tails proposal, which is highly reminiscent of
> the fill toward the middle presented for µop cache lines, especially since they do not present this as a compression
> mechanism — avoiding huge fixed-size µops with 64-bit immediate fields that are rarely used.)
>
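(Interjecting with a sketch to check my understanding of the separate immediate section and the heads-and-tails resemblance: a line filled toward the middle, µops from the front and large immediates from the back. The sizes are just the paper's examples and the whole structure is my guess, not the actual hardware format.)

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64   /* example line size from the paper */
    #define UOP_BYTES   7   /* example uop size (56 bits)       */
    #define IMM_BYTES   4   /* example large-immediate chunk    */

    typedef struct {
        uint8_t bytes[LINE_BYTES];
        int uop_end;    /* first free byte from the front (starts at 0)         */
        int imm_start;  /* first used byte from the back (starts at LINE_BYTES) */
    } uop_line;

    /* Try to append one uop (with an optional large immediate); failure
       means the two regions would meet, terminating the entry. */
    static bool append_uop(uop_line *l, const uint8_t uop[UOP_BYTES],
                           const uint8_t *imm /* NULL if none */)
    {
        int need = UOP_BYTES + (imm ? IMM_BYTES : 0);
        if (l->uop_end + need > l->imm_start)
            return false;
        memcpy(&l->bytes[l->uop_end], uop, UOP_BYTES);
        l->uop_end += UOP_BYTES;
        if (imm) {
            l->imm_start -= IMM_BYTES;
            memcpy(&l->bytes[l->imm_start], imm, IMM_BYTES);
        }
        return true;
    }
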
> I was also not aware that entries terminated at a predicted taken branch. I would have guessed an entry would
> extend at least to the last full instruction in a fetch chunk. (Conceivably one could even pack the raw partial
> instruction into an entry, but the extra complexity of handling the different storage — undecoded instruction
> data would have to be sent backward in the pipeline to the decoder — might not be justified by the small
> benefit of limited prefetching of a sometimes followed branch fall-through path or a function return.)
>
> (I vaguely recall one of David Kanter's articles indicating that multiple ways were filled
> to supply extra capacity for a µop cache entry. Glancing at Agner Fog's description
> of Sandy Bridge's µop cache, it does not match well with the paper's design.)
>
> It is not clear if the example sizes of µops (56 bits) and µop cache lines (64 bytes) were typical. With large
> immediates removed, I would have guessed µops would be smaller (though obviously such depends on how much decoding
> one allows after the µop cache and other factors). I would also have guessed that line size would not be a
> power of two bytes. With 7-byte µops and 4-byte immediate chunks, 64-byte lines would always have wasted storage
> (though this does not mean that such is necessarily less efficient, it does raise the question); AMD or Intel
> could surely afford to spend a little design effort to save a few percent of µop cache area.
>
> (I would also have guessed that larger immediates would be extended from the µop internal immediate
> field. 16-bit immediate chunks — one internal to the µop and zero to three external — would not
> add much complexity and potentially increase effective storage. This might be implemented and the paper
> does mention that standard code compression techniques could be applied to a µop cache.)
>
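(A rough sketch of the 16-bit-chunk idea as I read it — one chunk internal to the µop, zero to three external, selected by a 2-bit length code. All field names are invented for illustration.)

    #include <stdint.h>

    typedef struct {
        uint16_t imm_lo;      /* 16-bit chunk internal to the uop     */
        uint8_t  ext_chunks;  /* 0..3 external 16-bit chunks (2 bits) */
    } uop_imm;

    /* Reassemble the immediate from the internal field plus external
       chunks stored elsewhere in the line (sign extension omitted). */
    static uint64_t read_imm(uop_imm u, const uint16_t *ext)
    {
        uint64_t v = u.imm_lo;
        for (int i = 0; i < u.ext_chunks; i++)
            v |= (uint64_t)ext[i] << (16 * (i + 1));
        return v;
    }
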
> I consider Cache Line boundary AgnoStic design (CLASP) somewhat obvious (and a bad attempt at forming an acronym)
> when entries are allowed to be larger than a fetch chunk. I.e., if one has a buffer containing an extended
> basic block larger than a fetch chunk, extending such beyond the cache line boundary (and not just the fetch
> chunk boundary) seems a modest step. The coherence problem is easily handled by probing two sets as the paper
> notes. It does have the advantage of increasing bandwidth (besides increasing storage utilization).
>
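(On the two-set probe: as I understand it, an entry is indexed by its start address but may spill into the next Icache line, so an invalidation must probe both the set of the invalidated line and the set of the preceding line. A toy sketch with invented set parameters.)

    #include <stdint.h>

    #define LINE_SHIFT 6    /* 64-byte lines         */
    #define SET_MASK   0x3f /* 64 sets, illustrative */

    static unsigned set_of(uint64_t addr)
    {
        return (unsigned)((addr >> LINE_SHIFT) & SET_MASK);
    }

    /* Invalidate uop-cache entries affected by a write to the Icache line
       containing addr: entries starting in this line, plus line-crossing
       entries that start in the previous line. */
    void uop_invalidate(uint64_t addr, void (*probe)(unsigned set, uint64_t addr))
    {
        probe(set_of(addr), addr);
        probe(set_of(addr - (1ull << LINE_SHIFT)), addr);
    }
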
> The concept of compaction is (as noted in the paper) closely related to some cache compression proposals where
> two compressed cache blocks fit into a single line. (I would actually propose using grow inward from both sides
> and a storage chunk shared between two sets with the size of the storage chunk probably being somewhat larger
> than twice the average entry size. Such could use the full line for a single entry, allowing greater extension
> for larger extended basic blocks. One could also support variable line size, especially with high associativity
> and overlaid skewed associativity — overlaid skewed associativity would increase the opportunities for a
> particular address to find a better size match. Yes, I have OSA-on-the-brain disease.)
>
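(To make the grow-inward sharing concrete: one storage chunk serves two sets, each allocating from its own end, so a lone large entry can take the whole chunk. All sizes are arbitrary.)

    #include <stdbool.h>

    #define CHUNK_BYTES 96  /* somewhat larger than twice an average entry */

    typedef struct {
        int left_used;   /* bytes held by set A's entry (from the left)  */
        int right_used;  /* bytes held by set B's entry (from the right) */
    } shared_chunk;

    /* An entry may grow until it would collide with the other set's entry;
       with the other side empty, it can use the full chunk. */
    static bool alloc_entry(shared_chunk *c, bool set_a, int bytes)
    {
        if (c->left_used + c->right_used + bytes > CHUNK_BYTES)
            return false;
        if (set_a) c->left_used  += bytes;
        else       c->right_used += bytes;
        return true;
    }
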
> The replacement mechanism modifications were also somewhat straightforward yet creative. "Replacement
> Aware Compaction" simply prefers fitting an entry according to the recency of an existing entry's use rather
> than best fit, to allow tracking recency of use per line with less retention of old entries that use only
> part of a line. (With grow inward sharing of storage and two distinct 'lines', having separate replacement
> tracking would seem less burdensome. One could also consider overwriting the end of an extended basic
> block to provide more room while still allowing part of an older entry to be retained.)
>
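(My reading of Replacement Aware Compaction in sketch form: among ways with room for the incoming entry, choose by the resident entry's recency rather than tightness of fit, so per-line recency tracking stays meaningful. Structures are invented.)

    #include <limits.h>

    typedef struct {
        int free_bytes;  /* space left in this way's line       */
        int last_used;   /* recency stamp of the resident entry */
    } way_state;

    /* Return the way whose resident entry is least recently used among
       those that can fit the new entry; -1 means fall back to ordinary
       replacement (evict a whole line). */
    static int pick_way(const way_state *ways, int nways, int need)
    {
        int best = -1, oldest = INT_MAX;
        for (int i = 0; i < nways; i++) {
            if (ways[i].free_bytes < need) continue;
            if (ways[i].last_used < oldest) {
                oldest = ways[i].last_used;
                best = i;
            }
        }
        return best;
    }
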
> For "Prediction Window-Aware Compaction", it was not clear if the 31.6% of Prediction Windows containing two
> µop cache entries was after the CLASP enhancement or using the baseline design. In their design the µop
> count limit is set by line size (so no compaction opportunity would be presented). CLASP handles cache line
> boundaries. Maximum number of immediates could be easily extended by slightly increasing the size of µops,
> which might be easier/less overhead than the proposed compaction or at least comparable. "maximum number of
> micro-coded µops" (i.e., instructions decoded to four or more µops) seems unlikely to be a common entry termination
> condition. PWAC seems underwhelming, but perhaps I am missing something. (I would have thought that the microcoded
> instructions could be "easily" decoded as 'internal function calls' rather than partially filling an entry and
> terminating the entry early. Supporting such might make supporting returns to the inside of a µop cache line
> easier, at least for leaf functions; a small buffer could hold the post-return µops whether the return was
> from a microcode sequence or an ordinary function call. If the called function is a cache miss, additional
> instructions might be fetched and decoded beyond the fetch chunk containing the call instruction with only
> modest cost, though it might be easier to just stop decode at the fetch chunk.)
>
> Modeling using true LRU replacement seems unrealistic. (Greater
> associativity in the Icache than the Dcache also seems odd.)
>
> The use of bad case benchmarks is normal (at least I assume that is what "These workloads exhibit significant
> under-utilization of micro-op cache due to fragmentation." means, otherwise something like "these workloads
> represent common under-utilization" would have been used, I am guessing), but measuring the normal case
> benefit (or harm) would have been nice if the simulation time was not too expensive.
>
> With µop cache entries being extended basic blocks, I would have guessed that rename optimization
> would be applied. (One can determine which sources are actually written by previous instructions
> in the extended basic block and replace them with destination numbers, removing the need to check
> dependencies. One could also re-order µops, at least within a basic block, to reduce routing
> energy or provide some other benefit. If a banked register alias table was used, bank conflicts
> could be cached [such might also introduce another cause of entry termination — if bandwidth
> is limited at the RAT, higher µop cache fetch bandwidth would be less useful].)
>
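(Sketching what I understand by the rename optimization, before I ask about it below: precomputing, at fill time, which sources are produced inside the same extended basic block so those need no RAT lookup at fetch. The encoding is invented.)

    #include <stdint.h>

    #define NARCH 16  /* x86-64 integer architectural registers */

    typedef struct {
        uint8_t src[2], dst;  /* architectural register numbers              */
        int8_t  src_idx[2];   /* producing uop index within entry, -1 = none */
    } cached_uop;

    /* One pass over the entry: a source written by an earlier uop in the
       same entry is rewritten to that uop's index, removing the need to
       check that dependency through the RAT. */
    static void precompute_deps(cached_uop *u, int n)
    {
        int8_t last_writer[NARCH];
        for (int r = 0; r < NARCH; r++) last_writer[r] = -1;
        for (int i = 0; i < n; i++) {
            for (int s = 0; s < 2; s++)
                u[i].src_idx[s] = last_writer[u[i].src[s]];
            last_writer[u[i].dst] = (int8_t)i;
        }
    }
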
> I do not mean to be unfairly negative (and I hope my negative comments are not cruel or even just insensitive).
> I know it is easier to find fault than to provide superior alternatives, easy to ignore limited resources
> available for research, and easy to see creative ideas as less stunningly creative in hindsight. Since
> I have not written a single research paper (not even a survey), I am likely less sensitive to the difficulties.
> My expectations may also be unrealistic; not every expert in the field is more competent than I am in
> every aspect of the field (even though I am just a hobbyist), not every paper can be so clearly written
> that readers feel like geniuses because it was so easy to understand, not every paper is so thorough
> and well organized that readers feel the authors are reading their minds when objections or enhancements
> are addressed just as they come to the reader's mind.
>
> Perhaps something positive will come from this post.
Thank you for this detailed analysis. I haven't read the paper thoroughly yet, but I wanted to discuss one of your comments about reordering µops and rename optimizations.
Regarding reordering, the problem is that you cannot generally rename out of order. Although reordering might seem harmless at first glance (what's the difference between "add rax, rcx; ld rbx, [rdx]; add r12, rax" and "ld rbx, [rdx]; add rax, rcx; add r12, rax"?), I think it gets messy if you want precise exceptions/interrupts. So you probably can rename out of order, but you then need to map those out-of-order mappings back to a ROB-like structure that is allocated earlier in the pipeline than rename, which might be awkward. It is an interesting thought, though, because I know of some designs where the RAT is port-limited, so rename groups with at most x reads/writes have to be formed, which may not match what is coming out of Decode (the compiler could help here).
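To make the ROB point concrete, a toy sketch (all names invented): even if µops are renamed out of program order, each renamed µop must land in the ROB slot given by its program order, so retirement — and thus exceptions/interrupts — stays precise.

    /* uops carry their program-order index from decode, even if rename
       processes them in a different order. */
    typedef struct {
        int prog_idx;   /* position assigned at decode          */
        int phys_dst;   /* physical register assigned at rename */
    } renamed_uop;

    /* The ROB is indexed by program order, not rename order: the hoisted
       "ld rbx, [rdx]" still occupies slot 1 and retires after slot 0's
       "add rax, rcx". */
    static void rob_insert(renamed_uop *rob, renamed_uop u)
    {
        rob[u.prog_idx] = u;
    }
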
As for the rename optimization ("rewriting"), I am not sure I followed the idea. Could you please elaborate?