Thoughts on "Improving the Utilization of µop Caches..."

By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), October 14, 2020 12:11 pm
Room: Moderated Discussions
I did learn a few things from reading "Improving the Utilization of Micro-operation Caches in x86 Processors" (Jagadish B. Kotra and John Kalamatianos, 2020, PDF). I had read before that µop cache entries typically terminated at a naturally aligned boundary in the Icache, either the cache line or a smaller naturally aligned boundary. I was not aware that (large) immediates were placed in a separate section of the µop cache line. (Special handling makes sense since large immediates are used in a minority of µops. I suspect the authors were unaware of Heidi Pan's heads-and-tails proposal, of which the fill-toward-the-middle layout presented for µop cache lines is highly reminiscent, especially since they do not present this as a compression mechanism — avoiding huge fixed-size µops with 64-bit immediate fields that are rarely used.)
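
A minimal sketch of that two-ended layout as I understand it (the 56-bit µop and 64-byte line are the paper's example numbers; the 4-byte immediate chunk and all names are my assumptions):

    # uops fill the line from byte 0 upward; large immediates fill it
    # from byte 63 downward, so rarely-used immediate storage does not
    # bloat every uop
    LINE_BYTES = 64
    UOP_BYTES = 7    # the paper's 56-bit example uop
    IMM_BYTES = 4    # assumed immediate-chunk size

    def entry_fits(n_uops, n_imms):
        # the entry fits as long as the two regions do not meet
        return n_uops * UOP_BYTES + n_imms * IMM_BYTES <= LINE_BYTES

    print(entry_fits(8, 2))   # 56 + 8 = 64 bytes: just fits -> True
    print(entry_fits(9, 1))   # 63 + 4 = 67 bytes -> False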

I was also not aware that entries terminated at a predicted-taken branch. I would have guessed an entry would extend at least to the last full instruction in a fetch chunk. (Conceivably one could even pack the raw partial instruction into an entry, but the extra complexity of handling the different storage — undecoded instruction data would have to be sent backward in the pipeline to the decoder — might not be justified by the small benefit of limited prefetching of a sometimes-followed branch fall-through path or a function return.)

(I vaguely recall one of David Kanter's articles indicating that multiple ways were filled to supply extra capacity for a µop cache entry. Glancing at Agner Fog's description of Sandy Bridge's µop cache, that design does not seem to match the paper's well.)

It is not clear if the example sizes of µops (56 bits) and µop cache lines (64 bytes) were typical. With large immediates removed, I would have guessed µops would be smaller (though obviously such depends on how much decoding one allows after the µop cache and other factors). I would also have guessed that line size would not be a power of two bytes. With 7-byte µops and 4-byte immediate chunks, 64-byte lines would almost always have wasted storage (though this does not mean that such is necessarily less efficient, it does raise the question); AMD or Intel could surely afford to spend a little design effort to save a few percent of µop cache area.
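
A quick enumeration (again using the paper's example sizes; the script is mine) shows how few µop/immediate mixes fill a 64-byte line exactly:

    LINE_BYTES = 64
    UOP_BYTES = 7
    IMM_BYTES = 4

    def wasted_bytes(n_uops, n_imms):
        used = n_uops * UOP_BYTES + n_imms * IMM_BYTES
        assert used <= LINE_BYTES
        return LINE_BYTES - used

    print(wasted_bytes(9, 0))   # a uop-only line wastes 1 byte
    exact = [(n, m) for n in range(10) for m in range(17)
             if n * UOP_BYTES + m * IMM_BYTES == LINE_BYTES]
    print(exact)                # -> [(0, 16), (4, 9), (8, 2)]

Every other mix leaves at least one byte unused.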

(I would also have guessed that larger immediates would be extended from the µop-internal immediate field. 16-bit immediate chunks — one internal to the µop and zero to three external — would not add much complexity and would potentially increase effective storage. This might be what is implemented, and the paper does mention that standard code compression techniques could be applied to a µop cache.)
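
To make the chunking concrete (this is my speculation, not the paper's design):

    def split_immediate(value, width_bits):
        # split an immediate into 16-bit chunks, low chunk first;
        # chunk 0 lives in the uop, the rest are stored externally
        assert width_bits in (16, 32, 48, 64)
        chunks = [(value >> s) & 0xFFFF for s in range(0, width_bits, 16)]
        return chunks[0], chunks[1:]   # internal chunk, 0-3 external

    internal, external = split_immediate(0x12345678, 32)
    print(hex(internal), [hex(c) for c in external])  # 0x5678 ['0x1234']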

I consider Cache Line boundary AgnoStic design (CLASP) somewhat obvious (and a bad attempt at forming an acronym) when entries are allowed to be larger than a fetch chunk. I.e., if one has a buffer containing an extended basic block larger than a fetch chunk, extending such beyond the cache line boundary (and not just the fetch chunk boundary) seems a modest step. The coherence problem is easily handled by probing two sets, as the paper notes. It does have the advantage of increasing bandwidth (besides increasing storage utilization).
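
The two-set probe is cheap; something like the following (the set count and all names are my assumptions — the paper only says two sets are probed):

    UOPC_SETS = 64     # assumed number of uop cache sets
    LINE_BYTES = 64

    def sets_to_probe(invalidated_line_addr):
        # an entry indexed by the previous Icache line may spill into
        # the invalidated line, so both sets must be probed
        this_set = (invalidated_line_addr // LINE_BYTES) % UOPC_SETS
        return {this_set, (this_set - 1) % UOPC_SETS}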

The concept of compaction is (as noted in the paper) closely related to some cache compression proposals where two compressed cache blocks fit into a single line. (I would actually propose a grow-inward fill from both sides and a storage chunk shared between two sets, with the size of the storage chunk probably being somewhat larger than twice the average entry size. Such could use the full line for a single entry, allowing greater extension for larger extended basic blocks. One could also support variable line size, especially with high associativity and overlaid skewed associativity — overlaid skewed associativity would increase the opportunities for a particular address to find a better size match. Yes, I have OSA-on-the-brain disease.)
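
A rough sketch of the grow-inward sharing I have in mind (my idea, not the paper's compaction scheme):

    class SharedChunk:
        # one storage chunk shared by two logical lines: one entry
        # grows from the left edge, the other from the right
        def __init__(self, size_bytes):
            self.size = size_bytes
            self.left_used = 0
            self.right_used = 0

        def free_bytes(self):
            return self.size - self.left_used - self.right_used

        def append(self, side, nbytes):
            if nbytes > self.free_bytes():
                return False          # the two entries would collide
            if side == "left":
                self.left_used += nbytes
            else:
                self.right_used += nbytes
            return True

    chunk = SharedChunk(128)
    print(chunk.append("left", 100))   # True: one entry can take most
    print(chunk.append("right", 40))   # False: of the shared chunk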

The replacement mechanism modifications were also somewhat straightforward yet creative. "Replacement Aware Compaction" simply prefers placing an entry according to the recency of use of existing entries rather than best fit, allowing recency of use to be tracked per line with less retention of old entries that use only part of a line. (With grow-inward sharing of storage and two distinct 'lines', having separate replacement tracking would seem less burdensome. One could also consider overwriting the end of an extended basic block to provide more room while still allowing part of an older entry to be retained.)
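
The contrast with best fit might look like this (the line fields are my invention):

    def pick_line_best_fit(lines, entry_size):
        # best fit: the tightest line the entry still fits in
        fits = [l for l in lines if l["free"] >= entry_size]
        return min(fits, key=lambda l: l["free"], default=None)

    def pick_line_rac(lines, entry_size):
        # RAC as I read it: among fitting lines, prefer the least
        # recently used so per-line recency tracking stays meaningful
        fits = [l for l in lines if l["free"] >= entry_size]
        return max(fits, key=lambda l: l["lru_age"], default=None)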

For "Prediction Window-Aware Compaction", it was not clear if the 31.6% of Prediction Windows containing two µop cache entries was after the CLASP enhancement or using the baseline design. In their design the µop count limit is set by line size (so no compaction opporunity would be presented). CLASP handles cache line boundaries. Maximum number of immediates could be easily extended by slightly increasing the size of µops, which might be easier/less overhead than the proposed compaction or at least comparable. "maximum number of micro-coded µops" (i.e., instructions decoded to four or more µops) seems unlikely to be a common entry termination condition. PWAC seems underwhelming, but perhaps I am missing something. (I would have thought that the microcoded instructions could be "easily" decode as 'internal function calls' rather than partially filling a entry and terminating the entry early. Supporting such might make supporting returns to the inside of a µop cache line easier, at least for leaf functions; a small buffer could hold the post-return µops whether the return was from a microcode sequence or an ordinary function call. If the called function is a cache miss, additional instructions might be fetched and decoded beyond the fetch chunk containing the call instruction with only modest cost, though it might be easier to just stop decode at the fetch chunk.)

Modeling using true LRU replacement seems unrealistic. (Greater associativity in the Icache than the Dcache also seems odd.)

The use of bad-case benchmarks is normal (at least I assume that is what "These workloads exhibit significant under-utilization of micro-op cache due to fragmentation." means; otherwise something like "these workloads represent common under-utilization" would have been used, I am guessing), but measuring the normal-case benefit (or harm) would have been nice if the simulation time was not too expensive.

With µop cache entries being extended basic blocks, I would have guessed that rename optimization would be applied. (One can determine which sources are actually written by previous instructions in the extended basic block and replace them with destination numbers, removing the need to check dependencies. One could also re-order µops, at least within a basic block, to reduce routing energy or provide some other benefit. If a banked register alias table was used, bank conflict resolution could be cached [such might also introduce another cause of entry termination — if bandwidth is limited at the RAT, higher µop cache fetch bandwidth would be less useful].)
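
A hedged sketch of the intra-entry dependence pre-linking suggested above (my illustration; the paper does not describe this):

    def prelink_sources(uops):
        # uops: list of (dests, srcs) architectural-register tuples;
        # a source written earlier in the extended basic block is
        # replaced by the writing uop's index, skipping the RAT lookup
        last_writer = {}
        linked = []
        for i, (dests, srcs) in enumerate(uops):
            linked.append((dests,
                           [("uop", last_writer[s]) if s in last_writer
                            else ("rat", s) for s in srcs]))
            for d in dests:
                last_writer[d] = i
        return linked

    # r1 = r2 + r3; r4 = r1 + r2: the second uop's r1 links to uop 0
    print(prelink_sources([((1,), (2, 3)), ((4,), (1, 2))]))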

I do not mean to be unfairly negative (and I hope my negative comments are not cruel or even just insensitive). I know it is easier to find fault than to provide superior alternatives, easy to ignore limited resources available for research, and easy to see creative ideas as less stunningly creative in hindsight. Since I have not written a single research paper (not even a survey), I am likely less sensitive to the difficulties. My expectations may also be unrealistic; not every expert in the field is more competent than I am in every aspect of the field (even though I am just a hobbyist), not every paper can be so clearly written that readers feel like geniuses because it was so easy to understand, not every paper is so thorough and well organized that readers feel the authors are reading their minds when objections or enhancements are addressed just as they come to the reader's mind.

Perhaps something positive will come from this post.