Caching dependence info in µop cache

By: Paul A. Clayton, October 18, 2020 1:28 pm
Room: Moderated Discussions
anon on October 16, 2020 12:36 pm wrote:

> Thank you, I think I got it. I will point out that it could be possible to implement this directly at the I-Cache
> by rewriting a custom format on a line fill if the instruction format permits it.

Yes, and this would seem likely to be very messy for x86.

For an Icache one would also probably want to consider basic block boundaries. (One alternative might be to use fetch chunk alignment with simple reversal of the renaming simplification when jumping into the middle.)

The boundary between predecoded Icache and µop cache may be a bit fuzzy. (I suspect the boundary between trace cache and µop cache may also be fuzzy.)

> If not, then doing it in the
> uop cache does seem like the place, even though in practice it becomes hard to tell if it will impact cycle time
> or not. Essentially, another level of choice is added at Rename : Who uses the free list and who uses the RAT
> (for sources). It might not be a big deal but it means that each "lane" from Decode has to enable a different
> circuit whereas in the baseline each "lane" would go straight to the RAT period (for sources).

The free list readers would be necessary anyway (and since they are dependent operations they can be delayed).

With fetch based on a larger chunk that does not cross instruction boundaries (with trailing immediates not counted as part of an instruction), it seems one could rearrange instruction components to merge like portions if that simplified routing.

> Indexed wake up on the scheduler has been argued to be much more power efficient (e.g. TRIPS), but in this
> case, it would be added on top of the baseline broadcast wake up and so it sounds like a hard sell to
> the circuit designer, especially for something that would only work if the source and the last destination
> are in the same dispatch bundle. Maybe an indexed write port used for dispatch could be stolen...

Result consumers tend to be close to result producers (for single-cycle operations this has obvious benefits, and compilation specifically targeting an out-of-order implementation might also increase the proximity of consumer and producer), so for some workloads such proximity might be the common case.
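To make the proximity point concrete, here is a toy sketch (the trace format and register names are invented, not from any real design) that measures how often a source operand's most recent producer was dispatched in the same bundle — the case where an index could stand in for a CAM entry:

```python
# Hypothetical toy µop trace: (dest_reg, [src_regs]); contents are illustrative.
trace = [
    ("r1", []),            # load
    ("r2", ["r1"]),        # consumer of the load
    ("r3", ["r2"]),        # chained consumer
    ("r4", []),            # independent op
    ("r5", ["r4", "r1"]),
    ("r6", ["r5"]),
]

def same_bundle_fraction(trace, width):
    """Fraction of source operands whose most recent producer was
    dispatched in the same bundle of the given width."""
    last_writer = {}   # reg -> position of most recent producer
    same, total = 0, 0
    for pos, (dest, srcs) in enumerate(trace):
        for s in srcs:
            if s in last_writer:
                total += 1
                if last_writer[s] // width == pos // width:
                    same += 1
        last_writer[dest] = pos
    return same / total if total else 0.0

print(same_bundle_fraction(trace, 4))  # 0.6 for this toy trace
```

A real study would of course run this over long traces of actual workloads; the toy trace only shows the bookkeeping.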

This might be a design aspect that works well with CAM-sharing designs intended to exploit variable numbers of not-ready operands; an index-triggered operation would be a zero-CAM-operand operation leaving two (or three) CAMs associated with a different operation. (Paired resource sharing might be practical, but one can imagine more funky sharing where each operand tag that can be allocated to one operation can be used by a different operation while still using physically adjacent connection.)
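A minimal behavioral sketch of such a hybrid scheduler (all names and structure are illustrative, not a claim about TRIPS or any shipping design): operands whose producer is known at dispatch are woken by a direct indexed write, while the rest keep broadcast tag comparison, so an index-triggered operand consumes no CAM slot.

```python
class Entry:
    """One scheduler entry with a variable number of CAM operands."""
    def __init__(self, cam_tags, indexed_count):
        self.cam_tags = set(cam_tags)          # operands needing broadcast match
        self.pending_indexed = indexed_count   # operands woken by direct index

    def ready(self):
        return not self.cam_tags and self.pending_indexed == 0

class Scheduler:
    def __init__(self):
        self.entries = {}

    def dispatch(self, eid, cam_tags, indexed_count=0):
        self.entries[eid] = Entry(cam_tags, indexed_count)

    def broadcast(self, tag):
        # All-to-all: every entry compares the result tag against its CAM slots.
        for e in self.entries.values():
            e.cam_tags.discard(tag)

    def indexed_wakeup(self, eid):
        # Producer writes readiness directly into one known consumer entry,
        # touching a single entry instead of comparing against all of them.
        self.entries[eid].pending_indexed -= 1

sched = Scheduler()
sched.dispatch(0, cam_tags={"p7"})                   # waits on broadcast of p7
sched.dispatch(1, cam_tags=set(), indexed_count=1)   # zero-CAM, index-woken

sched.broadcast("p7")
sched.indexed_wakeup(1)
print(sched.entries[0].ready(), sched.entries[1].ready())  # True True
```

The CAM-sharing idea above would correspond to the `cam_tags` capacity being pooled across paired entries rather than fixed per entry.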

I think there was also research on using indexing for the dependents of operations waiting on a data cache miss (where there is time to examine dependencies and move tracking information around).
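The mechanism might look something like this sketch (entirely illustrative; tag and index names are invented): while the miss is outstanding, the latency hides the work of gathering dependent entry indices, and the fill then wakes them by direct index writes rather than a broadcast.

```python
class MissTracker:
    """Toy model: per outstanding load miss, lazily collect the scheduler
    indices of dependent operations for indexed wakeup on fill."""
    def __init__(self):
        self.waiters = {}   # load tag -> list of dependent entry indices

    def record_dependent(self, load_tag, entry_idx):
        # Done while the miss is in flight; the miss latency hides this work.
        self.waiters.setdefault(load_tag, []).append(entry_idx)

    def fill(self, load_tag):
        # On fill, return the entries to wake by index; no CAM compare needed.
        return self.waiters.pop(load_tag, [])

mt = MissTracker()
mt.record_dependent("ld5", 12)
mt.record_dependent("ld5", 30)
print(mt.fill("ld5"))   # [12, 30]
```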

I do wonder if a limited dependency network (less general than all-to-all) that would map well to "typical" programs exists that would be justifiable by area and power savings. There should also be ways to exploit recurring schedules (similar to software pipelining of loops — this has been proposed academically) and correlated latencies (cache misses to the same cache block, cache misses for "old" or first-use misses — with I/O coming through the processor chip and potentially being cached, it is not clear anything could be classified as a first-use miss).

Reducing the amount of routing, comparison, and selection work may be a useful goal, but seeking to exploit higher level (e.g., task) parallelism might be more productive. (I do not think the metaphor of fighting the previous war quite applies, but priorities and tradeoffs do seem to shift.)

With arbitrary code, a specialized design will typically introduce lower performance in some cases. Since the x86 market places significant value on performance compatibility, specializations (or even just changes to how the 95% of existing code that runs well performs, or to how poorly the remaining 5% runs) are probably less attractive.

[The above was a bit more digressive than usual even for me, but some of the lurkers might benefit a little.]