The (apparent) state of trace caches on modern CPUs

By: Exophase, August 22, 2016 7:55 pm
Room: Moderated Discussions
Maynard Handley on August 22, 2016 3:10 pm wrote:
> That sounds a HELL OF A LOT like a trace cache that dare not say its name...

It's not, though. A trace cache and a post-decode cache perform two completely different functions: the former addresses fetch bandwidth/power/area, the latter decode bandwidth/power/area. Netburst's structure was merely named a trace cache, but it was actually both. Sandy Bridge's uop cache does not cache traces.

Note that Sandy Bridge onward still retains a (post-decode) loop buffer as well.

> (However P4 had trace cache as 8-way, 12K µops, SB has direct mapped 1.5K µops)

The uop cache in SB and its successors is 8-way.

> The Intel slide for this
> is conveniently vague, talking as it does about an ability to "stitch" across branches in the control flow.

Read the RWT article instead:

> As always it would be interesting to know what Apple does. Both companies obviously want
> to save power and improve performance. A decoded µop cache of traces seems to be more
> helpful towards this goal than a pure decoded µop loop buffer, allowing as it does more
> code of a more varied nature than just a single straight line loop trace.

There's much less incentive to use a post-decode cache on ARM than on x86, especially if the design is more optimized for AArch64 (no idea whether this is true yet to any extent whatsoever for Apple).

> Of course it uses more area, and how much pain is there (design performance, power) in flushing
> a trace once a branch mis-prediction is detected?

Neither uop caches nor trace caches need to flush anything on branch mispredictions.

There seems to be a misunderstanding that trace caches perform static branch prediction encoded in the trace. Maybe that's true for some other uarch, but on Netburst the trace fetch has its own branch predictor. It works the same as any predictor, except that instead of predicting a branch's direction against program order it predicts it against trace order. A branch is (probably substantially) more likely to look not-taken in the traced version, because the trace ordering is chosen based on what the decode-time predictor chose rather than what the programmer or compiler chose; that's why the traced version is more likely to avoid short fetches and taken-branch stalls. Nonetheless, if the prediction goes against the trace, the penalty is similar to what it would be in a non-trace cache.
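To make "trace order vs program order" concrete, here's a tiny Python sketch (entirely hypothetical structures, not Netburst's actual hardware) of how a trace builder could lay basic blocks out along the predicted path, so that a predicted-taken branch becomes a fall-through in trace order:

```python
def build_trace(successors, predict_taken, start, max_blocks=4):
    """Lay out basic blocks in predicted execution order.

    successors maps block -> (taken_target, fallthrough_target);
    predict_taken maps block -> decode-time prediction for its branch.
    """
    trace, blk = [], start
    while blk is not None and len(trace) < max_blocks:
        trace.append(blk)
        taken, fallthrough = successors.get(blk, (None, None))
        # Follow whichever successor the decode-time predictor chose.
        blk = taken if predict_taken.get(blk, False) else fallthrough
    return trace

# A's branch is predicted taken to C, so C is placed right after A:
succ = {"A": ("C", "B"), "C": ("E", "D")}
pred = {"A": True, "C": False}
print(build_trace(succ, pred, "A"))  # ['A', 'C', 'D']
```

In trace order, A's branch now looks not-taken; the trace-fetch predictor only has to intervene when execution diverges from the path embedded in the trace.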

Traces may end up flushed or modified periodically to better align with the predictors, but this is only an optimization. If done sensibly, anyway.

> Intel obviously has MORE incentive to
> go down this path than Apple, given the higher level of pain in their decode.
> But it seems like the potential win here is substantial. The same thesis puts the power used
> by branch prediction, fetch, and decode on an ARM A15 at ~40% of CPU power. If you can
> reduce that to, I don't know, 10% of power while executing out of such a cache 80%
> of the time [SB claimed a hit rate of ~80%], that's a very nice win. So I wonder
> if Apple has done much the same thing, only with power as their primary driver?

The problem with trace caches is that they result in a lot of code duplication. Intel said that Netburst's 12K-uop trace cache had a hit rate comparable to that of an 8KB-16KB L1 icache. If x86 instructions average about 4 bytes and decoded into about 1.5 uops each on Netburst, that'd mean a 2-4x overhead from duplication. That's pretty bad. You could mitigate it by performing less actual tracing when tracing is detected to lead to duplication, but that would remove some of the benefit of the trace cache. Maybe there's a sweet spot that's significantly better than what Intel hit.
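The 2-4x figure works out as a back-of-envelope calculation, under the assumed ~4 bytes per x86 instruction and ~1.5 uops per instruction stated above:

```python
# Rough duplication estimate: how many uops' worth of unique code does an
# 8KB or 16KB icache hold, versus the 12K uop slots in the trace cache?
BYTES_PER_INSN = 4      # assumed average x86 instruction length
UOPS_PER_INSN = 1.5     # assumed average uops per instruction on Netburst
TRACE_CACHE_UOPS = 12 * 1024

for icache_bytes in (8 * 1024, 16 * 1024):
    insns = icache_bytes / BYTES_PER_INSN
    uops_equiv = insns * UOPS_PER_INSN  # uops that code expands to, uniquely
    overhead = TRACE_CACHE_UOPS / uops_equiv
    print(f"{icache_bytes // 1024}KB icache ~ {uops_equiv:.0f} uops "
          f"-> {overhead:.1f}x duplication")
```

The 8KB end gives 4x and the 16KB end gives 2x, hence the 2-4x range.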

Trace caches also need to perform a mapping to convert architectural instruction addresses to trace addresses, which isn't totally trivial and therefore eats more space and power. And they don't like self-modifying code at all: Netburst flushed the entire thing. You can try to track or scan for overlapping portions to selectively flush only the affected lines, but there's overhead in doing that.

Netburst also had to spend area distributing branch prediction resources to the trace fetch and slow fetch paths.

The problem with post-decode caches is that stored uops tend to take up more space than their source instructions, so fetch width goes up and consumes more power. And there's still some mapping needed to find an instruction's offset within a uop cache line, although it shouldn't be as bad as the mapping for a trace cache.

Netburst really got the worst of both worlds on size. The original design had an 80KB trace cache and Prescott had a 96KB one. For an icache structure with the hit rate of an 8-16KB conventional icache, that's really bad! Since Netburst, uop sizes have surely only gone up, quite likely by a whole lot; I wouldn't be terribly surprised if SB's uop cache were physically as large as the L1 icache. An AArch64 CPU wouldn't necessarily need such big uops, but things like immediate fusion and branch fusion result in larger uops, not to mention a uop format that can efficiently support both AArch32 and AArch64 instructions.
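For what those figures imply about uop size, assuming both arrays held the 12K-uop capacity mentioned above (an assumption for Prescott; the storage also includes tags and branch metadata, so this overstates the pure uop payload):

```python
# Implied storage per uop slot, from the assumed array sizes and capacity.
for name, total_bytes in (("Willamette", 80 * 1024), ("Prescott", 96 * 1024)):
    per_uop = total_bytes / (12 * 1024)  # assumed 12K-uop capacity
    print(f"{name}: ~{per_uop:.2f} bytes per uop slot")
```

That's roughly 6.7-8 bytes per uop slot against the ~4 bytes of an average x86 instruction, before even counting the duplication overhead.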