The (apparent) state of trace caches on modern CPUs

By: Maynard Handley, August 22, 2016 2:10 pm
Room: Moderated Discussions
Pursuant to this whole subject, people might find the following claim interesting (from Mitchell Hayenga's thesis, 2013, UW Madison):
Within industry, the majority of recently introduced high performance microprocessors support decoded loop buffers. ...
With the Intel Sandybridge [45] processor, Intel introduced a μop cache instead of a loop buffer. μop caches tradeoff some of the power efficiency of loop caches in exchange for capturing more instructions and behaviors. Thus codes which frequent and simple loops may be better served by a traditional loop cache, however μop caches are more robust and able to derive benefit more irregular codes. Essentially, μop caches operate as traditional caches which hold decoded instructions. However, they share some characteristics with loop caches. [b]In current commercial implementations, μop caches encode predicted branch paths.[/b] If branch paths differ from previously predicted paths, like loop caches the μop cache must be flushed and refilled.

That sounds a HELL OF A LOT like a trace cache that dare not say its name...
(However P4 had trace cache as 8-way, 12K µops, SB has direct mapped 1.5K µops)
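To make the "encodes predicted branch paths" point concrete, here is a minimal toy sketch of the behaviour the thesis describes: µops are stored per (start address, predicted path), and a fetch along a different path flushes and refills. Everything in it (ToyTraceCache, fake_decode, the string "µops") is hypothetical and invented for illustration; it is not a description of what Intel's hardware actually does.

```python
from typing import Callable, Dict, List, Tuple

Path = Tuple[bool, ...]          # predicted taken/not-taken outcomes, in order
TraceKey = Tuple[int, Path]      # (start address, predicted path)


class ToyTraceCache:
    """Hypothetical model: uops stored per (start address, predicted path)."""

    def __init__(self) -> None:
        self.traces: Dict[TraceKey, List[str]] = {}
        self.refills = 0

    def fetch(self, start_pc: int, path: Path,
              decode: Callable[[int, Path], List[str]]) -> List[str]:
        key = (start_pc, path)
        if key in self.traces:
            return self.traces[key]          # hit: decoders stay idle
        # Path differs from what was stored (or nothing was stored): flush
        # the traces for this start address and refill along the new path.
        self.traces = {k: v for k, v in self.traces.items() if k[0] != start_pc}
        self.traces[key] = decode(start_pc, path)
        self.refills += 1
        return self.traces[key]


def fake_decode(start_pc: int, path: Path) -> List[str]:
    # Stand-in for the real decoders: one label per basic block on the path.
    return [f"block{i}@{start_pc:#x}:{'T' if taken else 'N'}"
            for i, taken in enumerate(path)]


cache = ToyTraceCache()
cache.fetch(0x400, (True, False), fake_decode)   # cold miss, refill
cache.fetch(0x400, (True, False), fake_decode)   # same path, hit
cache.fetch(0x400, (True, True), fake_decode)    # different path, flush + refill
print(cache.refills)
```

A plain decoded µop cache would key on the address alone; the path in the key is what makes this look like a trace cache.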

The Intel slide for this is conveniently vague, talking as it does about an ability to "stitch" across branches in the control flow.

As always it would be interesting to know what Apple does. Both companies obviously want to save power and improve performance. A decoded µop cache of traces seems to be more helpful towards this goal than a pure decoded µop loop buffer, allowing as it does more code of a more varied nature than just a single straight line loop trace.

Of course it uses more area, and how much pain is there (design performance, power) in flushing a trace once a branch mis-prediction is detected? Intel obviously has MORE incentive to go down this path than Apple, given the higher level of pain in their decode.
But it seems like the potential win here is substantial. The same thesis puts the power used by branch prediction, fetch, and decode on an ARM A15 at ~40% of CPU power. If you can reduce that to, I don't know, 10% of power while executing out of such a cache 80% of the time [SB claimed a hit rate of ~80%], that's a very nice win. So I wonder if Apple has done much the same thing, only with power as their primary driver?
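As a rough back-of-envelope check on those numbers, the sketch below time-weights the front-end power share. It assumes (my assumptions, not the thesis's) that the 80% hit rate can be read as a fraction of execution time, that the front end costs the full 40% on a miss and 10% on a hit, and that the rest of the core's power is unchanged. The function name is made up for illustration.

```python
def average_front_end_share(miss_share: float, hit_share: float,
                            hit_rate: float) -> float:
    """Time-weighted front-end power, as a fraction of original CPU power."""
    return hit_rate * hit_share + (1.0 - hit_rate) * miss_share


baseline = 0.40   # branch prediction + fetch + decode share quoted for the A15
avg = average_front_end_share(miss_share=0.40, hit_share=0.10, hit_rate=0.80)
saving = baseline - avg

print(f"average front-end share: {avg:.2f}")    # 0.8*0.10 + 0.2*0.40 = 0.16
print(f"saving vs. baseline:     {saving:.2f}")  # 0.40 - 0.16 = 0.24
```

Under those assumptions the front end averages about 16% of the original CPU power, so the saving is roughly a quarter of total CPU power.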

Megol on August 21, 2016 10:27 am wrote:
> sylt on August 19, 2016 11:48 am wrote:
> > Megol on August 19, 2016 7:42 am wrote:
> > > Okay I misunderstood. While data/instruction cache coherency is indeed a problem in
> > > the P4 it is most commonly referred to as a problem for self-modifying code. What is
> > > commonly called coherency is keeping caches on different processors/cores updated.
> > >
> >
> > Although I don't know much about the P4 trace cache specifically I think the main point here was that the
> > processor never knows *IF* there is a need for coherency
> > (i.e. self modifying code) or not. Coherency always
> > comes into play regardless of if it is needed by the application or not. If the coherency is complicated
> > and thus maybe a bit pessimistically implemented you could have lots of false evictions for code that is
> > not really self modifying but maybe merely wrote some "near miss" data or similar pitfalls.
> It is problematic - but again mostly because the P4 was a bad design (my personal opinion - some
> parts of it were beautiful but most were at the level of "what the f**k were they thinking").
> --
> Let's compare to an IMHO better design with a trace cache: Have a reasonably sized, snooping
> L1 I cache followed by 3-4 legacy format decoders followed by trace creation logic (which
> can be a mix of hardware and software) and then a reasonably sized L0 I cache of the
> trace type followed by internal format decoders and then issue logic etc.
> Code that doesn't hit in the trace cache will go through the L1 cache and then the legacy
> decoders. A branch mispredict that doesn't hit the trace cache will skip the trace creation
> logic and the trace cache/internal decoders which reduces restart latency.
> Self-modifying code etc. will require flushing of traces (theoretically one could do a trace
> repair but that is probably too complex for real machines) _but_ while the trace cache is flushed
> the execution unit can use the traditional fetch mechanism with similar delays that current
> x86 machines have - unless of course the traditional front end has been powered down.
> Sounds like a decoded µop cache? It is similar but not the same thing. The decoded µop
> cache is much easier to add to an existing design and is more conservative as proper trace
> creation isn't needed. It can shave some complexities in decoding x86 instructions too.
> But it also doesn't get the core advantages of a trace cache
> as a way to "solve" the instruction fetch problem.
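For what it's worth, the fallback behaviour in the design Megol describes can be sketched as a toy model: an L0 trace cache in front, the L1 plus legacy decoders behind it, and a flush on self-modifying code that costs a refill rather than a stall. This is only a sketch under heavy simplification; ToyFrontEnd and its methods are hypothetical names, traces are keyed by address alone, and the mispredict shortcut that skips trace creation is not modelled.

```python
from typing import Callable, Dict, List


class ToyFrontEnd:
    """Hypothetical front end: L0 trace cache backed by legacy decode."""

    def __init__(self, legacy_decode: Callable[[int], List[str]]) -> None:
        self.legacy_decode = legacy_decode      # stands in for L1 I$ + decoders
        self.trace_cache: Dict[int, List[str]] = {}
        self.legacy_fetches = 0

    def fetch(self, pc: int) -> List[str]:
        trace = self.trace_cache.get(pc)
        if trace is not None:
            return trace                        # served from the trace cache
        # Miss: fall back to the traditional fetch/decode path, then let the
        # trace creation logic install the result for next time.
        self.legacy_fetches += 1
        uops = self.legacy_decode(pc)
        self.trace_cache[pc] = uops
        return uops

    def on_code_write(self) -> None:
        # Self-modifying code: flush all traces (no trace repair attempted).
        # Execution continues through the legacy path until traces rebuild.
        self.trace_cache.clear()


fe = ToyFrontEnd(lambda pc: [f"uop@{pc:#x}"])
fe.fetch(0x100)        # miss, legacy path
fe.fetch(0x100)        # hit
fe.on_code_write()     # flush
fe.fetch(0x100)        # miss again, legacy path still works
print(fe.legacy_fetches)
```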