Branch/jump target prediction

By: Maynard Handley (name99.delete@this.name99.org), August 20, 2016 6:07 am
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on August 20, 2016 12:34 am wrote:
> Maynard Handley (name99.delete@this.name99.org) on August 19, 2016 11:46 am wrote:
> > Megol (golem960.delete@this.gmail.com) on August 19, 2016 7:42 am wrote:
> > > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 10, 2016 5:14 pm
> > > Congratulations! Perhaps you will someday understand that saying trace caches aren't a
> > > bad concept isn't the same thing as saying the P4 had a good front-end. It hadn't and
> > > it isn't really relevant to this thread that doesn't start out as Pentium 4 worship!
> > >
> > > > So you're much better off doing just a L0 I$ predecoded cache on an
> > > > instruction boundary level, and forget entirely about the traces.
> > >
> > > That solves a mostly different problem than the trace cache.
> > >
> > > > > Perhaps you should look up what a trace cache is before stating things like that?
> > > >
> > > > Yeah, let's just imagine that I worked for a company that did very
> > > > similar things and actually generated traces on real loads.
> > >
> > > Similar things sure. But doing trace scheduling isn't the same thing as doing trace caches.
> > >
> > > Trace caches were created to increase fetch bandwidth for wide superscalar processors
> > > for realistic real-world code where branches are common. It is an alternative to things
> > > like multi-way branch predictors, collapsing instruction buffers etc.
> > >
> > > > In other words, I haven't just masturbated over academic
> > > > papers like you apparently do. I do know how they work.
> > >
> > > It is obvious that you don't know how they work - you don't
> > > even understand why they were created in the first place!
> > >
> > > If reading academic papers, verifying that they measure the correct things and building ones understanding
> > > on that is masturbation (I interpret that as "fucking around
> > > without real results" as any other would be puerile)
> > > I wonder what we should call the act of not understanding a topic, incorrectly thinking one have experience
> > > in the area and then in an act of ego-stroking loudly call the world to see this "expertise"?
> > >
> >
> > I suspect the real resolution here is that there may have been a narrow
> > window of time during which trace caches make sense, but no longer.
> > Look at the description I gave of modern fetch. That's basically an on-going dynamic
> > construction of traces. So why would statically constructed traces be better?
> > - the predictor used for the dynamic construction of traces will obviously be better than that
> > for static construction since it is constantly being updated (and even if the static trace
> > construction feels it can use multi-cycle predictors for really high quality prediction, as
> > I mentioned above, the same multi-cycle predictors can be used on a modern processor)
> > - obviously the trace cache takes up transistors and there are hassle issues in
> > terms of making sure it stays in sync with all the other processor/memory state
> > - there might theoretically be a power advantage to using traces, but even that, I suspect
> > is minimal. Most of your time is generally spent in loops, and a loop buffer can act
> > like a very minimal trace cache --- which does however handle the hot case
> >
> > So I suspect traces made sense during a period when people wanted to pull in lots of instructions
> > but weren't able to devote enough transistors to doing so properly (with a truly aggressive fetch
> > front-end). But those days are over --- we have plenty of resources to apply to fetch and static traces
> > are basically a poor cousin to what we're doing today. Maybe they'd be a good choice for an intermediate
> > low-end chip (ARM M4 sort of thing) in terms of the performance vs power tradeoff?
> > But for a high-end chip I don't see it.
>
> I don't think instruction delivery is close to solved at all. Trace
> caches are problematic, but fetch and decode is a huge problem.

Can you quantify this, David, or give some justification for the statement?

Suppose we take an A9X as baseline, so we want 6 instructions per cycle, and look at the orders of magnitude involved.
This means we have to
- correctly predict about one branch event per cycle, which is doable (to 9X% accuracy or so)
- pull in about 6 instructions = 24 bytes, which is too tight on average --- it's what you get if you have 64-byte lines, jump to a random point in the line, and can't load from two I-cache lines in one cycle.
But this CAN be fixed fairly easily if we want, by banking the cache and making the line-load state machine a little more complex.
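To put rough numbers on the single-line limitation: assuming 64-byte lines and fixed 4-byte instructions (a reasonable assumption for an ARMv8 core like the A9X), a quick back-of-the-envelope sketch shows how often a single-line fetch comes up short of 24 bytes:

```python
# Sketch: how often a single-line fetch falls short of 6 instructions,
# assuming 64-byte I-cache lines and fixed 4-byte instructions.
LINE = 64
INSTR = 4
TARGET = 6 * INSTR  # 24 bytes wanted per cycle

entry_points = range(0, LINE, INSTR)           # possible jump offsets into a line
remaining = [LINE - off for off in entry_points]

avg = sum(remaining) / len(remaining)
short = sum(1 for r in remaining if r < TARGET)

print(f"average bytes available: {avg:.1f}")                   # 34.0
print(f"fetches short of 24 bytes: {short}/{len(remaining)}")  # 5/16

# With a banked cache that can read the tail of one line and the head of
# the next in the same cycle, every fetch can supply the full 24 bytes.
```

So on average you're fine, but roughly a third of random jump targets leave you under 24 bytes for that cycle --- which is exactly what the banking fix addresses.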

If we want more than 6 per cycle (up to, say, around 9 or 10) then we have to handle two branches per cycle. The easiest way to do that is what I suggested in the original comment --- ALSO test in the same cycle whether the second branch is predicted not-taken and, if so, fetch past it. You won't always get lucky, but you will a lot of the time, and with a nicely sized fetch buffer that's good enough.
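That fetch rule can be sketched as follows (all names here are made up for illustration --- this is the logic, not any real design; the predictor and BTB are stand-in callbacks):

```python
# Hypothetical per-cycle fetch step: follow the first predicted-taken
# branch, fetch past a second branch when it is predicted not-taken,
# and stop before a third branch (only two predictions per cycle).
def fetch_cycle(pc, width, find_branches, predict_taken, target_of):
    """One fetch cycle; pc and width are in instruction units.

    find_branches(pc, width) -> offsets of branches within the block;
    predict_taken(addr) and target_of(addr) stand in for the branch
    predictor and BTB. Returns (instructions delivered, next fetch pc).
    """
    examined = 0
    for off in find_branches(pc, width):
        if examined == 2:
            # Only two branch predictions per cycle: stop at the third branch.
            return off, pc + off
        examined += 1
        if predict_taken(pc + off):
            # Deliver through the taken branch, redirect to its target.
            return off + 1, target_of(pc + off)
        # Predicted not-taken: we got lucky, fetch straight past it.
    return width, pc + width
```

When the second branch happens to be predicted not-taken you deliver the full block; when it's taken you still deliver up through it and redirect, which is where the fetch buffer smooths things out.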

Of course there's also the additional issue of I-cache misses, but again there are solutions (imperfect, like everything, but very helpful).
At the lowest end, you simply make your L2 aging/cast-out mechanism slightly smarter so that it preferentially throws away data over instructions (because a data miss is easier to work around than an I miss); or you provide a separate L2 I$.
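A toy version of that biased cast-out idea (purely illustrative, not any shipping replacement policy) --- an LRU set that evicts the least-recently-used data line in preference to an instruction line:

```python
# Toy L2 set with LRU replacement biased to cast out data before
# instructions, since a data miss is easier to work around than an I miss.
from collections import OrderedDict

class BiasedLRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()   # tag -> is_instruction, kept in LRU order

    def access(self, tag, is_instruction):
        """Touch a line; on a miss that forces eviction, return the victim tag."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # hit: refresh LRU position
            return None
        victim = None
        if len(self.lines) >= self.ways:
            # Prefer the least-recently-used *data* line as the victim;
            # fall back to true LRU only when the set is all instructions.
            for t, is_i in self.lines.items():
                if not is_i:
                    victim = t
                    break
            if victim is None:
                victim = next(iter(self.lines))
            del self.lines[victim]
        self.lines[tag] = is_instruction
        return victim
```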
At the mid-range you fix up your damn prefetch stats (and, for that matter, your in-core branch prediction stats) so that they IGNORE steering information that came from either interrupts or non-committed instruction flow --- since both simply add noise and no value.
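The filtering amounts to: train only on fetches that actually commit, and never on interrupt-driven fetches. A minimal sketch (all names illustrative):

```python
# Sketch of the training filter: only instruction fetches that commit
# (and did not come from interrupt entry) update the prefetch /
# branch-prediction statistics. Wrong-path and interrupt fetches are noise.
from collections import Counter

class FilteredTrainer:
    def __init__(self):
        self.stats = Counter()       # per-address training counts
        self.pending = []            # speculative fetches, not yet known-good

    def on_fetch(self, pc, from_interrupt=False):
        # Record provisionally; interrupt-steered fetches are dropped outright.
        if not from_interrupt:
            self.pending.append(pc)

    def on_commit(self, pc):
        # The fetch turned out to be on the committed path: train on it.
        if pc in self.pending:
            self.pending.remove(pc)
            self.stats[pc] += 1

    def on_flush(self):
        # Misprediction recovery: discard everything still speculative.
        self.pending.clear()
```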
At the high end there are remarkably accurate I-prefetch predictors around (that work on server-class code), with the caveat that the tables they need are on the order of 2 to 4 MB :-( The thesis I read claimed, if I recall correctly, that if you were willing to essentially create a small auxiliary handler CPU living next to the real CPU, and give it some dedicated DRAM, then all the data could be juggled fast enough between DRAM and on-core storage to make this work. If you're willing to accept slightly less accuracy for a much easier implementation, you can get away with about a 5% size increase in the L2.

https://compas.cs.stonybrook.edu/~mferdman/downloads.php/MICRO08_Temporal_Instruction_Fetch_Streaming.pdf
(and see also Ferdman's PhD thesis).
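The core record-and-replay idea from that paper can be sketched in a few lines (hugely simplified --- the real proposal's streaming machinery and multi-MB tables are far more involved): log the temporal stream of I-cache miss addresses, and on re-encountering a miss, replay what followed it last time as prefetches.

```python
# Tiny sketch of temporal instruction fetch streaming: record the global
# sequence of I-cache miss addresses; when a miss recurs, prefetch the
# addresses that followed it on the previous occurrence.
class TemporalStreamer:
    def __init__(self, lookahead=4):
        self.history = []            # global miss-address log
        self.index = {}              # miss address -> last position in log
        self.lookahead = lookahead   # how far ahead to replay

    def on_miss(self, addr):
        prefetches = []
        if addr in self.index:
            pos = self.index[addr]
            # Replay what followed this address last time around.
            prefetches = self.history[pos + 1 : pos + 1 + self.lookahead]
        self.index[addr] = len(self.history)
        self.history.append(addr)
        return prefetches
```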

So what is the context in which you say this is still a problem? I'm simply curious. As far as I can tell in real machines it is solved "enough" for now, there are better solutions in storage for when we do need them, and it's also basically moot until we deal with the real problem of data misses to DRAM.