Branch/jump target prediction -8

By: anon, August 26, 2016 7:14 am
Room: Moderated Discussions
anon on August 26, 2016 8:00 am wrote:
> Maynard Handley on August 18, 2016 10:02 am wrote:
> > ⚛ on August 16, 2016 9:23 am wrote:
> > > Ricardo B (ricardo.b.delete@this.xxxxx.xx) on August 16, 2016 7:39 am wrote:
> > > > ⚛ on August 16, 2016 6:40 am wrote:
> > > > >
> > > > > The outcome of a conditional branch depends on data (not on
> > > > > code). An instruction prefix can help branch prediction
> > > > > only if the prefix is mentioning a register or a memory location containing the prediction(s).
> > > >
> > > > You're suggesting to have a register/memory access in the path of the branch prediction logic?
> > > > That will just stall the fetch/decode pipeline.
> > > > That's like operating without branch prediction and just stalling
> > > > the pipeline until the branch instruction is retired.
> > >
> > > What would be an alternative option to improving branch prediction over Skylake's other
> > > than computing the prediction and somehow feeding it to the parts of CPU which need it?
> > >
> > > If a CPU needs to know the value of the register used by the conditional branch 8 cycles before the
> > > branch opcode enters the pipeline, then the compiler/programmer
> > > will emit branch-with-prediction-in-register
> > > only in case the register can be filled with information 8 cycles before the branch.
> > >
> > > An alternative option that comes to my mind is: Conditional
> > > jump after 8 instructions. That is, the conditional
> > > branch instruction enters the pipeline, the CPU postpones
> > > its execution, starts counting from 1 to 8, and after
> > > it executes/retires the 8th instruction it executes the postponed conditional branch instruction.
> > >
> > > Question: How many conditional branches in real-world code are predictable
> > > 8 cycles before the branch instruction enters the CPU pipeline?
> > >
> > > Linus believes (why?) that such a question is totally out of the question.
> >
> > The issue is not the predictability of branches,
> > it is that you seem to have a broken model
> > of what's difficult on modern high performance CPUs, and thus how they operate.
> > One of the big problems on a modern high-end CPU is maintaining fetch bandwidth, ie feeding the core
> > with instructions as fast as the core can use up those instructions. The numbers that matter are
> > - there's about a branch of some sort every six instructions or so. (Ideally fewer
> > in modern code that uses csel or cmov appropriately, but still very often)
> > - you obviously want to sustain 4 to 6 to 8 instructions
> > every cycle (depending on exactly which CPU you are)
> > - ideally you'd prefer not to have to bank or double-port your I-cache, so you can
> > only pull in one line per cycle (meaning that some fraction of your lines will be
> > partial because they start say 3 instructions before the end of the cache line)
> >
> > Put this all together and you essentially HAVE to expect (at least) one branch per cycle in
> > the fetch stream, which in turn means that you have fetch as this tiny little self-contained
> > micro-computer all by itself; whose entire job is essentially to predict, for each cycle,
> > - what's the address from which I pull in instructions this cycle
> > - how many instructions do I pull in
> Most processors have a fixed fetch width (which is also aligned,
> although possibly to a smaller granularity than the full fetch width).

(Should say that of course on VLE ISAs there can be some quirks in exactly how fetch buffers are mapped and "width" might depend on instruction length, and of course some, like Intel's, can be unaligned.)

> > To achieve this at least three predictors are used: the link stack, the branch target buffer (ie addresses
> > of computed branches --- vtables, switches, proc ptrs),
> > and the branch direction predictor for simple "branch
> > if non-zero" type branches. Each of these has to generate a prediction within less than a cycle, one of the
> > predictions has to be chosen as the winner, and the prediction has to be sent to the I-cache. Not easy.
> This is not true. First of all, the prediction does not just get sent to the icache and come
> back each cycle; the icache latency alone will be several cycles. Instructions are taken into
> the front end from fetch buffers. Second, the prediction to feed the front end can be generated
> on the previous cycle; it is the access/forwarding that has to be sent to ifetch on this cycle.
> Third, not all predictors have to have a zero cycle access latency. It is not uncommon to have
> a small branch target address cache that is based on PC of previous cycle, and longer latency predictors
> that can cause flushes of part of front end but not back end in case they disagree.
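For readers unfamiliar with the third predictor mentioned above, the simplest classic direction predictor is a table of 2-bit saturating counters indexed by low PC bits (a bimodal predictor); real designs such as TAGE are far more elaborate, so this is only a minimal sketch:

```python
class BimodalPredictor:
    """Minimal sketch of a bimodal direction predictor: a table of 2-bit
    saturating counters indexed by low PC bits. Counter >= 2 predicts taken.
    Table size and initial state are illustrative choices."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [1] * entries  # start weakly not-taken

    def _index(self, pc):
        return pc % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

The 2-bit hysteresis is the point of the design: after a loop branch trains the counter to "strongly taken", the single not-taken outcome at loop exit does not flip the next prediction.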
> >
> > Once you understand this, the difficulty in what you're suggesting becomes obvious. You seem to imagine
> > that 8 cycles is a long time to compute some branch information and pass it to the fetch unit. Not
> > even close. 8 cycles is less than the pipeline length from when your suggested instruction is fetched
> > to when it executes --- ie it executes AFTER the prediction it's supposed to inform is needed! And
> > that's assuming "immediate" execution; on the CPUs we are discussing, the instruction may actually
> > sit in the dispatch queue for tens of cycles or more until it gets its chance to execute.
> >
> > It is already the case that branch predictors only get feedback as to whether their
> > predictions were accurate many cycles after the prediction was made --- and this
> > is a fact of no little significance in tweaking them for optimal performance.
> >
> > It's also not clear that your suggestion would provide much value. The fact is that most code
> > sequence branches ARE now well predicted. Not perfectly, but well. Link stacks work very well.
> > TAGE-class branch predictors work very well (well even on computed branches). And compilers
> > are getting better about using csel/cmov for many of the difficult (data-dependent but very
> > small manipulation branches) that remain (think max() or abs() type calculations).
> >
> > What provides more value is fixing the remaining problems
> > in this fetch pipeline.
> Do you have a source for these claims? Branch prediction is still one of *the*
> critical factors limiting usefulness and efficiency of speculation.
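The csel/cmov point above — compiling small data-dependent choices like max() or abs() into conditional selects rather than branches — corresponds to the classic branchless bit tricks, sketched here in Python with arbitrary-precision integers standing in for two's-complement registers:

```python
def branchless_max(a, b):
    # -(a < b) is 0 or -1 (an all-ones mask); the mask selects b when a < b.
    return a ^ ((a ^ b) & -(a < b))

def branchless_abs(x):
    m = -(x < 0)        # 0 or -1 (all-ones mask when x is negative)
    return (x ^ m) - m  # identity when m == 0, two's-complement negate when m == -1
```

A compiler emitting cmov/csel achieves the same effect in hardware: the result is computed from data without exposing an unpredictable branch to the fetch pipeline.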
> > These might include things like
> > - fetching across a cache line boundary in one cycle
> "Fetch", as in take from ifetch buffer, yes. Load from icache I doubt.
> >
> > - detecting that the next upcoming branch will be fall through (and so can
> > be ignored and the fetch count extended up to the next branch point)
> What do you mean by this? Branch predictors already detect if a branch will be taken
> or not. What do you mean by "fetch count extending up to the next branch point"?
> >
> > - using both fast and higher accuracy branch predictors. The idea is that the fast predictor delivers,
> > within one cycle, a result that is "usually" accurate which is acted upon. Simultaneously the slow
> > predictor delivers (one or two cycles later, for the same "fetch cycle") a better prediction. If
> > the two predictions match, great. If not, the "wrong" instructions in the fetch buffer are flushed
> > and we refetch. Not ideal --- but all we've lost is one cycle instead of an entire pipeline flush.
> > This allows us to have both a very low cycle time and almost all of the performance of a highly
> > accurate predictor at the cost of an occasional one cycle hiccup in fetch.
> This technique is probably older than you are.
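The overriding-predictor scheme described above can be sketched as a cost model: the fast predictor's answer is acted on immediately, the slow predictor can override it a cycle or two later for a small refetch penalty, and only a wrong final prediction costs a full flush (the penalty figures below are illustrative assumptions, not from any specific design):

```python
def branch_cost(outcome, fast_pred, slow_pred,
                override_penalty=1, flush_penalty=15):
    """Extra cycles charged for one branch under a fast+slow override scheme.
    Penalty values are illustrative, not taken from a real microarchitecture."""
    cost = 0
    if slow_pred != fast_pred:
        cost += override_penalty  # slow predictor overrides: flush fetch buffer, refetch
    if slow_pred != outcome:
        cost += flush_penalty     # final prediction still wrong: full pipeline flush
    return cost
```

The payoff is visible in the cost asymmetry: a disagreement the slow predictor gets right costs one cycle, whereas relying on the fast predictor alone would have paid the full flush.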
> >
> > As far as I know, IBM does all 3 of these on POWER8. I don't know about Intel or Apple.
> >
> > Another thing you could imagine is also storing an indicator of branch confidence, and, for low confidence
> > branches, stalling the pipeline until you are sure what the outcome will be. This sounds like a good
> > idea that would save energy while probably not hurting performance, but I don't think it's implemented
> > in any high performance CPU for the sorts of reasons I mentioned at the start of this post: it demands
> > tighter feedback from the end of the pipeline to the very start than is easy to implement while keeping
> > everything else as fast as possible (ie low cycle time, shortest possible stall time while waiting for
> > the outcome to be resolved, fastest possible flush after control misprediction, etc).
> Another amazing idea you should patent. I'm sure none of those idiot
> branch predictor designers ever thought of such a thing before.
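The confidence-tracking idea discussed above is commonly described in the literature as a resetting counter per branch: many consecutive correct predictions mean high confidence, and a single miss drops confidence to zero. A minimal sketch (threshold and counter width are illustrative choices):

```python
class ConfidenceEstimator:
    """Resetting confidence counter for one branch: saturates upward on
    correct predictions, resets on a misprediction. The threshold and
    maximum are illustrative parameters, not from any shipping design."""

    def __init__(self, threshold=4, max_count=7):
        self.threshold = threshold
        self.max_count = max_count
        self.count = 0

    def record(self, prediction_correct):
        if prediction_correct:
            self.count = min(self.max_count, self.count + 1)
        else:
            self.count = 0  # one miss zeroes confidence

    def high_confidence(self):
        return self.count >= self.threshold
```

A core could consult `high_confidence()` to decide whether to speculate past a branch or stall/gate fetch, which is exactly the energy-versus-latency trade-off debated above.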
