Branch/jump target prediction -8

By: Maynard Handley (name99.delete@this.name99.org), August 18, 2016 9:02 am
Room: Moderated Discussions
⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on August 16, 2016 9:23 am wrote:
> Ricardo B (ricardo.b.delete@this.xxxxx.xx) on August 16, 2016 7:39 am wrote:
> > ⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on August 16, 2016 6:40 am wrote:
> > >
> > > The outcome of a conditional branch depends on data (not on
> > > code). An instruction prefix can help branch prediction
> > > only if the prefix is mentioning a register or a memory location containing the prediction(s).
> >
> > You're suggesting to have a register/memory access in the path of the branch prediction logic?
> > That will just stall the fetch/decode pipeline.
> > That's like operating without branch prediction and just stalling
> > the pipeline until the branch instruction is retired.
>
> What would be an alternative option to improving branch prediction over Skylake's other
> than computing the prediction and somehow feeding it to the parts of CPU which need it?
>
> If a CPU needs to know the value of the register used by the conditional branch 8 cycles before the
> branch opcode enters the pipeline, then the compiler/programmer will emit branch-with-prediction-in-register
> only in case the register can be filled with information 8 cycles before the branch.
>
> An alternative option that comes to my mind is: Conditional jump after 8 instructions. That is, the conditional
> branch instruction enters the pipeline, the CPU postpones its execution, starts counting from 1 to 8, and after
> it executes/retires the 8th instruction it executes the postponed conditional branch instruction.
>
> Question: How many conditional branches in real-world code are predictable
> 8 cycles before the branch instruction enters the CPU pipeline?
>
> Linus believes (why?) that such a question is totally out of the question.

The issue is not the predictability of branches; it is that you seem to have a broken model of what's difficult on modern high-performance CPUs, and thus of how they operate.
One of the big problems on a modern high-end CPU is maintaining fetch bandwidth, ie feeding the core with instructions as fast as the core can use up those instructions. The numbers that matter are
- there's a branch of some sort roughly every six instructions (ideally fewer in modern code that uses csel or cmov appropriately, but still very often)
- you obviously want to sustain 4 to 6 to 8 instructions every cycle (depending on exactly which CPU we're talking about)
- ideally you'd prefer not to have to bank or double-port your I-cache, so you can only pull in one line per cycle (meaning that some fraction of your fetches will be partial, because they start, say, 3 instructions before the end of a cache line)

Put this all together and you essentially HAVE to expect (at least) one branch per cycle in the fetch stream (at 4 to 8 instructions fetched per cycle and a branch every ~6 instructions, the average fetch group contains about one branch). This in turn means that fetch is a tiny self-contained micro-computer all by itself, whose entire job is essentially to predict, for each cycle:
- what's the address from which I pull in instructions this cycle
- how many instructions do I pull in
To achieve this, at least three predictors are used: the link (return-address) stack, the branch target buffer (ie the targets of computed branches --- vtables, switches, procedure pointers), and the branch direction predictor for simple "branch if non-zero" type branches. Each of these has to generate a prediction in less than a cycle, one of the predictions has to be chosen as the winner, and the winning prediction has to be sent to the I-cache. Not easy.
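
To make that concrete, here is a toy C sketch of the fetch unit's per-cycle job. It is purely illustrative --- the table sizes, the fixed 4-byte instruction assumption, the indexing, and the policy of ending a fetch group at the first predicted branch are all made up, and no real microarchitecture looks exactly like this:

#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 4096
#define DIR_ENTRIES 16384

typedef struct {
    uint64_t tag;        /* PC of the branch this entry describes */
    uint64_t target;     /* predicted target address */
    bool     is_return;  /* use the link (return-address) stack, not the BTB target */
    bool     is_cond;    /* conditional: consult the direction predictor */
} btb_entry_t;

typedef struct {
    btb_entry_t btb[BTB_ENTRIES];
    uint64_t    link_stack[16];
    int         link_top;
    uint8_t     dir_ctr[DIR_ENTRIES];   /* 2-bit saturating counters, PC-indexed */
} fetch_predictors_t;

/* One "fetch cycle": given the current fetch PC and the machine's fetch width,
   decide how many instructions belong to this fetch group and where to fetch
   from next cycle.  All of this must resolve in (well under) one cycle. */
static void predict_next_fetch(fetch_predictors_t *p, uint64_t fetch_pc,
                               int max_width, uint64_t *next_pc, int *fetch_count)
{
    for (int i = 0; i < max_width; i++) {
        uint64_t pc = fetch_pc + 4ull * i;               /* fixed 4-byte instructions assumed */
        btb_entry_t *e = &p->btb[(pc >> 2) % BTB_ENTRIES];
        if (e->tag != pc)
            continue;                                     /* no branch known at this slot */

        /* First predicted branch in the group: in this toy model it ends the group. */
        *fetch_count = i + 1;
        if (e->is_return && p->link_top > 0)
            *next_pc = p->link_stack[--p->link_top];      /* link stack wins for returns */
        else if (e->is_cond && p->dir_ctr[(pc >> 2) % DIR_ENTRIES] < 2)
            *next_pc = pc + 4;                            /* predicted not-taken: fall through
                                                             (a smarter fetcher keeps the group
                                                             going --- see the list further down) */
        else
            *next_pc = e->target;                         /* BTB target: taken or computed branch */
        return;
    }
    /* No branch predicted anywhere in the group: fetch the full width, fall through. */
    *fetch_count = max_width;
    *next_pc = fetch_pc + 4ull * max_width;
}

The point is simply that all of this has to resolve every cycle, before the next fetch address is needed, with no time to go ask a register file or the memory system anything.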

Once you understand this, the difficulty in what you're suggesting becomes obvious. You seem to imagine that 8 cycles is a long time to compute some branch information and pass it to the fetch unit. Not even close. 8 cycles is less than the pipeline length from when your suggested instruction is fetched to when it executes --- ie it executes AFTER the prediction it's supposed to inform is needed! And that's assuming "immediate" execution; on the CPUs we are discussing, the instruction may actually sit in the dispatch queue for tens of cycles or more until it gets its chance to execute.

It is already the case that branch predictors only get feedback as to whether their predictions were accurate many cycles after the prediction was made --- and this is a fact of no little significance in tweaking them for optimal performance.

It's also not clear that your suggestion would provide much value. The fact is that most branches in real code sequences ARE now well predicted. Not perfectly, but well. Link stacks work very well. TAGE-class branch predictors work very well (even on computed branches). And compilers are getting better about using csel/cmov for many of the difficult branches that remain (data-dependent branches around very small manipulations --- think max() or abs() type calculations).
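
As an illustration of that last point, functions like these are the canonical candidates. At ordinary optimization levels gcc and clang usually compile both of them branch-free (a cmov on x86, a csel on AArch64), so there is simply nothing left to predict; exact codegen of course depends on the compiler and flags:

int max_int(int a, int b)
{
    /* data-dependent compare, but trivially if-convertible to a conditional select */
    return (a > b) ? a : b;
}

int abs_int(int x)
{
    /* likewise: usually becomes a small branch-free sequence */
    return (x < 0) ? -x : x;
}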

What provides more value is fixing the remaining problems in this fetch pipeline. These might include things like
- fetching across a cache line boundary in one cycle

- detecting that the next upcoming branch will fall through (and so can be ignored, and the fetch count extended up to the next branch point)

- using both a fast and a higher-accuracy branch predictor (sketched in code below). The idea is that the fast predictor delivers, within one cycle, a result that is "usually" accurate and is acted upon immediately. Simultaneously the slow predictor delivers (one or two cycles later, for the same "fetch cycle") a better prediction. If the two predictions match, great. If not, the "wrong" instructions in the fetch buffer are flushed and we refetch. Not ideal --- but all we've lost is a cycle or two instead of an entire pipeline flush. This gives us both a very low cycle time and almost all of the performance of a highly accurate predictor, at the cost of an occasional one-cycle hiccup in fetch.

As far as I know, IBM does all 3 of these on POWER8. I don't know about Intel or Apple.
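
Here is a toy, self-contained C sketch of that third (override) idea. The two predictors are deliberately crude stand-ins --- a small bimodal table playing the "fast" role and a history-indexed table standing in for something slower and TAGE-like --- and the sizes, the modelled latencies, and the alternating test branch are all invented for illustration; only the override mechanism matters:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define FAST_SIZE 1024
#define SLOW_SIZE 16384

static uint8_t  fast_table[FAST_SIZE];   /* 2-bit counters, PC-indexed: the 1-cycle predictor */
static uint8_t  slow_table[SLOW_SIZE];   /* 2-bit counters, (PC ^ history)-indexed: the slower one */
static uint32_t ghist;                   /* global branch history */

static bool fast_predict(uint64_t pc) { return fast_table[pc % FAST_SIZE] >= 2; }
static bool slow_predict(uint64_t pc) { return slow_table[(pc ^ ghist) % SLOW_SIZE] >= 2; }

static void bump(uint8_t *c, bool taken)
{
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int main(void)
{
    /* One branch that alternates taken / not-taken: hopeless for the bimodal
       table, trivial once history is folded into the index. */
    const uint64_t pc = 0x4000;
    long override_bubbles = 0, full_flushes = 0;

    for (int i = 0; i < 100000; i++) {
        bool actual = (i & 1) != 0;
        bool fast   = fast_predict(pc);   /* acted upon immediately, cycle N */
        bool slow   = slow_predict(pc);   /* conceptually arrives ~cycle N+2 */

        if (slow != fast)   override_bubbles++;   /* disagreement: 1-2 cycle refetch bubble */
        if (slow != actual) full_flushes++;       /* slow predictor also wrong: full flush */

        bump(&fast_table[pc % FAST_SIZE], actual);
        bump(&slow_table[(pc ^ ghist) % SLOW_SIZE], actual);
        ghist = (ghist << 1) | (actual ? 1u : 0u);
    }
    printf("override bubbles: %ld, full pipeline flushes: %ld\n",
           override_bubbles, full_flushes);
    return 0;
}

On this (deliberately nasty) branch the fast predictor is wrong about half the time, but nearly all of those cases are caught by the slower predictor a cycle or two later, so they cost short refetch bubbles rather than full pipeline flushes --- which is exactly the trade described above.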

Another thing you could imagine is storing an indicator of branch confidence and, for low-confidence branches, stalling the pipeline until you are sure what the outcome will be. This sounds like a good idea that would save energy while probably not hurting performance, but I don't think it's implemented in any high-performance CPU, for the sorts of reasons I mentioned at the start of this post: it demands tighter feedback from the end of the pipeline to the very start than is easy to implement while keeping everything else as fast as possible (ie low cycle time, shortest possible stall time while waiting for the outcome to be resolved, fastest possible flush after a control misprediction, etc).
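
For completeness, the confidence-tracking part of that is cheap; the expensive part is the feedback path back to fetch. Here is a toy C sketch of a per-branch confidence estimator, in the spirit of the "pipeline gating" work from the academic literature (the table size, counter width, and threshold are all made up):

#include <stdint.h>
#include <stdbool.h>

#define CONF_ENTRIES   4096
#define CONF_MAX         15   /* 4-bit saturating counter */
#define CONF_THRESHOLD    8   /* below this, treat the branch as low confidence */

static uint8_t conf_table[CONF_ENTRIES];

/* Consulted at fetch/predict time: a low-confidence branch could be used as a
   signal to throttle or stall fetch instead of speculating past it. */
static inline bool high_confidence(uint64_t pc)
{
    return conf_table[pc % CONF_ENTRIES] >= CONF_THRESHOLD;
}

/* Called when the branch finally resolves, ie many cycles after the
   prediction was made --- which is precisely the awkward feedback path. */
static inline void update_confidence(uint64_t pc, bool prediction_was_correct)
{
    uint8_t *c = &conf_table[pc % CONF_ENTRIES];
    if (prediction_was_correct) {
        if (*c < CONF_MAX) (*c)++;    /* gain confidence slowly */
    } else {
        *c = 0;                       /* any misprediction resets confidence */
    }
}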