Branch/jump target prediction

By: Maynard Handley, August 14, 2016 6:47 am
Room: Moderated Discussions
⚛ on August 14, 2016 5:24 am wrote:
> Linus Torvalds on August 10, 2016 5:14 pm wrote:
> > Megol on August 10, 2016 3:25 pm wrote:
> > >
> > > Can't argue against that, however the problems were mostly elsewhere.
> >
> > I agree that a lot of P4 weaknesses were exacerbated by other issues, and that
> > the legacy decoders were too weak. But the legacy decoders were too weak partly
> > because people had thought that trace caches were a good idea. They aren't.
> In terms of code specialization according to data, P4 trace cache
> is too primitive to provide measurable performance gains.
> > > > And it has almost nothing in common with the crap that was the P4 trace cache.
> If Intel implemented "P4 trace cache" correctly there would be less need for PyPy.
> Intel didn't demonstrate 2x speedups even for trivial Python computational loops and was
> (is) totally incapable of realizing that this *is* what they should be targeting.
> Transmeta also didn't demonstrate 2x speedups even for trivial Python computational loops. If
> you think this is a personal attack on your ability to grasp what should have been done while
> you were at Transmeta to make 2x speedup happen at *least* for Python code, then yes, it certainly
> is. And given what you are writing around here, you still aren't getting it.
> > > So why mention it?
> >
> > .. because the predecode cache is the correct way to do this, and makes the trace cache pointless.
> uop cache in Skylake is still too primitive to speedup [anything
> having obvious computational redundancy] by a large margin.
> > So the BSD is very much relevant to the discussion - as a "look, here's something that actually
> > works better, and that Intel does that largely replaces the broken trace cache".
> >
> > > What kind of workloads do you run where instruction cache coherency is problematic?
> >
> > Umm. Like almost all of them?
> >
> > Do you realize how bad the P4 was at coherency? To the point that compiler-generated
> > code that didn't actually do self-modifying things at all had huge problems,
> > just because the coherence "solution" that Intel picked sucked.
> Yes, in hindsight, Intel didn't know what they were doing around the time they designed
> P4 trace cache. They did correctly guess the general direction though.
> In hindsight we can see the errors made more easily, so I am not claiming
> I would be able to provide a better P4 at that time in real-time.
> > Yeah, it's less of an issue on architectures that don't actually need coherency in the first place,
> > but that wasn't what was discussed. What was claimed was that the P4 trace cache was "awesome".
> >
> > It really really wasn't.
> >
> > > Really...
> >
> > Really. Trust me. Compilers had to be changed because of it.
> >
> > Yes, you can argue that that was due to another bad implementation
> > issue, but the oddity comes almost directly
> > from the fact that coherence gets more complicated, so then you do odd/bad things to simplify the problems.
> >
> > So the coherency issues were pretty much caused by the trace cache. The
> > fact is, trace caches need more care and complexity in this area.
> >
> > > Most branches _are_ very predictable, for those that aren't -> don't create a trace.
> > >Fixed.
> >
> > Bullshit.
> >
> > You don't know which branches are predictable to begin with.
> You know which branches are predictable from the programmer and/or from the previous
> runs. You "just" need to save the information and reuse it correctly.
> The fact is x86 ISA has no programmer-visible mechanism for saving such information.
> > Also, even the "very
> > predictable" ones tend to be about 99%, which isn't actually that predictable
> > after all - it causes problems when you end up having code overlap anyway.
> >
> > And btw, those benchmarks that show how predictable branches are? Yeah, they
> > aren't really all that indicative of real code that people actually run.
> >
> > It all boils down to the fact that you basically need to have the non-trace-cache case execute pretty
> > much as quickly as the trace case, and the whole trace cache ends up being a lot of complexity for
> > very little advantage. You can't actually try to skimp on the "legacy" decoders after all.
> It is *impossible* for the 1st-run case to execute as quickly as the subsequent-runs case.
> It follows from data compression and the equivalence between code and data that it
> is impossible for the 1st-run case to execute as quickly as the subsequent-runs case.
> (I know you still aren't getting any closer to understanding this - but that won't
> stop you from mentioning "masturbation over academic papers" in your response.)
> If there isn't at least a 2x performance difference between executing the
> 1st-run case and the subsequent-runs case then it's simply the wrong way.
> > So you're much better off doing just a L0 I$ predecoded cache on an
> > instruction boundary level, and forget entirely about the traces.
> That's not even scratching the surface.

I like beating up on Linus and/or x86 as much as the next person, but this rant is beyond the pale. You allude to a whole bunch of issues (the value of trace caches, the existence of instructions that persist profiling data IN the CPU, etc.) that have no existence whatever in the commercial world, and expect your sneering to be good enough for us to believe you.
Extraordinary claims require extraordinary evidence, meaning that if you're going to claim that the entire mainstream world (i.e. Intel, all of ARM, IBM, Oracle, etc.) is too stupid to see the value of trace caches done right, you have an obligation to provide us with at least some evidence that you are correct.

To take just one issue below, given the memory wall, does it MATTER that Intel cannot fetch and decode 8 instructions per cycle? Everything seems to indicate that until someone builds a functional kilo-instruction processor, we're close, at around 2/cycle, to the maximum feasible IPC (so basically, to a rough order of magnitude, 4 ops/cycle on the 50% compute cycles, zero on the 50% wait cycles).
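
For anyone who wants the back-of-envelope arithmetic behind that 2/cycle figure spelled out, here is a minimal sketch. It assumes the crude two-state model from the paragraph above (the core either issues at its peak rate or sits waiting on memory), and the function name average_ipc is just something made up for illustration, not anything from a real simulator.

```python
def average_ipc(peak_ipc: float, compute_fraction: float) -> float:
    """Average IPC under a crude two-state model (an assumption, not a
    simulation): the core issues peak_ipc ops/cycle during the compute
    fraction of cycles and 0 ops/cycle while waiting on memory."""
    if not 0.0 <= compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be in [0, 1]")
    return peak_ipc * compute_fraction


# Figures from the post: 4 ops/cycle on 50% of cycles, zero on the rest.
print(average_ipc(4, 0.5))   # 2.0

# Hypothetical what-if (not from the post): same peak rate, but the core
# waits on memory for 75% of cycles instead of 50%.
print(average_ipc(4, 0.25))  # 1.0
```

In this toy model the average is just the peak rate scaled by the fraction of cycles not spent waiting, which is why the wait cycles matter at least as much as the peak issue rate.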

One reason the A10 (and A11) are so interesting is precisely to see how much better Apple can get with IPC than Intel, given a nicer ISA and a more precisely defined target... Will they hit the same pathetic few percent a year now, or will they be first to a KIP? (Or are all the simulations that say IPC maxes out around 2 for non KIP processors wrong?)

> For crying out loud, are you truly incapable of understanding the unfortunate significance of
> Skylake not being able to do 8 L1 cache reads per cycle per core even in the *best case* ?!?
> -Atom
> > > Perhaps you should look up what a trace cache is before stating things like that?
> >
> > Yeah, let's just imagine that I worked for a company that did very
> > similar things and actually generated traces on real loads.
> >
> > In other words, I haven't just masturbated over academic
> > papers like you apparently do. I do know how they work.
> >
> > They suck.
> >
> > Linus
