Branch/jump target prediction

By: David Kanter (dkanter.delete@this.realworldtech.com), August 14, 2016 7:13 am
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on August 14, 2016 7:47 am wrote:
> ⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on August 14, 2016 5:24 am wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 10, 2016 5:14 pm wrote:
> > > Megol (golem960.delete@this.gmail.com) on August 10, 2016 3:25 pm wrote:
> > > >
> > > > Can't argue against that, however the problems were mostly elsewhere.
> > >
> > > I agree that a lot of P4 weaknesses were exacerbated by other issues, and that
> > > the legacy decoders were too weak. But the legacy decoders were too weak partly
> > > because people had thought that trace caches were a good idea. They aren't.
> >
> > In terms of code specialization according to data, P4 trace cache
> > is too primitive to provide measurable performance gains.
> >
> > > > > And it has almost nothing in common with the crap that was the P4 trace cache.
> >
> > If Intel implemented "P4 trace cache" correctly there would be less need for PyPy.
> >
> > Intel didn't demonstrate 2x speedups even for trivial Python computational loops and was
> > (is) totally incapable of realizing that this *is* what they should be targeting.
> >
> > Transmeta also didn't demonstrate 2x speedups even for trivial Python computational loops. If
> > you think this is a personal attack on your ability to grasp what should have been done while
> > you were at Transmeta to make 2x speedup happen at *least* for Python code, then yes, it certainly
> > is. And given what you are writing around here, you still aren't getting it.
> >
> > > > So why mention it?
> > >
> > > .. because the predecode cache is the correct way to do this, and makes the trace cache pointless.
> >
> > uop cache in Skylake is still too primitive to speed up [anything
> > having obvious computational redundancy] by a large margin.
> >
> > > So the BSD is very much relevant to the discussion - as a "look, here's something that actually
> > > works better, and that Intel does that largely replaces the broken trace cache".
> > >
> > > > What kind of workloads do you run where instruction cache coherency is problematic?
> > >
> > > Umm. Like almost all of them?
> > >
> > > Do you realize how bad the P4 was at coherency? To the point that compiler-generated
> > > code that didn't actually do self-modifying things at all had huge problems,
> > > just because the coherence "solution" that Intel picked sucked.
> >
> > Yes, in hindsight, Intel didn't know what they were doing around the time they designed
> > P4 trace cache. They did correctly guess the general direction though.
> >
> > In hindsight we can see the errors made more easily, so I am not claiming
> > I would be able to provide a better P4 at that time in real-time.
> >
> > > Yeah, it's less of an issue on architectures that don't actually need coherency in the first place,
> > > but that wasn't what was discussed. What was claimed was that the P4 trace cache was "awesome".
> > >
> > > It really really wasn't.
> > >
> > > > Really...
> > >
> > > Really. Trust me. Compilers had to be changed because of it.
> > >
> > > Yes, you can argue that that was due to another bad implementation
> > > issue, but the oddity comes almost directly
> > > from the fact that coherence gets more complicated, so then you do odd/bad things to simplify the problems.
> > >
> > > So the coherency issues were pretty much caused by the trace cache. The
> > > fact is, trace caches need more care and complexity in this area.
> > >
> > > > Most branches _are_ very predictable, for those that aren't -> don't create a trace.
> > > > > Fixed.
> > >
> > > Bullshit.
> > >
> > > You don't know which branches are predictable to begin with.
> >
> > You know which branches are predictable from the programmer and/or from the previous
> > runs. You "just" need to save the information and reuse it correctly.
> >
> > The fact is x86 ISA has no programmer-visible mechanism for saving such information.
> >
> > > Also, even the "very
> > > predictable" ones tend to be about 99%, which isn't actually that predictable
> > > after all - it causes problems when you end up having code overlap anyway.
> > >
> > > And btw, those benchmarks that show how predictable branches are? Yeah, they
> > > aren't really all that indicative of real code that people actually run.
> > >
> > > It all boils down to the fact that you basically need to have the non-trace-cache case execute pretty
> > > much as quickly as the trace case, and the whole trace cache ends up being a lot of complexity for
> > > very little advantage. You can't actually try to skimp on the "legacy" decoders after all.
> >
> > It is *impossible* for the 1st-run case to execute as quickly as the subsequent-runs case.
> >
> > It follows from data compression and the equivalence between code and data that it
> > is impossible for the 1st-run case to execute as quickly as the subsequent-runs case.
> > (I know you still aren't getting any closer to understanding this - but that won't
> > stop you from mentioning "masturbation over academic papers" in your response.)
> >
> > If there isn't at least a 2x performance difference between executing the
> > 1st-run case and the subsequent-runs case then it's simply the wrong way.
> >
> > > So you're much better off doing just a L0 I$ predecoded cache on an
> > > instruction boundary level, and forget entirely about the traces.
> >
> > That's not even scratching the surface.
>
>
> I like beating up on Linus and/or x86 as much as the next person, but this rant is beyond
> the pale. You allude to a whole bunch of issues (value of trace cache, existence of instructions
> that persist profiling data IN the CPU, etc) that have no existence whatever in the commercial
> world, and expect your sneering to be good enough for us to believe you.

He's hardly crazy. JITs do a lot of this stuff.

> Extraordinary claims require extraordinary evidence, meaning that if you're going to claim that the entire
> mainstream world (i.e. Intel, all of ARM, IBM, Oracle, etc) is too stupid to see the value of trace caches
> done right, you have an obligation to provide us with at least some evidence that you are correct.

> To take just one issue below, given the memory wall, does it MATTER that Intel cannot fetch and decode
> 8 instructions per cycle? Everything seems to indicate that until someone builds a functional kilo-instruction
> processor, we're close, at around 2/cycle, to the maximum feasible IPC (so basically, rough order of
> magnitude, 4 ops/cycle on the 50% compute cycles, zero on the 50% wait cycles).

Actually, it seems pretty clear from Intel's designs that you can hit ~8-10 uops/cycle peak. But obviously the challenge is improving all those 0s.
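
To put rough numbers on that, here is a minimal sketch of the arithmetic in the quoted estimate. The 4 ops/cycle and 50% figures are the order-of-magnitude guesses from the quote, the 8 is the low end of the peak mentioned above, and the function name and the 75% case are purely illustrative assumptions, not measurements of any real core.

```python
def average_ipc(busy_ipc, busy_fraction):
    """Average ops/cycle for a core that retires busy_ipc ops on busy
    cycles and nothing at all on stalled (waiting) cycles."""
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    return busy_ipc * busy_fraction


# The quoted estimate: 4 ops/cycle on 50% of cycles, zero on the rest.
baseline = average_ipc(4, 0.5)

# Doubling the busy-cycle rate while stalling just as often.
wider_core = average_ipc(8, 0.5)

# Keeping 4 ops/cycle but cutting stalls from 50% to 25% of cycles
# (hypothetical figure, chosen only to show the effect of fewer zeros).
fewer_stalls = average_ipc(4, 0.75)

print(baseline, wider_core, fewer_stalls)
```

In this toy model the stalled cycles contribute nothing no matter how wide the machine is, which is why the zeros are the part worth attacking.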

> One reason the A10 (and A11) are so interesting is precisely to see how much better Apple
> can get with IPC than Intel, given a nicer ISA and a more precisely defined target...

ISA makes no difference; specifically targeting mobile makes a big difference. They can also play around with the memory/storage hierarchy more. As you (or others) are wont to remind us, moving from HDD to SSD is a huge performance improvement, bigger than any recent CPU upgrade.

David