A76

By: Maynard Handley (name99.delete@this.name99.org), August 18, 2018 4:22 pm
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on August 18, 2018 2:42 pm wrote:
> Doug S (foo.delete@this.bar.bar) on August 18, 2018 1:43 pm wrote:
> > AM (myname4rwt.delete@this.jee-male.com) on August 18, 2018 1:06 pm wrote:
> > > > > What they promised was 40% higher perf in the same power envelope as A75. And that for A76 on 7nm vs A75
> > > > > on 10nm.
> > > >
> > > > Considering that the difference between 7nm and 10nm is
> > > > less than 40% it does imply improved uArch efficiency.
> > >
> > > This is not true. E.g. TSMC has exactly 40% power reduction in its 7nm tech vs 10nm.
> >
> >
> > TSMC promising 40% power reduction in 7nm versus 10nm does
> > not imply anything about performance. They say "20%
> > better performance, 40% better power" but those types of
> > things have always been one or the other in the past.
> > i.e. you can either get the same performance at 40% less power or 20% more performance at the same power.
> >
> > If ARM is promising 40% more performance, and get 20% from
> > the process due to faster transistor switching speeds,
> > they have to get the other 20% from greater IPC, assuming they aren't also increasing the power budget.
> >
> > Of course, it would be helpful to know what ARM uses to do its
> > measurements. It is possible to achieve huge improvements
> > in SIMD heavy code by allowing the issue/retirement of more
> > SIMD instructions per cycle. If you use (or assume)
> > faster or lower latency RAM, you can get big improvements
> > in FP. Getting improvements in twisty branchy integer
> > code is the most difficult since it isn't just one change
> > but a lot of them to get there. But presumably that's
> > what they are targeting since SIMD and FP are less important on phones than integer code.
> >
> > There's obviously ample room for ARM to make such improvements. Looking at the Galaxy S9 version using the
> > SD845 w/A75 @ 2.8 GHz versus the iPhone X using the A11 @ 2.39 GHz, despite the 17% clock rate handicap the
> > iPhone is ~70% faster on GB4 integer, and that holds even
> > if you focus in on the more twisty branchy subtests
> > like LLVM or HTML5 DOM. So basically double the IPC. What's
> > surprising is not that ARM is claiming a 20% improvement,
> > and further improvements in the future. What's surprising is that they fell that far behind.
> >
> > So I don't think there's any reason to be skeptical about ARM getting a 20% improvement here, though
> > that still leaves them at minimum 50% behind Apple - since Apple will get the same 20% improvement from
> > TSMC, and presumably makes further IPC improvements in the A12. Very significant improvements, if you
> > believe Charlie's claims about the A12 of 50% greater single thread performance... I'm kinda skeptical
> > of that myself, but I guess we'll see next month. Almost
> > makes me wonder if the numbers he saw were benchmarks
> > of an A12X going into an iPad Pro, or possibly even a prototype of something targeted at laptops. 50%
> > higher when they are only getting 20% from process would mean another huge IPC gain.
>
> Is the 50% outrageous? Well who knows. But
> - if you like to go by pattern matching (insofar as that describes what
> different teams are doing and their schedules) we had something like
> - A6 was "safe"
> - A7 was great leap forward
> - A8 was safe tweaks+improvements
> - A9 was great leap forward
> - A10, A11 were safe tweaks and improvements
> - meaning if there are multiple teams, one of which is the "superstar team"
> tasked with great leaps forward, it might be time for their next product?
>
> As for whether it's possible, I'd argue it absolutely is. One problem, of course, is that
> we don't know what Apple has ALREADY implemented to get them to 50% higher IPC than Intel,
> but the copious pool of KIP enhancements seems so far to be only lightly tapped (and most
> of these ideas are orthogonal and can be implemented one at a time, not all at once).
> So there's scope for a lot less dead time waiting on data.
>
> For waiting on instructions, we've known for years that you can reduce I-cache waits to
> basically nothing (except for particular types of server code) through things like
> - ONLY train the pre-fetchers on RETIRED "app" instructions. Don't allow interrupts, or mispredicted instructions,
> to pollute your prefetchers (and ideally do something like mark those lines as LRU rather than MRU)
> - code in L2 is more valuable than data in L2, so give it a better chance of surviving there, through various
> means (this can be through placement/replacement priorities, or explicitly flagging it as I-lines)
>
> Finally for actual execution, there's scope for a nice win (for many purposes) by aggressive use of
> - ALU instructions that feed into the same variable (something like R3=R4+R5 AND R6)
> - compiled to use the same intermediate register and placed back to back so
> R3= R5 & R6
> R3= R3 + R4
> - fused in decode
> - sent through a double-pumped ALU that can do the expected "easy"
> ops (+,-, logicals, MAYBE small shifts?) in a half cycle à la P4
> This doesn't cover everything, obviously, but it gets a fraction of your code resolving
> dependencies at twice the nominal frequency, without much real cost.
> Of course for optimal value this means you need to expand decode proper maybe from 6 to 8 instructions
> per cycle, and you need an extremely aggressive fetch system; but both of these are feasible with
> every instruction as 4bytes wide and no need to care about ARMv7 and THUMB any more.
>
> There's also a whole bunch of interesting work on criticality analysis that basically allows
> you to get all the benefit of OoO and speculation where it's valuable, while not paying
> that cost (in power and in misspeculation+recovery) for the (surprisingly common) case of
> non-critical instructions. This looks to me like very useful work that's not been appreciated
> enough in academia, and that has not (yet...?) had commercial implementation.
>
> There's also a bunch of things that (as far as we know) Apple doesn't yet do but are clearly feasible soon
> (if not this year then 2019 or 2020). These include things that just improve at the system level (hardware
> compressed pages), or remote atomics (to speed up common atomic operations like ARC increments);
> that make the L3 cache effectively larger (compressed lines in L3 and/or in [non compressed page] RAM);
> or an effectively very large LLC (maybe an L3, maybe a "before RAM" "L4") implemented
> not in eDRAM the way these things used to be, but in STT MRAM.
>
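
A quick sanity check of the arithmetic in the quoted text, as a rough sketch. The clock speeds, the ~70% GB4 integer gap, and the 40%/20% figures are taken from the quotes above; the variable names are just for illustration, and the "compounding" reading of the process gain is an assumption (the quoted post adds the percentages instead).

```python
# Figures quoted above: S9 (SD845, A75) at 2.8 GHz, iPhone X (A11) at 2.39 GHz,
# with the A11 roughly 70% faster on GB4 integer.
a75_ghz = 2.8
a11_ghz = 2.39
a11_speedup = 1.70

clock_ratio = a75_ghz / a11_ghz         # how much faster the A75 is clocked
ipc_ratio = a11_speedup * clock_ratio   # per-clock throughput, A11 relative to A75

# ARM's claim: 40% more performance, of which ~20% comes from the process.
# Assumption: the gains multiply rather than add, so the IPC share is 1.40 / 1.20.
ipc_needed = 1.40 / 1.20

print(f"clock ratio {clock_ratio:.2f}, IPC ratio {ipc_ratio:.2f}")
print(f"IPC gain needed for A76 if gains compound: {ipc_needed:.3f}")
```

So "basically double the IPC" holds up (the ratio comes out just under 2), and if the process and IPC gains compound, the IPC share of the promised 40% is closer to 17% than 20%.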

Oops, left out one more idea. Once you have the machinery to handle remote atomics AND you own the entire stack, you can do some crazy things. For example, consider the problem of zeroing an entire page. Obviously doing this on the main CPU sucks, but doing it on a secondary thread (and thus dealing with the machinery of callbacks and waits) also sucks. But what about this?
- you extend the cache protocol to include invalidate messages that describe not just a single line but an address plus a count of lines (say up to 256 or so; whatever is necessary to cover a page seems about right)
- you offer an atomic remote zero (or remote fill) instruction that is given a line address plus count. The first thing it does is flush those lines from all other caches; then it marks the set of lines locally as being in some sort of intermediate state; then, as fast as possible, it starts zeroing each line, and each line, once zeroed, goes back to a standard "valid" state.

The beauty of a scheme like this is that you get the work offloaded from the main CPU, but you don't have to worry about callbacks or exactly when the work is finished; standard coherence protocols will handle all that for you.
The idea could trivially be extended from zeroing/filling up to page-sized copying. And if you own the hardware, the compiler, the libs, and the OS, you could have it delivering value on day one.
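
To make the scheme above a bit more concrete, here is a toy software model of it. Everything in it is hypothetical: the class names, the `PENDING_ZERO` state, the 64-byte line size, and the choice to put the zeroing engine in the last-level cache are assumptions for illustration, not a description of any real coherence protocol. It only shows the shape of the idea: one ranged invalidate per cache, an intermediate state, and readers that never need a callback.

```python
from enum import Enum

LINE_BYTES = 64   # assumed cache line size
PAGE_LINES = 64   # assumed 4 KiB page / 64 B lines


class State(Enum):
    VALID = "valid"
    PENDING_ZERO = "pending_zero"   # hypothetical intermediate state


class CoreCache:
    """Private cache of one core: maps line index -> line contents."""

    def __init__(self):
        self.lines = {}

    def invalidate_range(self, base, count):
        # A single message covers `count` consecutive lines.
        for idx in range(base, base + count):
            self.lines.pop(idx, None)


class LastLevelCache:
    """Toy LLC that also runs the background zeroing engine."""

    def __init__(self, cores, n_lines):
        self.cores = cores
        self.data = {i: b"\xaa" * LINE_BYTES for i in range(n_lines)}
        self.state = {i: State.VALID for i in range(n_lines)}
        self.messages = 0   # ranged invalidate messages sent

    def remote_zero(self, base, count):
        """Start zeroing `count` lines from `base`; returns without waiting."""
        for core in self.cores:
            core.invalidate_range(base, count)
            self.messages += 1
        for idx in range(base, base + count):
            self.state[idx] = State.PENDING_ZERO

    def _finish(self, idx):
        self.data[idx] = bytes(LINE_BYTES)
        self.state[idx] = State.VALID

    def step(self, budget=1):
        """Background engine: zero up to `budget` pending lines."""
        pending = [i for i, s in self.state.items() if s is State.PENDING_ZERO]
        for idx in pending[:budget]:
            self._finish(idx)

    def read(self, core, idx):
        # A read that hits a pending line completes that line first, so the
        # requester just sees an ordinary miss and no callback is needed.
        if self.state[idx] is State.PENDING_ZERO:
            self._finish(idx)
        core.lines[idx] = self.data[idx]
        return self.data[idx]


c0, c1 = CoreCache(), CoreCache()
llc = LastLevelCache([c0, c1], n_lines=2 * PAGE_LINES)
llc.read(c1, 3)                   # c1 holds a stale copy of line 3
llc.remote_zero(0, PAGE_LINES)    # issuing core continues immediately
first = llc.read(c0, 5)           # early reader still sees zeros
llc.step(budget=PAGE_LINES)       # engine drains the rest in the background
```

The point of the model is that the caller of `remote_zero` never waits and never registers a completion handler; anyone who touches a line early is simply served through the normal miss path.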

People like Onur Mutlu have suggested doing stuff like this in DRAM, and there is absolutely value in doing it there (performance and power) BUT that gets you into the callback API issues again.
Having it (conceptually) deferred to the LLC splits the difference nicely; and it's certainly possible in principle to have the LLC do the coherency part of the job, as I described above, while sending a message to DRAM to have the actual flood fill/copying done in RAM...