5-6 wide core, why no mention from Intel?

By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), October 2, 2015 4:06 pm
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on October 2, 2015 4:15 pm wrote:
> Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on October 2, 2015 3:59 pm wrote:
> > David Kanter (dkanter.delete@this.realworldtech.com) on October 2, 2015 2:59 pm wrote:
> > > Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on October 2, 2015 1:48 pm wrote:
> > > > David Kanter (dkanter.delete@this.realworldtech.com) on October 1, 2015 4:49 pm wrote:
> > > > > Maynard Handley (name99.delete@this.name99.org) on October 1, 2015 1:57 pm wrote:
> > > >
> > > > > > Of course Apple [and ARM in general] have the advantage of load-store pair which isn't
> > > > > > perfect (eg it's not going to help your vector throughput) but certainly helps in a large
> > > > > > set of common cases, and so reduces the pressure to amp up to Intel's 2+1 load/store.)
> > > > >
> > > > > Uh, LDP and STP mean it's more likely that you want multiple load/store
> > > > > units. I'm not 100% sure of the semantics and benefits.
> > > >
> > > > No - it means you can achieve twice the bandwidth from a single load/store unit, and thus reduce
> > > > the need to add another.
> > >
> > > Can LDP/STP only target integer registers? The load/store units are
> > > only 128b wide, so you can't really do a double vector load.
> >
> > No, LDP/STP support integer, FP and vector registers. You can use LDP and STP even
> > if a CPU can't do it in a single cycle. This means it will run faster on some implementations
> > while having no negative effect on implementations that can't.
>
> Thanks for clearing that up!
>
> > > > Many of the workloads in SPEC2006 do simple stride-1 accesses, so
> > > > LDP/STP is extremely effective there -
> > >
> > > What % of loads and stores are pairs?
> >
> > I don't have detailed figures, but it's surprisingly common in hot loops in many benchmarks.
>
> I'm more curious about something like GCC that at least resembles normal
> code. Using LDP in Geekbench isn't actually helpful for real code.

GCC makes a lot of function calls. I'm not sure whether any performance counters distinguish LDP from ordinary loads, but a static count should give a reasonable idea anyway, given that GCC is not loop-heavy.
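
As a rough illustration of such a static count, here is a minimal Python sketch that tallies pair versus single load/store mnemonics in AArch64 disassembly text. The line format (objdump-style "address: encoding mnemonic operands"), the function name count_mem_ops and the choice of which mnemonics count as "single" are assumptions for illustration. Byte and halfword variants are ignored, and a static count says nothing about how often each instruction executes.

```python
import re
from collections import Counter

# Mnemonic groups are an illustrative assumption, not a complete list.
PAIR = {"ldp", "stp"}
SINGLE = {"ldr", "str", "ldur", "stur"}

# Assumed objdump-style line: "  400580:  a9bf7bfd  stp x29, x30, [sp, #-16]!"
LINE = re.compile(r"^\s*[0-9a-f]+:\s+[0-9a-f]{8}\s+([a-z0-9.]+)")

def count_mem_ops(disassembly: str) -> Counter:
    """Count pair and single load/store instructions in disassembly text."""
    counts = Counter()
    for line in disassembly.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # labels, blank lines, section headers
        mnemonic = match.group(1)
        if mnemonic in PAIR:
            counts["pair"] += 1
        elif mnemonic in SINGLE:
            counts["single"] += 1
    return counts

# Hand-written sample resembling a typical function entry/exit.
sample = """\
  400580:  a9bf7bfd  stp x29, x30, [sp, #-16]!
  400584:  f9400001  ldr x1, [x0]
  400588:  91000400  add x0, x0, #0x1
  40058c:  a8c17bfd  ldp x29, x30, [sp], #16
  400590:  d65f03c0  ret
"""

print(count_mem_ops(sample))
```

Feeding it the disassembly of a real binary would give the static ratio of pairs to single accesses.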

> > > > besides making function entry/exit and all the memcpy
> > > > and string functions efficient of course.
> > >
> > > Memory and string functions should be using 128b vectors anyway. I see the
> > > biggest use case being for function calls and exits. But I could be convinced
> > > otherwise with data (or at least a better explanation of how it works).
> >
> > Vector instructions have much higher latencies than integer ones, and most strings
> > are fairly short, so they are not always the best option for string/mem functions.
>
> That's fair. But for a memcopy, I'd expect that vector ops are the way to go if you need to clear a page.

For clearing there is a dedicated instruction (DC ZVA on AArch64): current cores clear 64-128 bytes per instruction, as fast as the L1 cache can write back into L2.
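
On AArch64 the zeroing instruction is DC ZVA, and the block size it clears is reported in the DCZID_EL0 register. A minimal Python sketch of decoding that register value follows. The function name is hypothetical, and the register value is passed in as a plain integer because Python cannot read it directly.

```python
from typing import Optional

def dc_zva_block_bytes(dczid_el0: int) -> Optional[int]:
    """Return the bytes zeroed by one DC ZVA, or None if it is prohibited."""
    if dczid_el0 & 0x10:   # DZP, bit 4: DC ZVA must not be used
        return None
    bs = dczid_el0 & 0xF   # BS, bits 3:0: log2 of the block size in 4-byte words
    return 4 << bs

print(dc_zva_block_bytes(4))  # BS = 4
print(dc_zva_block_bytes(5))  # BS = 5
```

BS values of 4 and 5 correspond to the 64-byte and 128-byte blocks mentioned above.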

> > > > There is also the advantage of fewer instructions
> > > > to decode, rename and execute, so even a design with 2 load/store units benefits.
> > >
> > > Does it truly decode into 1 uop on modern uarchs? I am a bit skeptical, since
> > > it targets two registers, and there is the potential for line/page crossing.
> >
> > Yes the most common cases do indeed. Eg. Cortex-A57 can execute a long sequence of LDP's
> > of 2 64-bit registers at 1 cycle per instruction as long as it hits L1 and doesn't cross
> > a cacheline boundary. There aren't even any penalties for unaligned accesses.
>
> Wait. How does the decoder detect common cases? That doesn't make sense.

The decoder can decide whether to split based on static properties of the instruction: writeback and/or the width and type of the access. LDP of two 32-bit or 64-bit registers is the most common case, so Cortex-A57 executes those LDPs in 1 cycle. A 128-bit LDP takes 2 cycles (the manual doesn't say whether 1 or 2 micro-ops are used).

> It sounds like there is some logic in the pipeline for replays.

Every OoO core must support replays to deal with L1 and TLB misses. Doing a pipeline flush would be insane.

> > > > > For example, what happens if the pair loads target different pages?
> > > > > You'd need to do two separate translations through the TLB.
> > > >
> > > > Then the same thing happens as for any other load or store that crosses a page
> > > > or cacheline boundary. If you used 2 separate loads, you now have 3 accesses
> > > > for the split case rather than 2, so a wider load is always cheaper.
> > >
> > > My point is that handling this as two uops instead of one
> > > is much easier. I suspect the semantics for LDP/STP
> > > specifically do not guarantee atomic behavior, which makes me suspect it decodes into two uops.
> >
> > An implementation is allowed to decode into 2 uops indeed, but it would be a bad idea
> > to split 32-bit or 64-bit LDP/STP if you can already deal with 128-bit Q registers.
>
> Not all implementations have full width 128b datapath from the L1D.

Indeed, low-end implementations won't, but they are also less likely to be OoO.
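
The "3 accesses rather than 2" argument quoted above can be sketched with a simple model in Python. It assumes 64-byte cache lines and that each line touched by a load costs one access. Real cores differ in how they handle splits, and the function names are only for illustration.

```python
LINE_BYTES = 64  # assumed cache line size

def lines_touched(addr: int, size: int, line: int = LINE_BYTES) -> int:
    """Number of cache lines covered by an access of `size` bytes at `addr`."""
    first = addr // line
    last = (addr + size - 1) // line
    return last - first + 1

def pair_accesses(addr: int, reg_bytes: int = 8) -> int:
    """One LDP loading two registers as a single 2*reg_bytes access."""
    return lines_touched(addr, 2 * reg_bytes)

def separate_accesses(addr: int, reg_bytes: int = 8) -> int:
    """Two independent loads of reg_bytes each at consecutive addresses."""
    return (lines_touched(addr, reg_bytes)
            + lines_touched(addr + reg_bytes, reg_bytes))

# Address 60: the first 8-byte load straddles the boundary at byte 64.
print(pair_accesses(60), separate_accesses(60))
# Address 56: the boundary falls exactly between the two registers.
print(pair_accesses(56), separate_accesses(56))
```

In this model the pair never needs more accesses than the two separate loads, and it needs fewer whenever neither load straddles a line boundary or one of them does.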

Wilco