5-6 wide core, why no mention from Intel?

By: David Kanter (dkanter.delete@this.realworldtech.com), October 2, 2015 3:15 pm
Room: Moderated Discussions
Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on October 2, 2015 3:59 pm wrote:
> David Kanter (dkanter.delete@this.realworldtech.com) on October 2, 2015 2:59 pm wrote:
> > Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on October 2, 2015 1:48 pm wrote:
> > > David Kanter (dkanter.delete@this.realworldtech.com) on October 1, 2015 4:49 pm wrote:
> > > > Maynard Handley (name99.delete@this.name99.org) on October 1, 2015 1:57 pm wrote:
> > >
> > > > > Of course Apple [and ARM in general] have the advantage of load-store pair which isn't
> > > > > perfect (eg it's not going to help your vector throughput) but certainly helps in a large
> > > > > set of common cases, and so reduces the pressure to amp up to Intel's 2+1 load/store.)
> > > >
> > > > Uh, LDP and STP mean it's more likely that you want multiple load/store
> > > > units. I'm not 100% sure of the semantics and benefits.
> > >
> > > No - it means you can achieve twice the bandwidth from a single load/store unit, and thus reduce
> > > the need to add another.
> >
> > Can LDP/STP only target integer registers? The load/store units are
> > only 128b wide, so you can't really do a double vector load.
>
> No, LDP/STP support integer, FP and vector registers. You can use LDP and STP even
> if a CPU can't do it in a single cycle. This means it will run faster on some implementations
> while having no negative effect on implementations that can't.

Thanks for clearing that up!
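
To put rough numbers on the bandwidth argument quoted above, here is a toy issue-slot model. It is only a sketch under stated assumptions: one load/store unit, every access hits L1, a "fast" implementation completes a pair in 1 cycle and a "slow" one takes 2. The function and variable names are hypothetical and none of the figures are measurements.

```python
import math

def load_cycles(num_regs, regs_per_instr, cycles_per_instr):
    """Cycles for a single load/store unit to fill num_regs registers.

    Toy model: all accesses hit L1 and instructions issue back to back.
    """
    instrs = math.ceil(num_regs / regs_per_instr)
    return instrs * cycles_per_instr

N = 16  # number of 64-bit registers to load (arbitrary example size)

single_loads = load_cycles(N, regs_per_instr=1, cycles_per_instr=1)  # plain loads
pairs_fast = load_cycles(N, regs_per_instr=2, cycles_per_instr=1)    # pair in 1 cycle (assumed)
pairs_slow = load_cycles(N, regs_per_instr=2, cycles_per_instr=2)    # pair cracked into 2 (assumed)

print(single_loads, pairs_fast, pairs_slow)
```

In this model the pair form halves the cycle count on the fast implementation and merely matches plain loads on the slow one, which is the "faster on some, no worse on others" claim in the quote. It says nothing about how often real code can actually form pairs.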

> > > Many of the workloads in SPEC2006 do simple stride-1 accesses, so
> > > LDP/STP is extremely effective there -
> >
> > What % of loads and stores are pairs?
>
> I don't have detailed figures, but it's surprisingly common in hot loops in many benchmarks.

I'm more curious about something like GCC that at least resembles normal code. Using LDP in Geekbench isn't actually helpful for real code.

> > > besides making function entry/exit and all the memcpy
> > > and string functions efficient of course.
> >
> > Memory and string functions should be using 128b vectors anyway. I see the
> > biggest use case being for function calls and exits. But I could be convinced
> > otherwise with data (or at least a better explanation of how it works).
>
> Vector instructions have much higher latencies than integer ones, and most strings
> are fairly short, so they are not always the best option for string/mem functions.

That's fair. But for a memcpy, I'd expect that vector ops are the way to go if you need to clear a page.

> > > There is also the advantage of fewer instructions
> > > to decode, rename and execute, so even a design with 2 load/store units benefits.
> >
> > Does it truly decode into 1 uop on modern uarchs? I am a bit skeptical, since
> > it targets two registers, and there is the potential for line/page crossing.
>
> Yes, the most common cases do indeed. E.g. Cortex-A57 can execute a long sequence of LDPs
> of 2 64-bit registers at 1 cycle per instruction as long as it hits L1 and doesn't cross
> a cacheline boundary. There aren't even any penalties for unaligned accesses.

Wait. How does the decoder detect common cases? That doesn't make sense.

It sounds like there is some logic in the pipeline for replays.
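
A small sketch of why the common case cannot be picked out at decode: whether a pair access stays inside one cache line or one page depends on the effective address, which only exists after address generation. The 64-byte line and 4 KiB page sizes are assumptions, and the helper names are hypothetical.

```python
LINE_BYTES = 64    # assumed L1 line size
PAGE_BYTES = 4096  # assumed page size

def crosses(addr, size, granule):
    """True if the byte range [addr, addr + size) spans two aligned granules."""
    return addr // granule != (addr + size - 1) // granule

def classify_pair(addr, reg_bytes=8):
    """Classify a pair access of two reg_bytes-wide registers at addr."""
    size = 2 * reg_bytes
    if crosses(addr, size, PAGE_BYTES):
        return "page-split"      # needs two translations
    if crosses(addr, size, LINE_BYTES):
        return "line-split"      # needs two line accesses
    return "single-access"       # the common case

for addr in (0x1000, 0x1004, 0x1038, 0x1FF8):
    print(hex(addr), classify_pair(addr))
```

The same instruction encoding lands in all three classes depending on the base register value, so any special handling has to happen at or after the address stage, for example by replaying or splitting the access there.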

> > > > For example, what happens if the pair loads target different pages?
> > > > You'd need to do two separate translations through the TLB.
> > >
> > > Then the same thing happens as for any other load or store that crosses a page
> > > or cacheline boundary. If you used 2 separate loads, you now have 3 accesses
> > > for the split case rather than 2, so a wider load is always cheaper.
> >
> > My point is that handling this as two uops instead of one
> > is much easier. I suspect the semantics for LDP/STP
> > specifically do not guarantee atomic behavior, which makes me suspect it decodes into two uops.
>
> An implementation is allowed to decode into 2 uops indeed, but it would be a bad idea
> to split 32-bit or 64-bit LDP/STP if you can already deal with 128-bit Q registers.

Not all implementations have a full-width 128b datapath from the L1D.
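
For the split-case arithmetic quoted earlier (3 accesses with two separate loads versus 2 with one wider load), here is a sketch that simply counts cache lines touched. It assumes 64-byte lines and 8-byte registers, uses hypothetical helper names, and models only line accesses, not uop counts or datapath width.

```python
LINE_BYTES = 64  # assumed L1 line size

def line_accesses(addr, size):
    """Number of distinct cache lines touched by [addr, addr + size)."""
    return (addr + size - 1) // LINE_BYTES - addr // LINE_BYTES + 1

def separate_loads(addr, reg_bytes=8):
    """Line accesses for two back-to-back single-register loads."""
    return line_accesses(addr, reg_bytes) + line_accesses(addr + reg_bytes, reg_bytes)

def paired_load(addr, reg_bytes=8):
    """Line accesses for one load covering both registers."""
    return line_accesses(addr, 2 * reg_bytes)

for addr in (0, 52, 56):
    print(addr, separate_loads(addr), paired_load(addr))
```

At offset 52 the second 8-byte load itself straddles the line boundary, giving 3 accesses for the separate loads against 2 for the pair. At offset 56 the boundary falls exactly between the two registers and both forms need 2. In this model the pair is never worse, though it is not always strictly better.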

David