Interesting comment about rep instructions & code size

By: Travis Downs (travis.downs.delete@this.gmail.com), January 16, 2020 2:26 pm
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on January 15, 2020 6:24 pm wrote:
> > > Maybe the cost of saving icache misses and branch mispredicts would be worth it?
> >
> > It could be, for some stuff (especially at Google where they are famous for jumping
> > through hoops to reduce icache misses due to relatively large code sizes).
> >
> > It will never always be a good tradeoff, i.e., regardless of your
> > icache pressure, because the small size behavior is too poor.
> >
> > Note also that there are some really ridiculous memcpy and memcmp (especially) implementations,
> > like thousands of instructions, so if you're comparing it to that, then yeah - but if
> > you compare it to a good size-and-performance-optimized implementation, which would
> > be maybe one or two dozen instructions for the small cases, then maybe not.
>
> Small size behavior's gonna be bad anyway if a call to memcpy incurs
> an icache miss and maybe a couple branch mispredicts afterward.

Eh, sure - but you can say that for any X:

> Small size behavior's gonna be bad anyway if a call to X incurs
> an icache miss and maybe a couple branch mispredicts afterward.

Why would you assume that this would be the case? We are of course also interested in cases where you don't incur an icache miss and branch mispredicts afterwards, and of course these cases will be very common. In particular, in cases where memcpy is a significant portion of the runtime (where its performance matters relatively more), we don't expect either of those things to be common (the mispredict *inside* memcpy is of course sometimes unavoidable if the length isn't predictable, even after "quantization").

>
> Would you happen to know the rep movsb startup time on modern architectures like Skylake?

Back to back, it looks like about 24 cycles for a 1-byte copy on Skylake, according to uops.info. It takes 67 uops, so it's doing a lot of work. 0-byte copies take fewer uops but are about twice as slow, which is kind of weird: you don't intentionally copy 0 bytes, but of course it comes up in practice all the time with dynamic lengths.

Ice Lake is much better.

>
> > > > I find this quote a bit enigmatic: so are they leveraging the rep move instructions,
> > > > in which case the implementation is very simple, or are they doing what they said elsewhere,
> > > > which is using compile-time selected instructions to implement a compact memcopy.
> > > >
> > > > I am surprised they said PLT-based dispatch isn't efficient: as far as I can tell, it's basically zero
> > > > cost if you were calling the memcpy function: either way you are making a call through the PLT, so what
> > > > downside is there to having the machine-appropriate entry be selected at dynamic load time?
> > >
> > > What's PLT-based dispatch? I googled and couldn't find anything on it.
> >
> > This.
> >
> > Basically dynamically linked symbol loading happens at runtime anyways, through a layer of indirection,
> > so if you make the symbol lookup arch-aware you basically get arch-aware dispatch for free (again under
> > the assumption you were going to make the function call through the PLT in the first place).
> >
>
> Oh interesting. Still it wouldn't handle optimizing for small vs large memcpys,
> because the same call could alternate between tiny and large copies.

Yeah, it won't do that: it's only to handle platform-specific opts, and only when the function is actually called. You can also just build this yourself w/o relying on IFUNC with a function pointer, which generally has the ~same performance as a plain call anyways.
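To make the "build this yourself" option concrete, here is a minimal sketch of function-pointer dispatch in C. The names `my_copy`, `copy_resolve`, `copy_generic` and `copy_wide` are made up for illustration, the AVX2 check is just an example criterion, and the two "implementations" are placeholders rather than tuned copies. It relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports` on x86 and falls back to the generic version elsewhere. The first call resolves the pointer; later calls go straight to the chosen function. The lazy write is not synchronized, so real code would resolve before starting threads or use an atomic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical baseline implementation: plain byte loop. */
static void *copy_generic(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(n == 0 || (dst != NULL && src != NULL));
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}

/* Hypothetical stand-in for a version tuned for newer hardware. */
static void *copy_wide(void *dst, const void *src, size_t n) {
    return memcpy(dst, src, n);
}

static void *copy_resolve(void *dst, const void *src, size_t n);

/* Starts out pointing at the resolver, then gets overwritten once. */
static copy_fn my_copy = copy_resolve;

/* Runs on the first call only: pick an implementation, patch the
   pointer, and forward the call that triggered the resolution. */
static void *copy_resolve(void *dst, const void *src, size_t n) {
    copy_fn chosen = copy_generic;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) {
        chosen = copy_wide; /* assumed selection criterion */
    }
#endif
    my_copy = chosen;
    return chosen(dst, src, n);
}
```

Callers just write `my_copy(dst, src, n)`: an indirect call through a pointer, much like a call that goes through the PLT.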

Assuming the rep stuff sucks, the ideal memcpy, when the size isn't known, is to inline a small fast case which uses only the instructions enabled by the current -march and doesn't unnecessarily use instructions that might cause downclocking (although this is easy to say, hard to do). This part handles small copies. Then, if that didn't do all the bytes, call a library memcpy, which itself is designed to assume the copy is small(ish): it works well for small copies and doesn't destroy the icache. This thing finally falls back to the full-blown unrolled memcpy if the region is large enough.

These cascading fallbacks don't really cost you much because they always favor the small case, and in the bigger case the few extra cycles never matter.
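A rough sketch of that three-tier cascade in C, to show the shape rather than a tuned implementation. The names `copy_any`, `copy_medium` and `copy_large` and the thresholds 16 and 256 are assumptions picked for illustration, not measured values, and the "large" tier simply defers to the libc memcpy as a stand-in for the unrolled version. The inline tier uses overlapping fixed-size loads and stores, so each size class takes one branch rather than a byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INLINE_MAX 16u   /* assumed cutoff for the inlined fast case */
#define MEDIUM_MAX 256u  /* assumed cutoff for the compact library tier */

/* Tier 3: stand-in for the full-blown unrolled copy. */
static void *copy_large(void *dst, const void *src, size_t n) {
    assert(dst != NULL && src != NULL);
    return memcpy(dst, src, n);
}

/* Tier 2: out-of-line, compact, assumes the copy is small(ish). */
static void *copy_medium(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (n > MEDIUM_MAX) {
        return copy_large(dst, src, n);
    }
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, s, 8); /* fixed-size memcpy: typically one load */
        memcpy(d, &w, 8); /* and one store */
        d += 8;
        s += 8;
        n -= 8;
    }
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}

/* Tier 1: meant to be inlined at the call site. */
static inline void *copy_any(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (n > INLINE_MAX) {
        return copy_medium(dst, src, n);
    }
    if (n >= 8) { /* 8..16 bytes: two possibly overlapping 8-byte moves */
        uint64_t head, tail;
        memcpy(&head, s, 8);
        memcpy(&tail, s + n - 8, 8);
        memcpy(d, &head, 8);
        memcpy(d + n - 8, &tail, 8);
    } else if (n >= 4) { /* 4..7 bytes: two overlapping 4-byte moves */
        uint32_t head, tail;
        memcpy(&head, s, 4);
        memcpy(&tail, s + n - 4, 4);
        memcpy(d, &head, 4);
        memcpy(d + n - 4, &tail, 4);
    } else if (n > 0) { /* 1..3 bytes: first, middle, last */
        d[0] = s[0];
        d[n >> 1] = s[n >> 1];
        d[n - 1] = s[n - 1];
    }
    return dst;
}
```

A large copy pays for two extra compares and a call on the way down, which is the "few extra cycles" that disappear next to the copy itself.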

This still isn't all that great though, just because it has to handle such a wide range of sizes. A per-site adaptive memcpy can do better: even one that adapts at runtime, but much better (well, faster but harder to use) is one that collects profiling info in a separate profiling run, then compiles in the ideal memcpy based on that data (but still does fine if the data is different).
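For the runtime-adaptive variant, a toy sketch in C of what per-site state could look like. Everything here is hypothetical: the `copy_site` struct, `copy_adaptive`, the 64-call sample window and the 16-byte "small" cutoff are invented for illustration, and a real version would want a cheaper bookkeeping path. Each call site owns one `copy_site`; after the sample window it commits to a strategy, but stays correct for any size if later calls look different from the sample.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SAMPLE_WINDOW 64u /* assumed number of calls to observe per site */
#define SMALL_MAX 16u     /* assumed definition of a "small" copy */

/* Per-call-site state; zero-initialize one instance per site. */
typedef struct {
    unsigned total;   /* calls observed so far, capped at SAMPLE_WINDOW */
    unsigned small;   /* how many of those were <= SMALL_MAX bytes */
    int prefer_small; /* decision taken once the window is full */
} copy_site;

static inline void *copy_adaptive(copy_site *site, void *dst,
                                  const void *src, size_t n) {
    assert(site != NULL);
    if (site->total < SAMPLE_WINDOW) {
        site->total++;
        site->small += (n <= SMALL_MAX);
        if (site->total == SAMPLE_WINDOW) {
            /* Commit to the inline path if most sampled copies were small. */
            site->prefer_small = site->small * 2 > site->total;
        }
    }
    if (site->prefer_small && n <= SMALL_MAX) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n > 0) { /* short inline loop, no call */
            *d++ = *s++;
            n--;
        }
        return dst;
    }
    /* Default path, also the safety net when sizes differ from the sample. */
    return memcpy(dst, src, n);
}
```

The profile-guided version would make the same decision at compile time from recorded data, so the counters and the check on `prefer_small` vanish from the hot path.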
