Interesting comment about rep instructions & code size

By: Chester (lamchester.delete@this.gmail.com), January 17, 2020 1:16 pm
Room: Moderated Discussions
> > Small size behavior's gonna be bad anyway if a call to memcpy incurs
> > an icache miss and maybe a couple branch mispredicts afterward.
>
> Eh, sure - but you can say that for any X:
>
> > Small size behavior's gonna be bad anyway if a call to X incurs
> > an icache miss and maybe a couple branch mispredicts afterward.
>
> Why would you assume that this would be the case? We are of course also interested in cases where you
> don't get an icache miss and branch mispredicts afterwards, and of course these cases will be very common.
> In particular, in cases where memcpy is a significant portion of the runtime (where its performance matters
> relatively more), we don't expect either of those things to be common (the mispredict *inside* memcpy
> is sometimes unavoidable if the length isn't predictable, even after "quantization").

What workloads have memcpy as a significant portion of runtime? In stuff I've profiled, memcpy has taken 1% (or less) of unhalted cycles. It's like 0.5% in the case of Firefox, and 0.3% for code compilation.

Also, looking closer, memcpy in code compilation (Sandy Bridge) spends almost half its cycles stalled on a full store buffer. So the potential for front-end-related optimization there is already limited.

In Firefox/WebXPRT 3 on Haswell, the vcruntime memcpy ends up on rep movsb a lot. VTune says ~28% of memcpy's uops come from the microcode sequencer, and ms_uops/ms_cycles = 3.55. It doesn't really suffer from icache misses and "only" spends 20% of cycles stalled on a full store buffer. Unlike the previous SnB example, it also gets better branch prediction accuracy (~97% vs ~95% fwiw). Not sure what all that means just yet. It could just be rep movsb inefficiency making things less backend bound.

> > Would you happen to know the rep movsb startup time on modern architectures like Skylake?
>
> Back to back it looks like about 24 cycles for a 1 byte copy on Skylake, from uops.info. It takes 67 uops. So
> it's doing a lot of work. 0 byte copies are fewer uops but about twice as slow, which is kind of weird: you
> don't intentionally copy 0 bytes but of course it comes up in practice all the time with dynamic length.
>
> Ice Lake is much better.

Huh. Going by https://uops.info/html-instr/MOVSB_REPE.html, Conroe has a (reciprocal?) throughput of 12 cycles, beating out even Ice Lake? And spits out fewer uops than SKL/HSW?

> > Oh interesting. Still it wouldn't handle optimizing for small vs large memcpys,
> > because the same call could alternate between tiny and large copies.
>
> Yeah, it won't do that: it's only to handle platform-specific opts, and only when the function
> is actually called. You can also just build this yourself w/o relying on IFUNC with a function
> pointer, which generally has the ~same performance as a plain call anyways.
>
> Assuming the rep stuff sucks, the ideal memcpy, when size isn't known, is to inline a small fast case
> which uses only the instructions enabled by the current -march, and doesn't unnecessarily use instructions
> that might cause a downclocking (although this is easy to say, hard to do): this part handles small copies,
> then if that didn't do all the bytes, call a library memcpy, which itself is designed to assume the
> memcpy is small(ish) and works well for small copies and doesn't destroy the icache, and this thing
> finally falls back to the full-blown unrolled memcpy if the region is large enough.
>
> These cascading fallbacks don't really cost you much because they always favor
> the small case, and in the bigger case the few extra cycles never matter.
>
> This still isn't all that great though, just because it has to handle such a wide range of sizes.
> A per-site adaptive memcpy can do better: even one that adapts at runtime, but much better (well,
> faster but harder to use) is one that collects profiling info in a separate profiling run, then compiles-in
> the ideal memcpy based on that data (but still does fine if the data is different).

Yeah that'd be ideal, and quite complicated...
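
For reference, here is a rough sketch of the cascading fallback described in the quote above: an inline fast path, then a compact out-of-line routine for small(ish) sizes, then the big memcpy. The names (copy_cascade, copy_small, copy_large), the thresholds, and the last_tier variable are all made up for illustration, and the libc memcpy just stands in for the "full-blown unrolled" version. It is not tuned and not meant as how any real libc does it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical thresholds, picked only for illustration; real values
   would have to come from measurement on the target machine. */
enum { INLINE_MAX = 16, SMALL_MAX = 256 };

/* Records which tier handled the last copy (illustration/testing only). */
static int last_tier;

/* Tier 3: stand-in for the full-blown unrolled/vectorized memcpy. */
static void copy_large(void *dst, const void *src, size_t n) {
    last_tier = 3;
    memcpy(dst, src, n);
}

/* Tier 2: compact out-of-line routine that assumes the copy is small(ish)
   and hands anything bigger to tier 3. */
static void copy_small(unsigned char *dst, const unsigned char *src, size_t n) {
    if (n > SMALL_MAX) {
        copy_large(dst, src, n);
        return;
    }
    last_tier = 2;
    /* 8 bytes at a time; a fixed-size memcpy normally compiles to a plain
       load/store pair and avoids unaligned-access undefined behavior. */
    while (n >= 8) {
        uint64_t t;
        memcpy(&t, src, 8);
        memcpy(dst, &t, 8);
        src += 8;
        dst += 8;
        n -= 8;
    }
    while (n--)
        *dst++ = *src++;
}

/* Tier 1: the part meant to be inlined at the call site. Tiny copies never
   leave it, so they pay for no call and touch no extra icache lines. */
static inline void copy_cascade(void *dst, const void *src, size_t n) {
    if (n <= INLINE_MAX) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        last_tier = 1;
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
        return;
    }
    copy_small(dst, src, n);
}
```

Each tier checks the cheap case first, so the small copies pay the least and the large ones only eat a couple of extra compares, which is the point made in the quote. The regions are assumed not to overlap, same as memcpy.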