Interesting comment about rep instructions & code size

By: Travis Downs (travis.downs.delete@this.gmail.com), January 17, 2020 3:41 pm
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on January 17, 2020 12:16 pm wrote:
> > > Small size behavior's gonna be bad anyway if a call to memcpy incurs
> > > an icache miss and maybe a couple branch mispredicts afterward.
> >
> > Eh, sure - but you can say that for any X:
> >
> > > Small size behavior's gonna be bad anyway if a call to X incurs
> > > an icache miss and maybe a couple branch mispredicts afterward.
> >
> > Why would you assume that this would be the case? We are of course also interested in cases where you
> > don't get an icache miss or branch mispredicts afterwards, and of course these cases will be very common.
> > In particular, in cases where memcpy is a significant portion of the runtime (where its performance matters
> > relatively more), we don't expect either of those things to be common (the mispredict *inside* memcpy
> > is sometimes unavoidable if the length isn't predictable, even after "quantization").
>
> What workloads have memcpy as a significant portion of runtime? In stuff I've profiled, memcpy has taken
> 1% (or less) of unhalted cycles. It's like 0.5% in the case of Firefox, and 0.3% for code compilation.

Lots, I guess. The mem* routines and a few string routines like strlen are famous for eating a lot of CPU time, which is why they get so much optimization focus, and why various runtimes carry something like 20 bespoke asm versions of them for different platforms and ISA extensions.
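
For context, the runtime dispatch usually looks roughly like the sketch below. This is a hypothetical, simplified example (my_memcpy, resolve_my_memcpy and the variant names are made up) using GCC's ifunc attribute and __builtin_cpu_supports; a real libc resolver checks many more features, points at hand-written asm variants, and is careful about what it can safely call at resolve time:

#include <stddef.h>

/* Stand-ins for the hand-written asm variants a real libc would ship. */
static void *my_memcpy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

static void *my_memcpy_avx2(void *dst, const void *src, size_t n)
{
    /* In a real library this would be the vectorized version;
     * here it just falls through to the generic copy. */
    return my_memcpy_generic(dst, src, n);
}

/* Resolver runs once at load time and picks an implementation
 * based on the CPU features it finds. */
static void *(*resolve_my_memcpy(void))(void *, const void *, size_t)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return my_memcpy_avx2;
    return my_memcpy_generic;
}

/* Callers just call my_memcpy(); the dynamic linker binds it to
 * whichever variant the resolver returned. */
void *my_memcpy(void *dst, const void *src, size_t n)
    __attribute__((ifunc("resolve_my_memcpy")));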

Obviously, due to inlining and runtime dispatching, it's not always "memcpy" that shows up in the profile either.

Certainly, I regularly profile things where actual memcpy or "memcpy-like" operations take 10% or more of the runtime. Often, the more you optimize the core processing you are doing, the more the memcpy that was always there but didn't matter becomes a larger share of the processing time.

It's not weird to have something that spends little time in memcpy, and I imagine big, complicated, branchy things like Firefox and compilers are good examples. Those people shouldn't care much about memcpy at all, as long as it doesn't mess up their icache.

>
> Also looking closer, memcpy in code compilation (sandy bridge) spends almost half its cycles stalled
> on a full store buffer. So FE-related optimization potential there is already limited.
>
> In Firefox/webxprt3 on Haswell, the vcruntime memcpy ends up on rep movsb a lot. VTune says ~28% of
> memcpy's uops come from the microcode sequencer, and ms_uops/ms_cycles = 3.55. It doesn't really suffer
> from icache misses and "only" spends 20% of cycles stalled on a full store buffer. Unlike the previous
> SnB example, it also gets better branch prediction accuracy (~97% vs ~95% fwiw). Not sure what all that
> means just yet. It could just be rep movsb inefficiency making things less backend bound.

The icache miss impact can be worse on subsequent code than on the memcpy code itself. The first-order approximation is that the cost is split 50/50: every icache miss taken in memcpy also evicts a line, which means one more miss that you wouldn't otherwise have taken in subsequent code. However, memcpy is one function with a more or less linear code layout, so taking, say, 5 misses in a linear pattern is probably cheaper than 5 misses at essentially arbitrary locations in subsequent code.

Google released a good paper recently about memcmp and its icache impact, and how, for them, a smaller, slower memcmp was better. However, the memcmp they used was really obscene, with over 5,000 bytes of code actually touched (i.e., they were counting the executed icache lines, not just the whole function, most of which might be cold). So although I believe it was better for them, their baseline was obviously quite silly: there's no reason for a >5,000 byte memcmp in the first place.

>
> > > Would you happen to know the rep movsb startup time on modern architectures like Skylake?
> >
> > Back to back, it looks like about 24 cycles for a 1-byte
> > copy on Skylake, from uops.info. It takes 67 uops, so
> > it's doing a lot of work. 0-byte copies are fewer uops but about twice as slow, which is kind of weird: you
> > don't intentionally copy 0 bytes, but of course it comes up in practice all the time with dynamic lengths.
> >
> > Ice Lake is much better.
>
> Huh. Going by https://uops.info/html-instr/MOVSB_REPE.html, Conroe has a (reciprocal?) throughput
> of 12 cycles, beating out even Ice Lake? And spits out fewer uops than SKL/HSW?

Could be. The uarch has changed a lot. I have no doubt SKL/HSW are better for longer copies, but they suck at shorter ones. You should also take those microbenchmark numbers with a grain of salt: with special instructions there is always the possibility of a testing artifact when the numbers seem high (they are worse than I remembered).
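
If you want to sanity-check that kind of number yourself, something like the rough sketch below does the back-to-back small-copy test I mean (hypothetical code, x86-64 with GCC/Clang inline asm; note rdtsc counts reference cycles, so convert by the TSC/core-clock ratio or pin the frequency if you want actual core cycles):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtscp */

/* One rep movsb copy of n bytes from src to dst. */
static inline void rep_movsb(void *dst, const void *src, size_t n)
{
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(n)
                 :
                 : "memory");
}

int main(void)
{
    static unsigned char src[64], dst[64];
    const int iters = 1000000;
    unsigned aux;

    uint64_t start = __rdtscp(&aux);
    for (int i = 0; i < iters; i++)
        rep_movsb(dst, src, 1);   /* 1-byte copies, back to back */
    uint64_t end = __rdtscp(&aux);

    /* Reference cycles per copy, averaged over the loop. */
    printf("%.1f ref cycles per 1-byte rep movsb\n",
           (double)(end - start) / iters);
    return 0;
}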