Remote atomics

By: matthew (nobody.delete@this.example.com), August 19, 2018 7:58 pm
Room: Moderated Discussions
Travis (travis.downs.delete@this.gmail.com) on August 19, 2018 2:32 pm wrote:
> matthew (nobody.delete@this.example.com) on August 19, 2018 12:51 pm wrote:
> > I think there's a good counter-example to your point, which is the Itanium McKinley implementation
> > of fetchadd. If the line is in L1, the CPU does it. If it's not, the core tells the
> > L2 to do it. Obviously no-one's citing Itanium as a good example of much these days,
> > but I never heard anyone complain about its implementation of atomics.
>
> As far as I can tell McKinley only ever had 1 core, right? So was this some kind
> of optimization for multiple socket machines? The line still had to get back to
> the core, since this is fetch and add, so you always suffer L2 latency?

Yes, the only dual-core-per-socket Itaniums was the Hondo hack which glued two Madisons into a single socket. They didn't even share L3 that I recall. But you have to remember that back in 2002, HP was all about 2-64 socket machines. They weren't optimising for workstations.

> > Let's talk about something that might make sense. A lock is typically on a cacheline which contains other
> > data that is also being operated on. If the lock is uncontended
> > (either no CPU has the cacheline or another
> > CPU has the cacheline in an unlocked state), you want to bring in the cacheline. If the lock is contended,
> > the last thing you want is to steal the cacheline from that CPU; at the very least it wants to access the
> > cacheline again to release the lock, and chances are it's operating on other data in that cacheline.
>
> Definitely, that's one way I can see this working: either with new instructions/hints for the cases the
> developer knows benefit from remote atomics, or some kind of prediction/dynamic behavior. The locking one
> is a good example: right now you have to steal the cache line only to find out that it's locked, maybe
> you could have a type of atomic that leaves the line in place if it is the locked state, since stealing
> it won't be useful anyways. Still, that's not really "remote atomics" because when you do get the lock
> you want to bring it into your cache so that unlock and subsequent lock/unlock pairs are really fast. It's
> more like an optimization of the RFO protocol to embed knowledge of the locking into the probe.

Yes, that's the general direction I'm thinking in.

> Stuff like that doesn't seem to popular though: there is a lot of probably low hanging fruit to
> really optimize locking: for example with some cooperation with the OS scheduler you could avoid
> switching out threads that were in the middle of short, contended critical sections since that ends
> causing a big caravan as all the other threads slam up against the lock before the owning thread
> is scheduled. It seems "easy" to do something like this in a fairly cheap way (e.g,. set a flag
> in the thread control block or something), maybe on an opt-in basis, but it hasn't happened.

Solaris had the ability for a thread to tell the kernel it was holding a mutex and that it shouldn't be preempted until it had dropped the lock. There have been several attempts to add something like that to Linux, but none have succeeded yet. Nothing with hardware assists either.

> > It's fiendishly complicated. It'd almost be better to have a way to transmit a series of operations
> > and have the remote CPU execute them, because there isn't just one way to acquire a lock.
>
> Sure, making it general is tough.
>
> Then you also have to understand what you are saving. The current mechanism is that the waiting
> CPU sends a "message" (a probe) which is snooped by the CPU holding lock, which invalidates its
> copy of the line (without really slowing down), and then the line goes back to the waiting CPU,
> and after that it can perform local operations on it (spinning) until the holding CPU finally
> wants to exit the lock at which point it needs to steal the line back (and then finally the holding
> waiting CPU takes the line back one more time and this time probably gets the lock).

Ah, but the bit you missed is that the cacheline with the lock on it probably also contains data being operated on by the lock owner. So you might well see a flood of coherency traffic as the cacheline transitions from exclusive -> shared -> exclusive -> shared due to all the other CPUs spinning on the cacheline while the owning CPU modifies the cacheline. I believe Intel CPUs (at least) have a hardware optimisation which makes locks more unfair to reduce this traffic, but I know no details.

> Consider sending instructions to a remote CPU to execute. This has the downside of slowing down the very
> CPU that is holding the lock, which is the one agent in the system you don't want to slow down! Also,
> if getting the lock fails, you haven't brought the line back locally, so every time you want to check the
> lock again, you need to interrupt the lock-holding CPU. Add more cores and the problem gets worse.

It depends what your instruction sequence is to acquire the lock. Linux is just "lock cmpxchg" for a spinlock acquire. That's actually fairly easy to handle ... the remote CPU knows whether the cmpxchg succeeded or not and can send the cacheline across:

0000000000000110 :
110: 31 c0 xor %eax,%eax
112: ba 01 00 00 00 mov $0x1,%edx
117: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
11b: 85 c0 test %eax,%eax
11d: 75 01 jne 120
11f: c3 retq

So the requesting CPU only has to send across the contents of %edx and the lock insn. The remote CPU only has to execute the cmpxchg (and that doesn't need much of an ALU!)

Now, that's the fast path. The slow path (ie where the jne leads) does this:

5: ba 01 00 00 00 mov $0x1,%edx
a: 8b 07 mov (%rdi),%eax
c: 85 c0 test %eax,%eax
e: 75 09 jne 19
10: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
14: 85 c0 test %eax,%eax
16: 75 f2 jne a
18: c3 retq
19: f3 90 pause
1b: eb ed jmp a

But if we're going all-out crazy redesign the world, maybe the remote CPU doesn't execute the lock cmpxchg. Maybe we just give the CPU designers one 32-bit word in our data structures and we just have a "spinlock" instruction which ends up being implemented as "the remote CPU does not return the cacheline until you won the lock", and the CPU can use the contents of that word for whatever fair queueing algorithm it wants to use.
< Previous Post in ThreadNext Post in Thread >
TopicPosted ByDate
ARM turns to a god and a heroAM2018/08/16 09:32 AM
  ARM turns to a god and a heroMaynard Handley2018/08/16 09:41 AM
    ARM turns to a god and a heroDoug S2018/08/16 11:11 AM
    ARM turns to a god and a heroGeoff Langdale2018/08/16 11:59 PM
      ARM turns to a god and a herodmcq2018/08/17 05:12 AM
  ARM is somewhat misleadingAdrian2018/08/16 11:56 PM
    It's marketing materialGabriele Svelto2018/08/17 01:00 AM
      It's marketing materialMichael S2018/08/17 03:13 AM
        It's marketing materialdmcq2018/08/17 05:23 AM
          It's marketing materialAndrei Frumusanu2018/08/17 07:25 AM
        It's marketing materialLinus Torvalds2018/08/17 11:20 AM
          It's marketing materialGroo2018/08/17 01:44 PM
            It's marketing materialDoug S2018/08/17 02:14 PM
          promises and deliveriesAM2018/08/17 02:32 PM
            promises and deliveriesPassing Through2018/08/17 03:02 PM
              Just by way of clarification Passing Through2018/08/17 03:15 PM
                Just by way of clarification AM2018/08/18 12:49 PM
                  Just by way of clarification Passing Through2018/08/18 01:34 PM
                    This ain't the nineties any longerPassing Through2018/08/18 01:54 PM
                      This ain't the nineties any longerMaynard Handley2018/08/18 02:50 PM
                        This ain't the nineties any longerPassing Through2018/08/18 03:57 PM
                          This ain't the nineties any longerPassing Through2018/09/06 02:42 PM
                            This ain't the nineties any longerMaynard Handley2018/09/07 04:10 PM
                              This ain't the nineties any longerPassing Through2018/09/07 04:48 PM
                                This ain't the nineties any longerMaynard Handley2018/09/07 05:22 PM
                Just by way of clarification Wilco2018/08/18 01:26 PM
                  Just by way of clarification Passing Through2018/08/18 01:39 PM
                  Just by way of clarification none2018/08/18 10:52 PM
                    Just by way of clarification dmcq2018/08/19 08:32 AM
                      Just by way of clarification none2018/08/19 08:54 AM
                        Just by way of clarification dmcq2018/08/19 11:24 AM
                          Just by way of clarification none2018/08/19 11:52 AM
                  Just by way of clarification Gabriele Svelto2018/08/19 06:41 AM
                    Just by way of clarification Passing Through2018/08/19 09:25 AM
                      Whiteboards at Gatwick airport anyone? Passing Through2018/08/20 04:24 AM
          It's marketing materialMichael S2018/08/18 11:12 AM
          It's marketing materialBrett2018/08/18 05:22 PM
            It's marketing materialBrett2018/08/18 05:33 PM
              It's marketing materialAdrian2018/08/19 01:21 AM
        A76AM2018/08/17 02:45 PM
          A76Michael S2018/08/18 11:20 AM
            A76AM2018/08/18 12:39 PM
              A76Michael S2018/08/18 12:49 PM
                A76AM2018/08/18 01:06 PM
                  A76Doug S2018/08/18 01:43 PM
                    A76Maynard Handley2018/08/18 02:42 PM
                      A76Maynard Handley2018/08/18 04:22 PM
                        Why write zeros when one can use metadata?Paul A. Clayton2018/08/18 06:19 PM
                          Why write zeros when one can use metadata?Maynard Handley2018/08/19 11:12 AM
                            Dictionary compress might apply to memcopyPaul A. Clayton2018/08/19 01:45 PM
                        Instructions for zeroingKonrad Schwarz2018/08/30 06:37 AM
                          Instructions for zeroingMaynard Handley2018/08/30 08:41 AM
                          Instructions for zeroingAdrian2018/08/30 11:37 AM
                            dcbz -> dcbzl (was: Instructions for zeroing)hobold2018/08/31 01:50 AM
                              dcbz -> dcbzl (was: Instructions for zeroing)dmcq2018/09/01 05:28 AM
                      A76Travis2018/08/19 11:36 AM
                        A76Maynard Handley2018/08/19 12:22 PM
                          A76Travis2018/08/19 02:07 PM
                            A76Maynard Handley2018/08/19 06:24 PM
                        Remote atomicsmatthew2018/08/19 12:51 PM
                          Remote atomicsMichael S2018/08/19 01:58 PM
                            Remote atomicsmatthew2018/08/19 02:32 PM
                              Remote atomicsMichael S2018/08/19 02:36 PM
                                Remote atomicsmatthew2018/08/19 02:48 PM
                                  Remote atomicsMichael S2018/08/19 03:16 PM
                                    Remote atomicsRicardo B2018/08/20 10:05 AM
                            Remote atomicsdmcq2018/08/19 02:33 PM
                          Remote atomicsTravis2018/08/19 02:32 PM
                            Remote atomicsMichael S2018/08/19 02:46 PM
                              Remote atomicsTravis2018/08/19 05:35 PM
                                Remote atomicsMichael S2018/08/20 03:29 AM
                            Remote atomicsmatthew2018/08/19 07:58 PM
                              Remote atomicsanon2018/08/20 12:59 AM
                                Remote atomicsTravis2018/08/20 10:26 AM
                              Remote atomicsTravis2018/08/20 09:57 AM
                              Remote atomicsLinus Torvalds2018/08/20 04:29 PM
                                Fitting time slices to execution phasesPaul A. Clayton2018/08/21 09:09 AM
                                  Fitting time slices to execution phasesLinus Torvalds2018/08/21 02:34 PM
                                    Fitting time slices to execution phasesLinus Torvalds2018/08/21 03:31 PM
                                      Fitting time slices to execution phasesGabriele Svelto2018/08/21 03:54 PM
                                        Fitting time slices to execution phasesLinus Torvalds2018/08/21 04:26 PM
                                      Fitting time slices to execution phasesTravis2018/08/21 04:21 PM
                                        Fitting time slices to execution phasesLinus Torvalds2018/08/21 04:39 PM
                                          Fitting time slices to execution phasesTravis2018/08/21 04:59 PM
                                            Fitting time slices to execution phasesLinus Torvalds2018/08/21 05:13 PM
                                      Fitting time slices to execution phasesanon2018/08/21 04:27 PM
                                        Fitting time slices to execution phasesLinus Torvalds2018/08/21 06:02 PM
                                          Fitting time slices to execution phasesEtienne2018/08/22 02:28 AM
                                        Fitting time slices to execution phasesGabriele Svelto2018/08/22 03:07 PM
                                          Fitting time slices to execution phasesTravis2018/08/22 04:00 PM
                                          Fitting time slices to execution phasesanon2018/08/22 06:52 PM
                                    Fitting time slices to execution phasesTravis2018/08/21 04:37 PM
                                    Is preventing misuse that complex?Paul A. Clayton2018/08/23 05:42 AM
                                      Is preventing misuse that complex?Linus Torvalds2018/08/23 12:46 PM
                                        Is preventing misuse that complex?Travis2018/08/23 01:29 PM
                                          Is preventing misuse that complex?Travis2018/08/23 01:33 PM
                                            Is preventing misuse that complex?Jeff S.2018/08/24 07:57 AM
                                              Is preventing misuse that complex?Travis2018/08/24 08:47 AM
                                          Is preventing misuse that complex?Linus Torvalds2018/08/23 02:30 PM
                                            Is preventing misuse that complex?Travis2018/08/23 03:11 PM
                                              Is preventing misuse that complex?Linus Torvalds2018/08/24 01:00 PM
                                                Is preventing misuse that complex?Gabriele Svelto2018/08/24 01:25 PM
                                                  Is preventing misuse that complex?Linus Torvalds2018/08/24 01:33 PM
                                  Fitting time slices to execution phasesTravis2018/08/21 03:54 PM
                                rseq: holy grail rwlock?Travis2018/08/21 03:18 PM
                                  rseq: holy grail rwlock?Linus Torvalds2018/08/21 03:59 PM
                                    rseq: holy grail rwlock?Travis2018/08/21 04:27 PM
                                      rseq: holy grail rwlock?Linus Torvalds2018/08/21 05:10 PM
                                        rseq: holy grail rwlock?Travis2018/08/21 06:21 PM
                  ARM design housesMichael S2018/08/21 05:07 AM
                    ARM design housesWilco2018/08/22 12:38 PM
                      ARM design housesMichael S2018/08/22 02:21 PM
                        ARM design housesWilco2018/08/22 03:23 PM
                          ARM design housesMichael S2018/08/29 01:58 AM
                            Qualcomm's core naming scheme really, really sucksHeikki Kultala2018/08/29 02:19 AM
                A76Maynard Handley2018/08/18 02:07 PM
                  A76Michael S2018/08/18 02:32 PM
                    A76Maynard Handley2018/08/18 02:52 PM
                      A76Michael S2018/08/18 03:04 PM
    ARM is somewhat misleadingjuanrga2018/08/17 01:20 AM
    Surprised??Alberto2018/08/17 01:52 AM
      Surprised??Alberto2018/08/17 02:10 AM
      Surprised??none2018/08/17 02:46 AM
      Garbage talkAndrei Frumusanu2018/08/17 07:30 AM
        Garbage talkMichael S2018/08/17 07:43 AM
          Garbage talkAndrei Frumusanu2018/08/17 09:51 AM
            Garbage talkMichael S2018/08/18 11:29 AM
        Garbage talkAdrian2018/08/17 08:28 AM
          Garbage talkAlberto2018/08/17 09:20 AM
          Garbage talkAndrei Frumusanu2018/08/17 09:48 AM
            Garbage talkAdrian2018/08/17 10:17 AM
              Garbage talkAndrei Frumusanu2018/08/17 10:36 AM
                Garbage talkAdrian2018/08/17 02:53 PM
                  Garbage talkAndrei Frumusanu2018/08/18 12:17 AM
        More like a religion he?? ARM has an easy life :)Alberto2018/08/17 09:13 AM
          More like a religion he?? ARM has an easy life :)Andrei Frumusanu2018/08/17 09:34 AM
            More like a religion he?? ARM has an easy life :)Alberto2018/08/17 10:03 AM
              More like a religion he?? ARM has an easy life :)Andrei Frumusanu2018/08/17 10:43 AM
              More like a religion he?? ARM has an easy life :)Doug S2018/08/17 02:17 PM
              15W phone SoCsAM2018/08/17 03:04 PM
          More like a religion he?? ARM has an easy life :)Maynard Handley2018/08/17 12:29 PM
  my future stuff will be better than your old stuff, hey I'm a god at last (NT)Eric Bron2018/08/18 03:34 AM
    my future stuff will be better than your old stuff, hey I'm a god at lastnone2018/08/18 08:34 AM
Reply to this Topic
Name:
Email:
Topic:
Body: No Text
How do you spell avocado?