Apple M1 TSO + RefCounting optimizations

By: anon2 (anon.delete@this.anon.com), November 25, 2020 5:18 pm
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on November 25, 2020 1:33 pm wrote:
> gpd (gpd.delete@this.gpd.com) on November 25, 2020 3:25 am wrote:
> > nksingh (None.delete@this.none.non) on November 24, 2020 6:24 pm wrote:
> > > An interesting thing I saw today was that Apple's M1 performance cores have a mode flag
> > > to enable TSO on all memory accesses: https://github.com/saagarjha/TSOEnabler.
> > >
> > > This is one of the largest factors affecting performance of emulating X86 for Windows on ARM64.
> > > I wonder what having this flag does to their hardware? It would be interesting to see how
> > > store forwarding and memory disambiguation tests change when the flag is on and off.
> > >
> > > Hopefully ARM will also release cores with TSO mode. It would
> > > probably make managed languages run a little faster.
> > >
> > > There were also claims floating around today about Apple doing something to optimize reference
> > > counting to help their language runtime: https://twitter.com/Catfish_Man/status/1326238434235568128.
> > > Hopefully someone will post the code sequence that they're using on the M1. "Retain" doesn't
> > > need any fences, but "Release" should require a release fence. Maybe the TSO hardware helps speed
> > > this case up too. Or maybe they did something even more specific to refcounting.
> > >
> > > I think these optimizations show that Apple's work is very much in line with Linus' continual
> > > refrain that hardware should optimize for common system-level patterns. If refcounting is fast
> > > and doesn't introduce excess fencing, the motivation for something like RCU might be smaller.
> > >
> > > I'm looking forward to people trying out code sequences on
> > > the M1 to see how advanced the memory subsystem really is.
> >
> > One issue with refcounting is that even read-only sharing requires writing to the refcount cacheline
> > with all the implied multi-threading scalability issues.
> > Maybe they have some specialized remote-refcount-update
> > instruction that doesn't require acquiring the cacheline in exclusive mode?
> >
> > But most likely they just optimized the very common non-threaded mode and have a very cheap release RMW.
>
> They have probably implemented the STADD atomic instructions in a way that
> lets them update a value in another processor's cache. These are fire-and-forget
> instructions - the processor doesn't need to know what the sum is.
>
>

That does not sound probable at all.
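
For reference, the retain/release ordering nksingh describes up-thread can be sketched in portable C++. This is only an illustration under assumptions: `RefCounted`, `retain`, `release` and `demo` are hypothetical names, not Apple's runtime, and it says nothing about what the M1 actually does in hardware. The increment discards its result, which is the "fire and forget" shape that a compiler targeting the ARMv8.1 atomics may (or may not) lower to a store-form atomic such as STADD. The decrement has to look at the old value, so it cannot be fire and forget.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical intrusive refcount, for illustration only.
struct RefCounted {
    std::atomic<long> refs{1};   // the creator holds the first reference
    bool destroyed = false;      // stand-in for running a destructor
};

// "Retain": a relaxed increment whose result is unused; no fence is needed
// because the caller already holds a reference.
inline void retain(RefCounted& o) {
    o.refs.fetch_add(1, std::memory_order_relaxed);
}

// "Release": a release-ordered decrement, so this thread's earlier writes to
// the object are visible to whichever thread ends up destroying it.
// Returns true if this call dropped the last reference.
inline bool release(RefCounted& o) {
    long prev = o.refs.fetch_sub(1, std::memory_order_release);
    assert(prev > 0 && "over-release");
    if (prev == 1) {
        // Acquire fence only on the final drop, pairing with the release
        // decrements performed by the other owners.
        std::atomic_thread_fence(std::memory_order_acquire);
        o.destroyed = true;
        return true;
    }
    return false;
}

// Usage sketch: two owners; only the second release() reports the final drop.
inline bool demo() {
    RefCounted obj;                // refs == 1
    retain(obj);                   // refs == 2
    bool first = release(obj);     // refs == 1
    bool second = release(obj);    // refs == 0
    return !first && second && obj.destroyed;
}
```

Note that even in this form every retain still writes the refcount cacheline, which is the scalability issue gpd raises. The sketch only shows where the ordering requirements sit, not how cheap the RMW is on any particular core.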