Shared FPU wasn't BD's problem

By: Chester (lamchester.delete@this.gmail.com), August 31, 2021 12:29 pm
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on August 31, 2021 9:28 am wrote:
> Chester (lamchester.delete@this.gmail.com) on August 31, 2021 2:58 am wrote:
> > David Kanter (dkanter.delete@this.realworldtech.com) on August 30, 2021 10:29 pm wrote:
> > > Chester (lamchester.delete@this.gmail.com) on August 30, 2021 1:03 pm wrote:
> > > > -.- (blarg.delete@this.mailinator.com) on August 29, 2021 4:05 am wrote:
> > > > > ARM's upcoming Cortex A510 uses a shared FPU between two cores, so there's
> > > > > at least a second mainstream player trying out shared FPUs:
> > > > >
> > > > >
> > > > >
> > > > > I recall Bulldozer had minimum 2 cycle latency FPU ops, and current
> > > > > ARM chips generally also have minimum 2 cycle latency FPU ops.
> > > >
> > > > In BD's case, that's probably to hit high clock speeds on a pretty bad node. Integer SIMD
> > > > ops are 1c latency on newer AMD CPUs, but probably 2c in Bulldozer because the units are
> > > > half width. Piledriver could do a couple FPU ops (extrq, insertq) with 1c latency.
> > > >
> > > > Sharing an AVX512 unit between 4 little cores may work, in a way similar to Apple AMX,
> > > > A510 SVE, and IBM Telum's AI accelerator. I think Bulldozer's biggest problems were:
> > > >
> > > > - The 16 KB L1D was too small and write-through
> > > > - Slow L2 has to handle a lot of L1D misses
> > >
> > > Also, it could only do a single L2 access/clock IIRC. It probably
> > > needed to be able to do 3 given the write-through.
> > >
> > > David
> >
> > I think it mostly needed to be lower latency. From some testing, PD's L2 can do 16B per cycle, per thread.
>
> Yes, but I'm saying it needed to be 2x16B/clock for write-through.
>
> Also the L2 needs to service two instruction caches that were tiny.

Yeah, I counted that in the L2 accesses/c figure below. I had the IC fill, DC fill, TLB fill (page table walk), and L2 prefetcher request unit masks selected.
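
As a rough sanity check on the ~16% bandwidth utilization figure quoted below, here is a minimal sketch of the arithmetic. It assumes each L2 access moves one 64-byte line and takes the 16B per cycle, per thread number from earlier in the thread as the peak. The 64-byte transfer size and the function name are assumptions for illustration, not something the counters report directly.

```python
# Assumptions (not measured here): every L2 access transfers one 64-byte
# line, and peak L2 bandwidth is the 16 B/cycle per-thread figure quoted
# earlier in the thread.
LINE_BYTES = 64
L2_PEAK_BYTES_PER_CYCLE = 16


def l2_bw_utilization(accesses_per_cycle,
                      line_bytes=LINE_BYTES,
                      peak_bytes_per_cycle=L2_PEAK_BYTES_PER_CYCLE):
    """Fraction of peak L2 bandwidth consumed at a given access rate."""
    return accesses_per_cycle * line_bytes / peak_bytes_per_cycle


# 0.04 L2 accesses per cycle, from the Cinebench R15 ST run on the FX-8350
util = l2_bw_utilization(0.04)
print(f"L2 bandwidth utilization: {util:.0%}")
```

Under those assumptions, 0.04 accesses/c works out to 2.56 bytes per cycle against a 16 byte per cycle peak, which lines up with the 16% figure.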

The FX-8350's instruction cache actually wasn't tiny, at 64 KB. With 96.3% hitrate and 7.26 MPKI, it was better than Zen 2's 32 KB L1i (13.76 MPKI - hitrate was somehow higher at 96.83%).
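
A higher hitrate alongside a higher MPKI is possible when the access counts per instruction differ. A minimal sketch, assuming both hitrates are defined as hits divided by accesses (the helper name is hypothetical):

```python
def accesses_per_kilo_instr(mpki, hit_rate):
    """Implied cache accesses per 1000 instructions.

    Assumes hit_rate = 1 - misses / accesses, so
    accesses = misses / (1 - hit_rate).
    """
    return mpki / (1.0 - hit_rate)


# L1i figures from the post above
fx8350_l1i = accesses_per_kilo_instr(7.26, 0.963)    # FX-8350, 64 KB L1i
zen2_l1i = accesses_per_kilo_instr(13.76, 0.9683)    # Zen 2, 32 KB L1i

print(f"FX-8350 implied L1i accesses per 1000 instr: {fx8350_l1i:.0f}")
print(f"Zen 2 implied L1i accesses per 1000 instr:   {zen2_l1i:.0f}")
print(f"Ratio: {zen2_l1i / fx8350_l1i:.2f}x")
```

If that definition holds, the Zen 2 counters record a bit over twice as many L1i accesses per instruction, which would be enough to give it a higher hitrate despite the higher MPKI. What each core's access counter actually counts is not something these numbers settle.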

> > Some data from Cinebench R15 in ST mode on a FX-8350:
> > - L2 accesses/c: 0.04, or about 16% BW utilization
> > - Store queue full: 3.6% of unhalted cycles (not quite enough to point to a write BW bottleneck?)
> > - L1D hitrate: 96.4%, 18.44 MPKI (vs 9.3 MPKI on the 3950X)
> > - L2 data MPKI: 0.65
> > - BPU accuracy: 93.96%, 8.17 MPKI (vs 96%, 5.15 MPKI on the 3950X)
> >
> > So there's a lot of L1D misses, very few L2 misses, and rather low IPC at 0.79. L2 latency
> > is a distinct culprit. More testing is needed of course, but I never got around to it.
>
> Cinebench is a tiny working set. What about real apps with large I and D footprints?
>
> David

Surprisingly it's not that small - enough to spill out of Zen 2's 512 KB L2 on both the instruction and data side. The 3950X had 1.87 L2 code read MPKI and 4.54 L2 data read MPKI. Total L2 MPKI on the FX-8350 was 2.25.

Cinebench's biggest problem is that it doesn't spill out of L3 enough. Still, the instruction footprint is bigger than the vast majority of Geekbench or SPEC workloads, and the data footprint is at least good enough to get past L2 caches. That destroys some CPUs with smaller caches like Jaguar.