By: Chester (lamchester.delete@this.gmail.com), August 31, 2021 2:58 am
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on August 30, 2021 10:29 pm wrote:
> Chester (lamchester.delete@this.gmail.com) on August 30, 2021 1:03 pm wrote:
> > -.- (blarg.delete@this.mailinator.com) on August 29, 2021 4:05 am wrote:
> > > ARM's upcoming Cortex A510 uses a shared FPU between two cores, so there's
> > > at least a second mainstream player trying out shared FPUs:
> > >
> > >
> > >
> > > I recall Bulldozer had minimum 2 cycle latency FPU ops, and current
> > > ARM chips generally also have minimum 2 cycle latency FPU ops.
> >
> > In BD's case, that's probably to hit high clock speeds on a pretty bad node. Integer SIMD
> > ops are 1c latency on newer AMD CPUs, but probably 2c in Bulldozer because the units are
> > half width. Piledriver could do a couple FPU ops (extrq, insertq) with 1c latency.
> >
> > Sharing an AVX512 unit between 4 little cores may work, in a way similar to Apple AMX,
> > A510 SVE, and IBM Telum's AI accelerator. I think Bulldozer's biggest problems were:
> >
> > - The 16 KB L1D was too small and write-through
> > - Slow L2 has to handle a lot of L1D misses
>
> Also, it could only do a single L2 access/clock IIRC. It probably
> needed to be able to do 3 given the write-through.
>
> David
I think it mostly needed to be lower latency. From some testing, PD's L2 can do 16B per cycle, per thread.
Some data from Cinebench R15 in ST mode on an FX-8350:
- L2 accesses/c: 0.04, or about 16% BW utilization
- Store queue full: 3.6% of unhalted cycles (not quite enough to point to a write BW bottleneck?)
- L1D hitrate: 96.4%, 18.44 MPKI (vs 9.3 MPKI on the 3950X)
- L2 data MPKI: 0.65
- BPU accuracy: 93.96%, 8.17 MPKI (vs 96%, 5.15 MPKI on the 3950X)
So there are a lot of L1D misses, very few L2 misses, and rather low IPC at 0.79. That makes L2 latency a likely culprit. More testing is needed of course, but I never got around to it.
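For anyone wanting to check the arithmetic, here is a rough sketch of how the numbers above fit together. It assumes 64B cache lines per L2 access and takes the 16B/cycle figure as peak L2 bandwidth, which reproduces the ~16% utilization. The 20 cycle L2 latency is a placeholder assumption and not something measured in this post, and the stall estimate ignores any overlap between misses, so treat it as a crude upper bound rather than a real model. Function and parameter names are made up for illustration.

```python
# Back-of-envelope check of the counter data above.
# Assumptions (not from the measurements): 64B per L2 access,
# a hypothetical 20 cycle L2 latency, and no miss overlap.

def l2_bw_utilization(accesses_per_cycle, line_bytes=64, peak_bytes_per_cycle=16):
    """Fraction of peak L2 bandwidth used, given L2 accesses per cycle."""
    return accesses_per_cycle * line_bytes / peak_bytes_per_cycle


def miss_latency_fraction(mpki, ipc, latency_cycles):
    """Upper bound on the share of cycles spent waiting on misses.

    mpki * latency gives stall cycles per 1000 instructions if every miss
    is fully exposed; 1000 / ipc is total cycles per 1000 instructions.
    """
    stall_cycles_per_kilo_instr = mpki * latency_cycles
    total_cycles_per_kilo_instr = 1000 / ipc
    return stall_cycles_per_kilo_instr / total_cycles_per_kilo_instr


if __name__ == "__main__":
    util = l2_bw_utilization(0.04)
    print(f"L2 BW utilization: {util:.0%}")

    # 18.44 L1D MPKI and 0.79 IPC are from the list above;
    # the 20 cycle latency is an assumed value, swap in a measured one.
    bound = miss_latency_fraction(mpki=18.44, ipc=0.79, latency_cycles=20)
    print(f"Cycles exposed to L2 latency (upper bound): {bound:.0%}")
```

Even if out-of-order execution hides a good part of that bound, it is far larger than the bandwidth utilization, which is consistent with latency mattering more than bandwidth here.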