By: David Kanter (dkanter.delete@this.realworldtech.com), August 30, 2021 10:29 pm
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on August 30, 2021 1:03 pm wrote:
> -.- (blarg.delete@this.mailinator.com) on August 29, 2021 4:05 am wrote:
> > ARM's upcoming Cortex A510 uses a shared FPU between two cores, so there's
> > at least a second mainstream player trying out shared FPUs:
> >
> >
> >
> > I recall Bulldozer had minimum 2 cycle latency FPU ops, and current
> > ARM chips generally also have minimum 2 cycle latency FPU ops.
>
> In BD's case, that's probably to hit high clock speeds on a pretty bad node. Integer SIMD
> ops are 1c latency on newer AMD CPUs, but probably 2c in Bulldozer because the units are
> half width. Piledriver could do a couple FPU ops (extrq, insertq) with 1c latency.
>
> Sharing an AVX512 unit between 4 little cores may work, in a way similar to Apple AMX,
> A510 SVE, and IBM Telum's AI accelerator. I think Bulldozer's biggest problems were:
>
> - The 16 KB L1D was too small and write-through
> - Slow L2 has to handle a lot of L1D misses
Also, it could only do a single L2 access/clock IIRC. It probably needed to be able to do 3 given the write-through.
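To put rough numbers on that, here is a back-of-envelope sketch in Python. It is not a model of the real hardware: the function name, the per-core load and store rates, and the miss rate are all hypothetical inputs chosen for illustration, and it ignores any write combining between the L1D and L2. The only point it makes is that with a write-through L1D every store reaches the L2, so the store traffic from two cores adds on top of the L1D miss traffic.

```python
def l2_accesses_per_clock(cores, stores_per_clock, loads_per_clock, l1d_miss_rate):
    """Estimate L2 accesses per clock behind a write-through L1D.

    All inputs are illustrative assumptions, not measured Bulldozer figures.
    """
    # Write-through: every store is forwarded to the L2.
    store_traffic = cores * stores_per_clock
    # Only loads that miss the L1D go on to the L2.
    miss_traffic = cores * loads_per_clock * l1d_miss_rate
    return store_traffic + miss_traffic


# Hypothetical peak case: 2 cores sharing the L2, each issuing 1 store and
# 2 loads per clock, with a made-up 25% L1D load miss rate.
demand = l2_accesses_per_clock(cores=2, stores_per_clock=1,
                               loads_per_clock=2, l1d_miss_rate=0.25)
print(demand)
```

With these made-up inputs the stores alone already ask for 2 accesses per clock, so a single L2 port is oversubscribed before any miss traffic is counted.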
David