By: Kevin G (kevin.delete@this.cubitdesigns.com), August 31, 2021 9:39 am
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on August 30, 2021 1:03 pm wrote:
> -.- (blarg.delete@this.mailinator.com) on August 29, 2021 4:05 am wrote:
> > ARM's upcoming Cortex A510 uses a shared FPU between two cores, so there's
> > at least a second mainstream player trying out shared FPUs:
> >
> >
> >
> > I recall Bulldozer had minimum 2 cycle latency FPU ops, and current
> > ARM chips generally also have minimum 2 cycle latency FPU ops.
>
> In BD's case, that's probably to hit high clock speeds on a pretty bad node. Integer SIMD
> ops are 1c latency on newer AMD CPUs, but probably 2c in Bulldozer because the units are
> half width. Piledriver could do a couple FPU ops (extrq, insertq) with 1c latency.
>
> Sharing an AVX512 unit between 4 little cores may work, in a way similar to Apple AMX,
> A510 SVE, and IBM Telum's AI accelerator. I think Bulldozer's biggest problems were:
>
> - The 16 KB L1D was too small and write-through
> - The slow L2 had to handle a lot of L1D misses
> - The branch predictor was better than K10's, but not quite as good as Intel's at the time
> - Each module half (thread) just wasn't as beefy as a whole Intel core, which could
> bring a lot more OOO resources into play when one SMT thread is halted.
> - FP execution units were 128 bits wide (256-bit AVX ops decoded into two micro-ops),
> putting it at a disadvantage vs Sandy Bridge's 256-bit wide units
>
> Then to wrap it up, every single bit of ST performance matters for the desktop
> market. Sharing the FPU is pretty far down on the list of BD's problems, IMO.
>
> None of those apply to Gracemont. But Intel would have to solve fairness issues among four
> cores. Also not sure if it'd be easier to just break 512-bit ops into smaller ones.
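
On the 2-cycle latency point earlier in the quote: FP op latency is easy to check with a long chain of dependent adds, since throughput then collapses to the add latency. A minimal sketch in C for x86 (my own toy, not from any A510 or Bulldozer documentation; note that __rdtsc counts TSC reference ticks, which only match core cycles if the clock is pinned):

    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc, available in GCC/Clang on x86 */

    int main(void) {
        const long iters = 100000000;
        double x = 1.0;
        unsigned long long t0 = __rdtsc();
        for (long i = 0; i < iters; i++)
            x += 1.0;        /* each add depends on the previous result */
        unsigned long long t1 = __rdtsc();
        /* cycles per iteration ~= FP add latency on the tested core */
        printf("%.2f cycles/add (x=%.0f)\n", (double)(t1 - t0) / iters, x);
        return 0;
    }

Build with something like gcc -O2; without -ffast-math the compiler can't reassociate the dependent chain, so the loop stays latency-bound.
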
I would add that the shared decoders in Bulldozer were also a bottleneck without a uOp cache. This was fixed in Steamroller, which gave each core in a module its own set of decoders.
With my armchair knowledge, I would argue that a shared decoder could have worked if it was backed by a uOp cache per thread and the decoder itself was wider.
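
To put rough numbers on that, here's a back-of-envelope model (the widths and hit rate are invented for illustration, not measured) of per-thread frontend bandwidth for a module where a 4-wide decoder is shared by two threads, with and without a per-thread uOp cache:

    #include <stdio.h>

    /* Average uops/cycle one thread can fetch: a decoder of width decode_w
       time-sliced across `threads`, plus a uop cache of width uc_w that
       serves a fraction `hit` of fetches without touching the decoder. */
    static double frontend_uops(double decode_w, int threads,
                                double uc_w, double hit) {
        double decode_share = decode_w / threads;
        return hit * uc_w + (1.0 - hit) * decode_share;
    }

    int main(void) {
        /* hit = 0 models original Bulldozer: decoder-only frontend */
        printf("shared 4-wide decoder, no uop cache: %.1f uops/cycle/thread\n",
               frontend_uops(4, 2, 0, 0.0));
        /* hypothetical per-thread 6-wide uop cache at an assumed 80% hit rate */
        printf("same decoder + per-thread uop cache: %.1f uops/cycle/thread\n",
               frontend_uops(4, 2, 6, 0.8));
        return 0;
    }

With those (invented) numbers, the per-thread frontend goes from 2.0 to 5.2 uops/cycle: once most fetches hit the uOp cache, the shared decoder stops being the limiter, which is roughly the argument above.
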