By: Chester (lamchester.delete@this.gmail.com), August 30, 2021 1:03 pm
Room: Moderated Discussions
-.- (blarg.delete@this.mailinator.com) on August 29, 2021 4:05 am wrote:
> ARM's upcoming Cortex A510 uses a shared FPU between two cores, so there's
> at least a second mainstream player trying out shared FPUs:
>
> I recall Bulldozer had minimum 2 cycle latency FPU ops, and current
> ARM chips generally also have minimum 2 cycle latency FPU ops.
In BD's case, that's probably to hit high clock speeds on a pretty bad node. Integer SIMD ops have 1c latency on newer AMD CPUs, but are probably 2c on Bulldozer because the units are half width. Piledriver could do a couple of FPU ops (extrq, insertq) with 1c latency.
Sharing an AVX512 unit between 4 little cores may work, in a way similar to Apple AMX, A510 SVE, and IBM Telum's AI accelerator. I think Bulldozer's biggest problems were:
- The 16 KB L1D was too small and write-through
- The slow L2 had to handle a lot of L1D misses
- The branch predictor was better than K10's, but not quite as good as Intel's at the time
- Each module half (thread) just wasn't as beefy as a whole Intel core, which could bring a lot more OOO resources into play when one of its SMT threads was halted.
- FP execution units were 128 bits wide (256-bit AVX ops decoded into two micro-ops), putting it at a disadvantage vs Sandy Bridge's 256-bit wide units
Then to wrap it up, every single bit of ST performance matters for the desktop market. Sharing the FPU is pretty far down on the list of BD's problems, IMO.
None of those apply to Gracemont. But Intel would have to solve fairness issues among four cores. Also not sure if it'd be easier to just break 512-bit ops into smaller ones.
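To make the fairness point concrete, here's a rough sketch of round-robin arbitration for a unit shared by four cores. This is purely illustrative and not how Intel (or anyone else) actually does it: the function name, the one-op-per-cycle issue rate, and the per-core op counts are all made up for the example.

```python
def simulate_round_robin(pending_ops, max_cycles=1000):
    """Toy model of a shared vector unit that issues one op per cycle.

    pending_ops: number of vector ops each core wants to issue
                 (hypothetical workload, one entry per core).
    Returns the list of core ids in the order they were granted the unit.
    """
    remaining = list(pending_ops)
    n = len(remaining)
    grants = []
    last = n - 1  # pretend the last core just won, so core 0 goes first

    for _ in range(max_cycles):
        if not any(remaining):
            break
        # Scan starting at the core after the previous winner, so a core
        # with lots of queued ops can't starve the others.
        for offset in range(1, n + 1):
            core = (last + offset) % n
            if remaining[core] > 0:
                remaining[core] -= 1
                grants.append(core)
                last = core
                break
    return grants


if __name__ == "__main__":
    # Core 0 is vector heavy, core 2 issues no vector ops at all.
    print(simulate_round_robin([3, 1, 0, 2]))  # [0, 1, 3, 0, 3, 0]
```

Idle cores get skipped, so a lone vector-heavy core still gets the whole unit. A real design would also have to deal with multi-cycle ops, register file access, and priority between cores, none of which this models.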