Shared FPU wasn't BD's problem

By: P Snip (hulabaloo.delete@this.gmail.com), August 30, 2021 1:53 pm
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on August 30, 2021 1:03 pm wrote:
> -.- (blarg.delete@this.mailinator.com) on August 29, 2021 4:05 am wrote:
> > ARM's upcoming Cortex A510 uses a shared FPU between two cores, so there's
> > at least a second mainstream player trying out shared FPUs:
> >
> > I recall Bulldozer had minimum 2 cycle latency FPU ops, and current
> > ARM chips generally also have minimum 2 cycle latency FPU ops.
>
> In BD's case, that's probably to hit high clock speeds on a pretty bad node. Integer SIMD
> ops are 1c latency on newer AMD CPUs, but probably 2c in Bulldozer because the units are
> half width. Piledriver could do a couple FPU ops (extrq, insertq) with 1c latency.
>
> Sharing an AVX512 unit between 4 little cores may work, in a way similar to Apple AMX,
> A510 SVE, and IBM Telum's AI accelerator. I think Bulldozer's biggest problems were:
>
> - The 16 KB L1D was too small and write-through
> - Slow L2 has to handle a lot of L1D misses
> - The branch predictor was better than K10's, but not quite as good as Intel's at the time
> - Each module half (thread) just wasn't as beefy as a whole Intel core, which could
> bring a lot more OOO resources into play when one SMT thread is in halt.
> - FP execution units were 128 bits wide (256-bit AVX ops decoded into two micro-ops),
> putting it at a disadvantage vs Sandy Bridge's 256-bit wide units
>
> Then to wrap it up, every single bit of ST performance matters for the desktop
> market. Sharing the FPU is pretty far down on the list of BD's problems, IMO.
>
> None of those apply to Gracemont. But Intel would have to solve fairness issues among four
> cores. Also not sure if it'd be easier to just break 512-bit ops into smaller ones.

Didn't Agner Fog do a pretty good number on BD's shortcomings back in the day?

Off the top of my head:

Write-through L1D pushing store traffic onto the slow, shared L2,
Inadequate fetch bandwidth (when Steamroller moved from a single shared decoder to per-core decoders, fetch was found unable to keep up),
*then* you can get onto the narrow pipeline, its length, and the general horribleness of just about every latency under the sun (the usual way those latencies get pinned down is sketched below).
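
For what it's worth, the "minimum 2-cycle FPU latency" figures being quoted upthread typically come from timing a long dependent chain of FP ops, so each op has to wait for the previous result. Here's a minimal sketch of that kind of measurement, assuming a POSIX system; the file name, iteration count, and the 1e-9 addend are all just illustrative, not from anyone's actual test harness:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

int main(void) {
    volatile double seed = 1.0;   /* volatile read so the chain can't be folded at compile time */
    double x = seed;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < ITERS; i++) {
        /* each add depends on the previous result, so elapsed time / ITERS ~= add latency */
        x += 1e-9;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("result=%f  ns per dependent add=%.3f\n", x, secs * 1e9 / ITERS);
    return 0;
}

Multiply the reported ns/op by the core clock in GHz to get cycles per op, and run it pinned to one core with the frequency fixed, otherwise boost and idle states skew the number. Don't build with -ffast-math, or the compiler is free to break the dependency chain.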