By: Anon (no.delete@this.spam.com), August 31, 2021 12:28 am
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on August 30, 2021 1:03 pm wrote:
> In BD's case, that's probably to hit high clock speeds on a pretty bad node. Integer SIMD
> ops are 1c latency on newer AMD CPUs, but probably 2c in Bulldozer because the units are
> half width. Piledriver could do a couple FPU ops (extrq, insertq) with 1c latency.
They were 2 cycles because of the FPU design: from the Athlon through Bulldozer, every AMD FPU (by FPU I mean the integer SIMD units) had a minimum 2-cycle latency.
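For anyone who wants to check those latencies on their own hardware, here's a minimal sketch of the usual dependent-chain test in C (my own illustration, not anything from AMD's docs): each paddd consumes its previous result, so the loop runs at the instruction's latency rather than its throughput. The iteration count is arbitrary, and rdtsc counts reference cycles, so scale the result if the core clock and the TSC differ.

/* Dependent-chain latency sketch for integer SIMD adds (paddd).
 * Assumptions: x86-64 with SSE2, GCC/Clang, fixed-rate TSC.
 * Build: gcc -O2 latency.c -o latency
 */
#include <emmintrin.h>  /* SSE2: _mm_add_epi32 */
#include <x86intrin.h>  /* __rdtsc */
#include <stdio.h>

int main(void)
{
    const long iters = 100000000L;
    __m128i v = _mm_set1_epi32(1);

    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < iters; i++) {
        /* Each add consumes the previous result, so the loop runs
         * at paddd's latency, not its throughput. */
        v = _mm_add_epi32(v, v);
    }
    unsigned long long t1 = __rdtsc();

    /* Keep the result live so the chain isn't optimized away. */
    volatile int sink = _mm_cvtsi128_si32(v);
    (void)sink;

    printf("~%.2f TSC cycles per dependent paddd\n",
           (double)(t1 - t0) / (double)iters);
    return 0;
}

If the numbers in this thread are right, a Zen-family part should print close to 1.0 and a Bulldozer-family part close to 2.0 (the loop overhead runs in parallel on the integer pipes, so it shouldn't pollute the measurement much).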
> I think Bulldozer's biggest problems were:
>
> - The 16 KB L1D was too small and write-through
> - Slow L2 has to handle a lot of L1D misses
> - The branch predictor was better than K10's, but not quite as good as Intel's at the time
> - Each module half (thread) just wasn't as beefy as a whole Intel core, which could
> bring a lot more OOO resources into play when one SMT thread is in halt.
> - FP execution units were 128 bits wide (256-bit AVX ops decoded into two micro-ops),
> putting it at a disadvantage vs Sandy Bridge's 256-bit wide units
>
> Then to wrap it up, every single bit of ST performance matters for the desktop
> market. Sharing the FPU is pretty far down on the list of BD's problems, IMO.
Let's go into this discussion again. BD was a bad CPU, I think everybody agrees here; the problem starts when people try to find out why, and then point to everything that was different in BD as a "bad choice". But BD wasn't bad in every aspect: a module's multi-threaded performance was on par with an Intel HT core, it was just the single-threaded performance (which was very important, especially in the consumer market) that was terrible. So please, guys, stop blaming the shared resources (the L2 behind the write-through L1Ds, the FPU, the decoder, the L1I); they were fine at delivering about the same performance as each Intel HT thread. The problem was the non-shared resources, which were way too limited for good single-thread performance.
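To make the "module vs. HT core" comparison concrete, here's a minimal Linux sketch of how one could measure it: run the same kernel on one hardware thread, then on both siblings of a module, and look at the aggregate scaling. Everything here is an assumption for illustration; in particular, cpus 0 and 1 being siblings of the same module depends on the machine's topology (check lstopo or /proc/cpuinfo first).

/* Module/HT scaling sketch (Linux-specific).
 * Build: gcc -O2 -pthread scaling.c -o scaling
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define ITERS 200000000L

static void *kernel(void *arg)
{
    int cpu = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);  /* pin to one hardware thread */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    volatile double x = 1.0;
    for (long i = 0; i < ITERS; i++)
        x = x * 1.0000001 + 0.0000001;  /* dependent FP chain */
    return NULL;
}

static double run(int nthreads, int cpus[])
{
    pthread_t t[2];
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, kernel, &cpus[i]);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    int pair[2] = {0, 1};  /* ASSUMPTION: cpus 0 and 1 share a module */
    double one = run(1, pair);
    double two = run(2, pair);
    /* Two threads do twice the work; perfect sharing-free scaling
     * would give an aggregate speedup of 2.0. */
    printf("1T: %.2fs  2T: %.2fs  aggregate speedup: %.2fx\n",
           one, two, 2.0 * one / two);
    return 0;
}

Running the same test on a BD module pair and an Intel HT sibling pair is essentially what the "on par with an Intel HT core" claim above is about; the interesting part is how the ratio moves as you swap in kernels that stress the shared versus the per-thread resources.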