By: Chester (lamchester.delete@this.gmail.com), November 18, 2020 8:02 am
Room: Moderated Discussions
> 30% figure is an outlier figure for a few workloads. Across the board it's not even close to that. Can we stop
> declaring SMT an absolute must just because Cinebench has long dependency chains and scales well off of it?
>
>
>
> Saying stuff like "It's simple: SMT. Had Apple implemented it, it would run away in
> Cinebench." is just idiotic in the face of the argument you want to make about SMT
> MT scaling and Apple's reasons and internal decisions about not implementing it.
Sure, Cinebench isn't the best representation of average workloads. But SPEC is far worse. No consumer cares about SPEC. The subtests are mostly based on applications no one uses, or on very specific scientific simulations. It's even useless as a sanity check that your system is working properly, because it's so overpriced.
Even when they base a subtest on something a consumer might actually do, the results are hilariously off. For example:
Encoding a 4K video with ffmpeg's libx264 slow preset, on a Haswell locked to 2.2 GHz:
- affinity set to 4 threads: 6.6 fps
- no affinity set: 7.6 fps (1.15x scaling)
525.x264_r test: 1.0315x scaling
A quick look at the benchmark description shows them using -bitrate 1000. If that's 1 kbps (or even 1 Mbps), it's a hilariously unrealistic bitrate for 4K video.
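The SMT scaling figure above falls straight out of the two measured framerates; a quick sketch of the arithmetic (using only the numbers already quoted, nothing new):

```python
# SMT scaling from the ffmpeg libx264 measurements quoted above.
fps_4_threads = 6.6   # affinity pinned to 4 threads (one per physical core)
fps_8_threads = 7.6   # no affinity, all 8 hardware threads in use

smt_scaling = fps_8_threads / fps_4_threads
print(f"ffmpeg libx264 SMT scaling: {smt_scaling:.2f}x")   # ~1.15x

# Compare against what SPEC's 525.x264_r subtest shows:
spec_x264_scaling = 1.0315
print(f"SPEC 525.x264_r scaling:    {spec_x264_scaling:.2f}x")
```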
Now take AT Bench's POV-Ray scores for the i5-6600K (1741, 3.6 GHz all-core turbo) and i7-6700K (2419, 4.2 GHz all-core turbo). Scaling the 6700K's score down to account for the clock speed difference gives about 2073, so SMT scaling would be roughly 1.19x.
511.povray_r: 1.0291x scaling.
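The clock normalization above can be spelled out in a few lines; this assumes POV-Ray throughput scales linearly with clock speed over this range, which is reasonable for a compute-bound renderer:

```python
# Clock-normalizing the AT Bench POV-Ray scores quoted above to isolate SMT.
score_6600k = 1741    # i5-6600K, 4C/4T,  3.6 GHz all-core turbo
score_6700k = 2419    # i7-6700K, 4C/8T,  4.2 GHz all-core turbo

# Scale the 6700K result down to the 6600K's clock speed.
score_6700k_at_3_6 = score_6700k * (3.6 / 4.2)
smt_scaling = score_6700k_at_3_6 / score_6600k

print(f"normalized 6700K score: {score_6700k_at_3_6:.0f}")   # ~2073
print(f"implied SMT scaling:    {smt_scaling:.2f}x")          # ~1.19x
```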
What's going on?
Also, some SPEC numbers make it seem like negative SMT scaling is common. It's not. I've personally never seen an application that can use all available threads do worse when SMT is enabled. Can we stop looking at the irrelevant pile of garbage that is SPEC?
And what makes you claim "Cinebench has long dependency chains"? How do you know the SMT scaling comes from that rather than from hiding cache misses better? Because Cinebench (R20; I have not tested R23) does suffer from L1/L2 cache misses. In ST mode, execution spends 9% of cycles stalled with an L1D miss pending. You get about 16.4 L1D MPKI if you count loads hitting the fill buffer as misses, and L2 hitrate is around 50%.
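For anyone unfamiliar with the MPKI figure, it's just misses per thousand instructions; a minimal sketch of the calculation, where the raw counter values below are illustrative placeholders rather than the actual measurements:

```python
# MPKI = misses per kilo-instruction, derived from two hardware counters.
def mpki(misses: int, instructions: int) -> float:
    return misses * 1000 / instructions

# Hypothetical counter readings chosen to reproduce the 16.4 MPKI figure;
# per the post, L1D "misses" here include loads hitting the fill buffer.
l1d_misses   = 16_400_000
instructions = 1_000_000_000

print(f"{mpki(l1d_misses, instructions):.1f} L1D MPKI")   # 16.4 L1D MPKI
```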