By: Dummond D. Slow (mental.delete@this.protozoa.us), November 18, 2020 7:17 am
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on November 18, 2020 7:02 am wrote:
> > 30% figure is an outlier figure for a few workloads. Across
> > the board it's not even close to that. Can we stop
> > declaring SMT an absolute must just because Cinebench has long dependency chains and scales well off of it?
> >
> >
> >
> > Saying stuff like "It's simple: SMT. Had Apple implemented it, it would run away in
> > Cinebench." is just idiotic in the face of the argument you want to make about SMT
> > MT scaling and Apple's reasons and internal decisions about not implementing it.
>
> Sure, Cinebench isn't the best representation of average workloads. But SPEC is far
> worse. No consumer cares about SPEC. The subtests are mostly based off applications
> no one uses, or very specific scientific simulations. It's even useless as a benchmark
> to see whether your system is working properly, because it's so overpriced.
>
> Even when they base a subtest off something a consumer might
> do, the results are hilariously off. For example:
> Encoding a 4K video using ffmpeg libx264 slow preset, on Haswell locked to 2.2 GHz
> - affinity set to 4 threads: 6.6 fps
> - no affinity set: 7.6 fps (1.15x scaling)
You should manually set a lower thread count, because if you only restrict affinity, the encoder may still spawn the same number of threads. Or perhaps the best way is actually switching HT on and off.
x264's thread count is not simply equal to the CPU's thread count. It defaults to CPU threads * 1.5, because a 1:1 ratio didn't saturate the cores/threads with the way frame-threading works, as opposed to slice threading.
So on a 4c/4t CPU, 6 threads are spawned; on a 4c/8t CPU, 12 threads are spawned. And besides that, input decoding and lookahead are done separately (and the lookahead might actually be slice-threaded too now, not sure - x265 does that, x264 does not).
If you effectively ran 12 threads pinned to 4 logical cores, you might have gotten somewhat different FPS (not sure whether for the worse, who knows).
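Roughly what I mean (filenames are placeholders and I'm writing the options from memory, so check them against your ffmpeg build): capping the worker pool directly versus only restricting which logical CPUs it can run on.

  # cap x264's own worker pool to 4 threads instead of relying on affinity
  ffmpeg -i input.mkv -c:v libx264 -preset slow -threads 4 -f null -
  # let x264 pick its default thread count, but restrict it to 4 logical CPUs
  taskset -c 0-3 ffmpeg -i input.mkv -c:v libx264 -preset slow -f null -

In the second case x264 will still spawn its usual oversubscribed thread count and just time-share the 4 CPUs, which is why the two runs aren't measuring the same thing.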
>
> 525.x264_r test: 1.0315x scaling
>
I am not completely sure about this as I can't check, but I have heard that SPEC builds x264 WITHOUT the SIMD assembly (whether by disabling the ASM intentionally or by not compiling with yasm or whatever, no idea). If that is true, it probably grossly changes the character of the workload, because compiler autovectorization tends to fail at vectorizing multimedia integer algorithms. Normally x264 spends something like 55% of its time (a figure the devs gave 8-9 years back) in hand-tuned x86 SIMD.
(Interestingly, you would think that would push a single thread to exploit the execution units more fully, yet it still gains from SMT/HT...)
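If someone wants to check how much the hand-written SIMD matters on their own machine, the standalone x264 CLI can be run with its CPU optimizations turned off - a rough sketch, assuming the --no-asm spelling hasn't changed in current builds and with placeholder filenames:

  # normal run; x264 prints the detected SIMD ("using cpu capabilities: ...") at startup
  x264 --preset slow --crf 23 -o out.mkv input.y4m
  # same encode with the hand-written asm disabled, closer to a compiler-only build
  x264 --no-asm --preset slow --crf 23 -o out.mkv input.y4m

Comparing the FPS of those two runs would show whether the "no SIMD" theory is enough to explain the odd SPEC scaling.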
> A quick look at the benchmark description shows them using -bitrate 1000.
> If that's 1 kbps (or 1 mbps) bitrate, it's hilariously unrealistic.
It has been a long time since I last used anything but CRF, but the bitrate is in kbps because that is the reasonable unit, so that part is not a problem I think; 1000 kbps is fine (depends on the resolution, naturally).
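For reference, the ffmpeg equivalent of what SPEC's "-bitrate 1000" should mean (assuming it maps to x264's kbit/s bitrate target; filenames are placeholders):

  # 1000 kbit/s average bitrate target with libx264
  ffmpeg -i input.mkv -c:v libx264 -preset slow -b:v 1000k output.mp4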
>
> Now take AT Bench's POV-Ray scores for the i5-6600K (1741, 3.6 GHz all core turbo)
> and i7-6700K (2419, 4.2 GHz all core turbo). Scaling down the 6700K's score to account
> for clock speed difference gives 2073. SMT scaling would be roughly 1.19x
>
> 511.povray_r: 1.0291x scaling.
>
> What's going on?
>
> Also, some SPEC numbers make it seem like negative SMT scaling is common. It's not. I've
> personally never seen an application that can use all available threads do worse when
> SMT is enabled. Can we stop looking at the irrelevant pile of garbage that is SPEC?
>
> And what makes you claim "Cinebench has long dependency chains"? How do you know SMT scaling is from that
> rather than hiding cache misses better? Because Cinebench (R20, have not tested R23) does suffer from L1/L2
> cache misses. In ST mode, execution spends 9% of cycles stalled with a L1D miss pending. You get about
> 16.4 L1D MPKI if you count loads hitting the fill buffer as misses. And L2 hitrate is around 50%.