By: Chester (lamchester.delete@this.gmail.com), November 18, 2020 3:06 pm
Room: Moderated Discussions
Dummond D. Slow (mental.delete@this.protozoa.us) on November 18, 2020 7:17 am wrote:
> Chester (lamchester.delete@this.gmail.com) on November 18, 2020 7:02 am wrote:
> > > 30% figure is an outlier figure for a few workloads. Across
> > > the board it's not even close to that. Can we stop
> > > declaring SMT an absolute must just because Cinebench has long dependency chains and scales well off of it?
> > >
> > >
> > >
> > > Saying stuff like "It's simple: SMT. Had Apple implemented it, it would run away in
> > > Cinebench." is just idiotic in the face of the argument you want to make about SMT
> > > MT scaling and Apple's reasons and internal decisions about not implementing it.
> >
> > Sure, Cinebench isn't the best representation of average workloads. But SPEC is far
> > worse. No consumer cares about SPEC. The subtests are mostly based off applications
> > no one uses, or very specific scientific simulations. It's even useless as a benchmark
> > to see whether your system is working properly, because it's so overpriced.
> >
> > Even when they base a subtest off something a consumer might
> > do, the results are hilariously off. For example:
> > Encoding a 4K video using ffmpeg libx264 slow preset, on Haswell locked to 2.2 GHz
> > - affinity set to 4 threads: 6.6 fps
> > - no affinity set: 7.6 fps (1.15x scaling)
>
> You should manually set a lower number of threads, because if you do this via affinity, the encoder might still
> spawn the same number of threads. Or perhaps the best way is actually switching HT on and off.
> x264's thread number is complex; it is not equal to the CPU's thread count. It is CPU threads * 1.5, because * 1 didn't
> saturate the cores/threads due to the way frame threading works, as opposed to slice threading.
> So if you have a 4c/4t CPU, 6 threads are spawned. If you are on 4c/8t, 12 threads are spawned.
> And besides that, input decoding and lookahead are done separately (and the lookahead might
> actually be slice-threaded too now, not sure; x265 does that, x264 does not).
> If you still used 12 threads pinned to 4 logical cores, you might have
> gotten a somewhat different FPS, possibly (not sure if for worse, who knows).
Ok, I'll try this sometime.
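Roughly what I have in mind, as a sketch only: it assumes Linux, an ffmpeg build with libx264, a hypothetical input.mkv source clip, and an arbitrary 8000k bitrate, none of which come from your post. One run pins the process to 4 logical CPUs and lets x264 size its own thread pool; the other asks for 4 encoder threads explicitly with no pinning.

import os, subprocess, time

INPUT = "input.mkv"   # hypothetical source clip, placeholder name

def encode(extra_args, cpus=None):
    """Run one null-output encode; optionally pin the child process to a CPU set (Linux only)."""
    cmd = ["ffmpeg", "-y", "-i", INPUT, "-c:v", "libx264",
           "-preset", "slow", "-b:v", "8000k", *extra_args,
           "-f", "null", "-"]
    pin = (lambda: os.sched_setaffinity(0, cpus)) if cpus else None
    t0 = time.perf_counter()
    subprocess.run(cmd, preexec_fn=pin, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - t0

# Affinity only: the encoder may still size its thread pool off the machine's
# full logical CPU count (the concern raised above), with all of them sharing 4 CPUs.
t_pinned  = encode([], cpus={0, 1, 2, 3})
# Explicit thread count, no pinning: 4 encoder threads spread over all CPUs.
t_threads = encode(["-threads", "4"])

print(f"pinned to 4 CPUs: {t_pinned:.1f}s   4 encoder threads: {t_threads:.1f}s")

If the thread-pool sizing really is what's going on, the two timings should diverge noticeably on an 8-thread part.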
>
> >
> > 525.x264_r test: 1.0315x scaling
> >
>
> You know, I am not completely sure about this as I can't check, but I heard that SPEC uses x264's
> code WITHOUT the SIMD assembly compiled in (whether by disabling ASM intentionally or by not compiling with yasm
> or whatever, no idea). If that is true, it probably grossly changes the character of the workload, because
> compiler autovectorization tends to fail at vectorizing multimedia integer algorithms. Normally,
> x264 spends something like 55% of its time (a figure given by the devs 8-9 years back) in hand-tuned x86 SIMD.
>
> (Interestingly, you would think that should push a single thread to exploit the
> execution units more fully, yet it still gains from SMT/HT...)
>
>
> > A quick look at the benchmark description shows them using -bitrate 1000.
> > If that's 1 kbps (or even 1 Mbps), it's hilariously unrealistic.
>
> It has been a long time since I used anything but CRF, but the bitrate is in kbps because that is the reasonable
> unit, so that is not a problem, I think; 1000 kbps is fine (depends on resolution, naturally)
1000 kbps is way too low for any resolution today. YouTube isn't exactly the champion of quality, and they recommend 5 Mbps for 720p30 and 8 Mbps for 1080p30. You can push a bit lower if you need to save every last bit, but I've never liked the results at, say, 3.5 Mbps for 720p30. 1000 kbps is ridiculous.
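Back-of-the-envelope, in bits per pixel (the YouTube figures are the ones above; treating roughly 0.1+ bpp as comfortable for 30 fps H.264 is my rule of thumb, not anything from the SPEC docs):

# bits per pixel = bitrate / (width * height * frame rate)
def bpp(kbps, w, h, fps):
    return kbps * 1000 / (w * h * fps)

# YouTube's recommended upload bitrates, for scale:
print(f"5000 kbps @  720p30: {bpp(5000, 1280, 720, 30):.3f} bpp")   # ~0.18
print(f"8000 kbps @ 1080p30: {bpp(8000, 1920, 1080, 30):.3f} bpp")  # ~0.13

# 1000 kbps at the same resolutions:
print(f"1000 kbps @  720p30: {bpp(1000, 1280, 720, 30):.3f} bpp")   # ~0.036
print(f"1000 kbps @ 1080p30: {bpp(1000, 1920, 1080, 30):.3f} bpp")  # ~0.016

Whatever resolution the benchmark's clip actually is, 1000 kbps sits at a small fraction of the bits per pixel YouTube already treats as the floor for watchable 30 fps uploads.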
> > Now take AT Bench's POV-Ray scores for the i5-6600K (1741, 3.6 GHz all-core turbo)
> > and the i7-6700K (2419, 4.2 GHz all-core turbo). Scaling down the 6700K's score to account
> > for the clock speed difference gives 2073, so SMT scaling would be roughly 1.19x.
> >
> > 511.povray_r: 1.0291x scaling.
> >
> > What's going on?
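To spell out the arithmetic behind that 1.19x figure (same published numbers as above, nothing new added):

score_6600k = 1741          # i5-6600K, 4c/4t, 3.6 GHz all-core turbo (AT Bench)
score_6700k = 2419          # i7-6700K, 4c/8t, 4.2 GHz all-core turbo (AT Bench)

# Normalize the 6700K result down to the 6600K's clock...
clock_adjusted = score_6700k * (3.6 / 4.2)     # ~2073
# ...and treat the remaining gap as the SMT contribution (this ignores the
# 6700K's larger L3 and any other differences between the two parts).
smt_scaling = clock_adjusted / score_6600k     # ~1.19

print(f"clock-adjusted 6700K: {clock_adjusted:.0f}  ->  SMT scaling ~{smt_scaling:.2f}x")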
> >
> > Also, some SPEC numbers make it seem like negative SMT scaling is common. It's not. I've
> > personally never seen an application that can use all available threads do worse when
> > SMT is enabled. Can we stop looking at the irrelevant pile of garbage that is SPEC?
> >
> > And what makes you claim "Cinebench has long dependency chains"? How do you know the SMT scaling comes from that
> > rather than from hiding cache misses better? Because Cinebench (R20; I have not tested R23) does suffer from L1/L2
> > cache misses. In ST mode, execution spends 9% of cycles stalled with an L1D miss pending. You get about
> > 16.4 L1D MPKI if you count loads hitting the fill buffer as misses. And the L2 hit rate is around 50%.
>
>
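For anyone who wants to sanity-check those Cinebench figures, the definitions are the only thing that carries over; the counter values below are made-up placeholders chosen to reproduce the 16.4 MPKI and 9% numbers, not my actual measurements.

def mpki(misses, instructions):
    """Misses per thousand retired instructions."""
    return misses / (instructions / 1000)

def stall_share(stalled_cycles, total_cycles):
    """Fraction of cycles spent stalled with (e.g.) an L1D miss pending."""
    return stalled_cycles / total_cycles

# Hypothetical counter readings from a profiling run:
instructions   = 100_000_000_000
l1d_misses     = 1_640_000_000   # counting fill-buffer hits as misses, as above
stalled_cycles = 9_000_000_000
total_cycles   = 100_000_000_000

print(f"{mpki(l1d_misses, instructions):.1f} L1D MPKI")            # 16.4
print(f"{stall_share(stalled_cycles, total_cycles):.0%} stalled")  # 9%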