By: ⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com), August 12, 2019 12:51 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 10, 2019 12:13 pm wrote:
> I'm a big proponent of single-thread performance in general, because a lot
> of real-world problems really do end up being fairly limited by Amdahl.
>
> So you'll find me often talking up single-core performance,
> and I absolutely despise the "flock of chickens" machines.
>
> But realistically, Zen 2 is clearly in the "good enough" territory for anything I do on that front.
> It will open that huge pdf file without me twiddling my thumbs, even when that's almost entirely
> single-threaded. And once something performs well enough, all that I really do is build the kernel.
> Which is actually ludicrously well parallelized - more so than most other projects are.
>
> So I think single-thread performance is king, but I also know that the only thing I personally do
> doesn't really care all that deeply. We've got a couple of link stages and a few other serialized parts,
> but the really expensive parts when I do a full re-build can easily use hundreds of cores.
>
> I just don't think that because I can use hundreds of cores that that
> is necessarily a good fit for a lot of other real-life problems.
>
> From a performance standpoint I could easily use server-class machines (or something like Threadripper).
> Or even a farm. It's just that I also want it to be quiet and a convenient form factor, and easily
> available. If I can't buy the parts at the local Fry's or with two-day shipping off Amazon, I'm
> just not interested. You can keep your bespoke stuff. I believe in mass market.
>
> Linus
The main reason why gcc/clang compilation scales well across multiple x86 cores is that the compilers perform a lot of repetitive work. If proper caching were implemented to prevent repetitive work at sub-file granularity, compiler jobs wouldn't scale that well, even for many-file builds like the Linux kernel, except perhaps for the very first build ever on the machine. And even for the very first build of the Linux kernel on a machine, it would likely be possible to speed the build up somewhat and avoid some local work by concurrently downloading data from a world-wide compiler cache.
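For concreteness, here is roughly how far today's off-the-shelf tooling gets toward that idea. This is only a sketch of file-granularity caching (ccache locally, sccache with a shared backend), not the sub-file caching described above, and the backend configuration is omitted because it is environment-specific:

  # Local, per-translation-unit cache (ccache). This is file granularity,
  # so it only helps when an entire preprocessed translation unit repeats.
  export CCACHE_DIR=$HOME/.ccache
  make CC="ccache gcc" -j"$(nproc)"
  ccache -s        # show hit/miss statistics

  # Shared cache across machines: sccache can be pointed at a remote
  # storage backend so other machines' compile results can be reused
  # (closest existing analogue to a "world-wide compiler cache").
  make CC="sccache gcc" -j"$(nproc)"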
Amdahl's law does not apply to the scaling of suboptimal algorithms, which can easily fabricate near-linear multi-core scaling by performing redundant computations.
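For reference, the usual statement of Amdahl's law (my notation) for a serial fraction s on N cores is:

  speedup(N) = 1 / (s + (1 - s) / N)

The measured speedup, however, is a ratio against a single-core baseline. If that baseline is inflated by redundant work which the parallel jobs simply repeat on every core, the ratio can approach N even though the amount of useful work completed per core hasn't improved at all.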
Assuming "occ" is an optimal C compiler executable, if "occ foo.c" takes 1 second on a clean/pristine machine and "occ bar.c" also takes 1 second on a clean machine and if foo.c and bar.c have something in common (i.e: bar.c's structure compresses well assuming foo.c's structure is used to initialize the compressor's dictionary), then compiling the two files serially on a clean machine by running "occ foo.c; occ bar.c" will take 1.5 seconds. Compiling the two files in parallel on a dual-core x86 CPU by running "occ foo.c & occ bar.c & wait" cannot complete faster than in 1 second, which yields a dual-core scaling of 50%. -- Because of suboptimal algorithms in gcc/clang, compiling the two files serially on a clean machine by running "cc foo.c; cc bar.c" will take 2 seconds, which miraculously yields 100% scaling when we run "cc foo.c & cc bar.c & wait" on a dual-core x86 CPU.
-atom