By: Travis Downs (travis.downs.delete@this.gmail.com), August 12, 2019 8:15 am
Room: Moderated Discussions
> The main reason why gcc/clang compilation scales well with multiple x86 cores is that it is performing
> a lot of repetitive work. If proper caching were implemented, preventing repetitive work at sub-file
> granularity, then compiler jobs wouldn't scale that well, even in the case of many-file builds like the Linux
> kernel, except maybe for the very first build ever on the machine. Even for the very first build
> of the Linux kernel on a machine, it would likely be possible to somewhat speed up the build
> and avoid some local work by concurrently downloading data from a world-wide compiler cache.
>
> Amdahl's law does not apply to the scaling of suboptimal algorithms which can easily
> fabricate near-linear multi-core scaling by performing redundant computations.
>
> Assuming "occ" is an optimal C compiler executable, if "occ foo.c" takes 1 second on a clean/pristine
> machine and "occ bar.c" also takes 1 second on a clean machine and if foo.c and bar.c have something
> in common (i.e., bar.c's structure compresses well assuming foo.c's structure is used to initialize
> the compressor's dictionary), then compiling the two files serially on a clean machine by running
> "occ foo.c; occ bar.c" will take 1.5 seconds. Compiling the two files in parallel on a dual-core x86
> CPU by running "occ foo.c & occ bar.c & wait" cannot complete faster than in 1 second, which yields
> a dual-core scaling of 50%. -- Because of suboptimal algorithms in gcc/clang, compiling the two files
> serially on a clean machine by running "cc foo.c; cc bar.c" will take 2 seconds, which miraculously
> yields 100% scaling when we run "cc foo.c & cc bar.c & wait" on a dual-core x86 CPU.
>
> -atom
Sure, but once we're in the (currently hypothetical) world where C compilers take advantage of all the inter-TU redundancy, why would you assume that the only parallelism they can exploit is between TUs? Such an 'optimal' compiler could also parallelize in a fine-grained way within a single TU.
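To make "fine-grained within a TU" a bit more concrete, here's a toy model (just an illustration of the shape of the parallelism, not how gcc/clang are actually organized): treat each function in the TU as an independent optimize/codegen task and fan those tasks out across cores, leaving only the final merge serial.

from concurrent.futures import ProcessPoolExecutor

def codegen(func_src):
    # Stand-in for the expensive, largely independent per-function
    # optimization/codegen work a real compiler would do here.
    return "; %d bytes of fake object code" % (2 * len(func_src))

def compile_tu(functions):
    # Per-function tasks run on all cores; joining the results back
    # in the original order is the (small) serial part.
    with ProcessPoolExecutor() as pool:
        return "\n".join(pool.map(codegen, functions))

if __name__ == "__main__":
    print(compile_tu(["int f(void) { return 1; }",
                      "int g(void) { return 2; }"]))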
Of course, I don't think we'll get such fine-grained parallelism any time soon, but I don't think we'll get the other parts of 'occ' either, so process-per-TU or thread-per-TU will remain a pretty good model for large C projects.
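(For reference, the process-per-TU model is essentially what "make -j" already gives you; here's a minimal sketch of such a driver, assuming a cc on PATH and a list of .c files as arguments.)

import subprocess, sys
from concurrent.futures import ThreadPoolExecutor

def compile_one(src):
    # One compiler process per translation unit; threads are fine here
    # because the actual CPU work happens inside the cc child process.
    return subprocess.run(["cc", "-c", src]).returncode

def build(sources, jobs=4):
    # Return 0 only if every TU compiled cleanly.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        failed = any(rc != 0 for rc in pool.map(compile_one, sources))
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(build(sys.argv[1:]))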
My personal feeling is that almost any CPU-bound load of at least tens of milliseconds is parallelizable. It's just a matter of how hard it is. Most "single threaded loads" are not single threaded due to a fundamental algorithmic constraint, but because nobody has parallelized them, perhaps because it isn't worth it or nobody has gotten around to it.
You really have to go out of your way to find a load that truly can't be parallelized: something like gzip decompression of a single large file. Even that is just a consequence of how gzip is designed (a single DEFLATE stream whose back-references reach into the preceding window, so no part of the file can be decoded without the bytes before it); another more or less equivalent format could support parallel decompression with a few changes.
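As an illustration of the "few changes" (a sketch only, not any particular existing format): suppose a hypothetical blocked-gzip variant where the file is just a concatenation of independently compressed gzip members plus a list of member offsets, roughly the trick used by bgzf-style blocked gzip. Since no back-reference crosses a block boundary, decompression parallelizes trivially.

import gzip
from concurrent.futures import ProcessPoolExecutor

BLOCK = 1 << 20  # 1 MiB of uncompressed data per independent gzip member

def compress_blocked(data):
    # Each block becomes a self-contained gzip member, so no DEFLATE
    # back-reference ever crosses a block boundary.
    members, index, off = [], [], 0
    for i in range(0, len(data), BLOCK):
        m = gzip.compress(data[i:i + BLOCK])
        index.append((off, len(m)))
        members.append(m)
        off += len(m)
    return b"".join(members), index

def _decode(args):
    blob, off, length = args
    return gzip.decompress(blob[off:off + length])

def decompress_blocked(blob, index):
    # Independent members can be decoded on all cores at once.
    with ProcessPoolExecutor() as pool:
        return b"".join(pool.map(_decode, [(blob, o, n) for o, n in index]))

if __name__ == "__main__":
    data = b"the quick brown fox jumps over the lazy dog " * 200000
    blob, index = compress_blocked(data)
    assert decompress_blocked(blob, index) == data

The concatenated members are still something plain gunzip will decompress serially; all a stock .gz stream is missing is the guarantee of independent blocks and an index to find them.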