By: Eugene Nalimov (enal.delete@this.google.-dot-.com), November 14, 2009 8:24 pm
Room: Moderated Discussions
I spent 8 years working on the Visual C++ back-end, and things are not as bad as they may seem. With very reasonable effort it is possible to speed up compilation by a factor of ~2.5x on a 4-core CPU, without a major redesign, in an existing compiler codebase. Memory use would grow only insignificantly.
The major problems are caused by the non-thread-safety of a lot of the existing data structures. I.e., the problems are not with the algorithms but with the particular implementation. Years ago, when the initial compiler code was written, nobody thought about compiling several functions in parallel. Then we added LTCG to the compiler, and the amount of code that is compiled in one compiland suddenly grew. I remember debugging a compiler crash that came from a customer -- we hit an assert after 16 hours of work, while compiling function #250,000 (that was the debug version of the compiler, several times slower than the production one, so the customer had a better experience...).
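To give a flavor of the kind of change involved (a generic sketch only, assuming a shared name table guarded by a mutex; not actual Visual C++ code, which I cannot show here): the algorithm stays the same, only the shared state has to become safe to touch from several worker threads.

#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical shared table, written back when a single compilation
// thread was assumed. The fix is not a new algorithm, just a lock
// (or a per-thread copy) around the shared state.
struct NameTable {
    std::unordered_map<std::string, int> ids;
    std::mutex m;                                // added for thread safety

    int intern(const std::string& name) {
        std::lock_guard<std::mutex> lock(m);     // added for thread safety
        auto it = ids.find(name);
        if (it != ids.end()) return it->second;
        int id = static_cast<int>(ids.size());
        ids.emplace(name, id);
        return id;
    }
};

// With structures like this made safe, independent functions can be
// handed out to a pool of worker threads, one function at a time.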
Further speedups and better scaling are harder because of Amdahl's law, but they are possible, too.
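(Rough Amdahl's-law arithmetic behind those numbers, assuming the usual model rather than any measurement: a ~2.5x speedup on 4 cores corresponds to roughly 80% of the compile time running in parallel, which also caps the ideal speedup at 5x no matter how many cores you add.)

    S = 1 / ((1 - p) + p/N)
    p = 0.8, N = 4:   S = 1 / (0.2 + 0.8/4) = 1 / 0.4 = 2.5
    N -> infinity:    S -> 1 / (1 - p) = 5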
Unfortunately I cannot go into details in the public forum...
Thanks,
Eugene
rwessel (robertwessel@yahoo.com) on 11/12/09 wrote:
---------------------------
>? (0xe2.0x9a.0x9b@gmail.com) on 11/12/09 wrote:
>---------------------------
>>Potatoswatter (potswa_m@c.com) on 11/11/09 wrote:
>>---------------------------
>>>It sounds more like you're talking about pipelining the compiler: one thread parses
>>>while another thread generates code. Unfortunately that only works when you know
>>>the relative speeds of the pipeline stages.
>>
>>Couldn't that (at least partially) be solved by inserting a buffer between the pipeline stages? Like:
>>
>>[parser thread] ---> [buffer] ---> [compiler thread]
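As an illustration of that buffer, a minimal sketch assuming a bounded queue between a parser thread and a code-generation thread (generic C++, not taken from any particular compiler):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Hypothetical work item: one parsed function handed to code generation.
struct ParsedFn { std::string name; };

struct BoundedBuffer {
    std::queue<ParsedFn> q;
    std::mutex m;
    std::condition_variable not_full, not_empty;
    const size_t cap = 64;   // buffer size smooths out uneven stage speeds
    bool done = false;

    void push(ParsedFn f) {
        std::unique_lock<std::mutex> lk(m);
        not_full.wait(lk, [&]{ return q.size() < cap; });
        q.push(std::move(f));
        not_empty.notify_one();
    }
    bool pop(ParsedFn& out) {
        std::unique_lock<std::mutex> lk(m);
        not_empty.wait(lk, [&]{ return !q.empty() || done; });
        if (q.empty()) return false;              // producer finished
        out = std::move(q.front()); q.pop();
        not_full.notify_one();
        return true;
    }
    void finish() {
        std::lock_guard<std::mutex> lk(m);
        done = true;
        not_empty.notify_all();
    }
};

int main() {
    BoundedBuffer buf;
    std::thread parser([&]{                       // "parser" stage
        for (int i = 0; i < 1000; ++i)
            buf.push(ParsedFn{"fn" + std::to_string(i)});
        buf.finish();
    });
    std::thread codegen([&]{                      // "compiler" stage
        ParsedFn f;
        while (buf.pop(f)) { /* generate code for f here */ }
    });
    parser.join();
    codegen.join();
}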
>
>
>Yes, some buffering would go a long way to allowing that sort of structure to work.
>The problem is that you still need to be able to break up the work into relatively
>even sized pieces. Let's say, for example, that the parser in the above did 10
>units of work, while the compiler did 1000. The parallelism is simply not going
>to buy you much (about 1%, in fact). And parsing is typically a pretty small part of a modern compilation process.
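A quick check of that figure, assuming the two stages overlap perfectly so the longer one dominates:

    serial time    = 10 + 1000 = 1010 units
    pipelined time = max(10, 1000) = 1000 units
    speedup        = 1010 / 1000 = 1.01, i.e. about 1%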
>
>Worse, most things compilers do these days *don't* organize well as processes on
>streams. Now that definitely used to be the case, and many compilers basically
>were dozens of passes each reading a sequential input from the prior pass and writing
>a sequential output for the next pass. Parallelizing that sort of thing with the above structure would work well.
>
>Unfortunately these days compilers tend to build a single big structure for the
>entire program and then work on that as a whole, and the "passes" are far, far less
>distinct. In fact, typically the only fairly independent passes are the "parse"
>and "code generation" steps (and a preprocessor, if your language includes that).
>And of those two or three passes, only code generation is a real CPU hog these
>days. Most I/O ends up happening in the parsing (or preprocessor) phases; code
>generation typically just does a big sequential write of the object file at the
>end after mucking around with the internal representation of the program for a long while.
>
>While it might appear that different routines (functions, procedures, or whatever
>your language calls them) could reasonably go through code generation and optimization
>independently, that's typically not possible since all modern compilers do things
>like inlining and inter-procedure optimizations. And it's those kinds of optimizations
>that are particularly CPU intensive.
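A tiny example of why that breaks per-function independence (hypothetical code, any optimizing compiler):

    static int square(int x) { return x * x; }   // likely inlined into its caller
    int poly(int x) { return square(x) + 1; }    // poly's generated code depends on square's body

Generating code for poly without looking at square forfeits the inlining, so the back-end has to process related functions together (or in call-graph order), which is exactly the part that resists naive parallelization.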
>
>Now that doesn't mean that real projects can't use parallelism during the build
>process. Any parallel make tool will do that for you, and is probably already being
>used on large projects. Unfortunately Amdahl bites us at this point and often one
>of the biggest consumers of (wall clock) time is then the link step (and frankly
>linker performance is often astonishingly slow).
>
>This is made worse by the increasing use of link time code generation, which leaves
>little more than the parsing step actually happening in the "compiler" invocation.
>And basically the whole point of LTCG is to expose *more* inter-procedure optimization
>possibilities to the compiler, so that the LTCG step now has even more of the most expensive work to do.
>
>Now a project that generates multiple binaries still gets considerable benefit
>from a parallel make, so long as there are no dependencies between the links (in
>Windows, for example, a build of a DLL often generates an import library which is
>a dependency for another executable, which can bite).
>
>But in general, I think the compilation of a single program is one of the least profitable places to apply parallelism.
>