Rob Thorpe and David Kanter for Real World Technologies recently had the opportunity to sit down with Gary Carleton from Intel to discuss compilers for IA-32 and IA-64. Following is the transcript from that discussion:
RWT: What is the difference in the size of binaries between x86 and IPF?
Gary: The 32-bit instruction set (IA-32) is denser and variable length; this confers an advantage in encoding. The Itanium(R) processor family must deal with 64-bit constants; its encoding is fixed length. It also has many special features. In the Itanium processor family the instruction set gives the compiler more opportunity to communicate details about how the instruction executes (ex: branch hints, predication references…) and the compiler must support these.
This is related to Profile Guided Optimizations (PGO), which can make a big difference to the usefulness of these features. In general, PGO tends to make more of a difference to overall performance on the Itanium processor family than on IA-32. It also helps certain types of applications (databases, operating systems) more than others. We are also investigating ways to make it easier for the developer to use. Currently PGO involves three steps: an initial instrumented compilation, a run of the instrumented app that generates an execution profile, and a subsequent compilation that uses the profile data.
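To make the three PGO steps concrete, here is a minimal hand-written C sketch of the idea. It is not how the Intel compiler instruments code; the `branch_profile` struct and both function names are hypothetical, and a real instrumented build inserts its counters automatically and writes them to a profile file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-branch counters, standing in for the execution
 * profile that an instrumented build would record. */
struct branch_profile {
    unsigned long taken;
    unsigned long not_taken;
};

/* Stand-in for the instrumented run: the function does its normal
 * work and also counts which way its one branch went. */
int classify_instrumented(int x, struct branch_profile *p) {
    assert(p != NULL);
    if (x < 0) {
        p->taken++;
        return -1;
    }
    p->not_taken++;
    return 1;
}

/* Stand-in for the profile-using compile: a yes/no fact that a
 * compiler could derive from the counters, for example to decide
 * which side of the branch to lay out as the fall-through path. */
int branch_is_mostly_taken(const struct branch_profile *p) {
    assert(p != NULL);
    return p->taken > p->not_taken;
}
```

Feeding `classify_instrumented` a representative set of inputs and then calling `branch_is_mostly_taken` mirrors the run-then-recompile cycle: the quality of the answer depends entirely on how representative the training run was.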
RWT: I’ve noticed that profile guided optimization seems to work very well in some cases and not at all in others. What is your experience?
Gary: We categorize codes into two general sorts: loop-oriented code, such as numerical code, and branchy code that uses lots of conditionals and subroutines, such as databases and operating systems. We have found that PGO works best on codes that contain lots of branches and status checking, and on other codes that don’t have specific performance bottlenecks, as opposed to codes that sit in tight loops for a significant amount of time.
RWT: Have you thought about using runtime optimizations? Like HP’s Dynamo project?
Gary: We currently do not support runtime optimizations, but we are always exploring new technologies and techniques that will consistently improve the performance of our compiler.
RWT: Out of Intel’s two architectures which do you concentrate on?
Gary: We have different code generator groups for IA-32 and the Itanium processor family, and we don’t really concentrate more on one versus the other. The rest of the compiler developers (front end, scalar optimizations, interprocedural optimizations, PGO) don’t focus on specific architectures.
RWT: What are the differences in working on out-of-order machines such as the P4 compared to in-order machines like IPF?
Gary: OoO makes some things easier, since you can concentrate less on scheduling. However, it is far harder to visualize and predict the behavior of an OoO processor.
RWT: Does the scheduler know about OoO?
Gary: Yes, the scheduler knows about and uses OoO execution.
RWT: Which is more interesting to work on?
Gary: IPF, because of the new features such as speculation and predication. Predication is an architectural feature in which instructions are conditionally executed based on the setting of a Boolean register in the processor. This allows conditional execution, like if statements, to be based on register values instead of conditional branches, which can cause perturbations in the fetching of the instruction stream.
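The predication idea described above can be approximated at the source level. The following C sketch is only an analogy: C has no way to name Itanium predicate registers, the function names are hypothetical, and whether a given compiler emits a branch for either form depends on the target and the optimization settings.

```c
#include <assert.h>
#include <stddef.h>

/* Branchy form: each element is tested and the increment is skipped
 * or executed via a conditional branch (in a naive translation). */
size_t count_above_branchy(const int *v, size_t n, int threshold) {
    assert(v != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > threshold) {
            count++;
        }
    }
    return count;
}

/* Predicate-style form: the comparison produces a Boolean value
 * (0 or 1) that is used as data, so the loop body contains no
 * conditional control flow at all. */
size_t count_above_predicated(const int *v, size_t n, int threshold) {
    assert(v != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (size_t)(v[i] > threshold);
    }
    return count;
}
```

Both functions return the same result. The second shows the shape of code after the condition has been turned from control flow into a value, which is roughly what a predicating compiler does when it converts an if statement.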
RWT: Which would you say was the most interesting project you’ve worked on at Intel?
Gary: I would say either the original Pentium or the first days of the Itanium. Both were very different from what was around at the time. The Pentium introduced superscalar execution to x86, but the 2nd pipeline (the V-pipe) was restricted in what it could execute. We had to make our own compilers cope with this, and also help other vendors use it properly. The Pentium Pro was an unbalancing time; it was the first Intel x86 OoO processor. There are heuristics such as the 4-1-1 rule to predict how it behaves, but it isn’t certain. I’m now working on the VTune performance analyzer for both IPF and x86.
RWT: How much growth do you see in both architectures in the future?
Gary: I’ve been constantly surprised by the capability of architects to come up with new ideas. Both have a long future.
RWT: Have you thought about using multiple threads for speculative precomputation on IPF?
Gary: We haven’t really thought about it for IPF. The problems are the same with any form of speculation; you have to be careful not to pollute the bus or cache.
RWT: How much time do you spend on low-level versus high-level optimizations?
Gary: We are split into groups that deal with various parts of the compiler, such as code generation and profile guided optimizations. The split between high and low level optimizations is about 50/50. We still spend time on low level optimizations such as code generation, scalar optimizations, and interprocedural optimizations. Some of the high level optimization techniques, such as PGO, enable low level optimizations so the split between high and low level is not always straightforward.
RWT: How does compiler optimization for branch-oriented code differ from compiler optimization for loop oriented code?
Gary: Optimizations for branch-oriented code revolve mostly around trying to make efficient use of the instruction cache and instruction TLB, and increasing the accuracy of branch prediction. Much of the idea is to place together all code that is frequently executed so that extra instruction cache space (and bus bandwidth) isn’t spent dealing with instructions that aren’t likely to be executed. All our compilers have heuristics that try to do this, but PGO is the technique that best enables these optimizations.
With loop oriented code the focus isn’t as much on getting instructions efficiently into the CPU as it is on simply performing code transformations that make loops run faster: loop interchange to make data access patterns more efficient, better instruction selection for loops, loop unrolling, instruction scheduling…
There are some optimizations that can apply to both (ex: inlining), but the main difference with branchy code is focusing on not wasting processor resources by dealing with code that is infrequently executed.
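Loop interchange, one of the loop transformations named above, can be shown with a small C sketch. The function names are hypothetical, and the matrix is assumed to be stored row by row in a flat array, as C stores two-dimensional arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Before interchange: the inner loop walks down a column, so
 * consecutive accesses are `cols` elements apart in memory. */
long sum_column_walk(const int *a, size_t rows, size_t cols) {
    assert(a != NULL || rows * cols == 0);
    long total = 0;
    for (size_t j = 0; j < cols; j++) {
        for (size_t i = 0; i < rows; i++) {
            total += a[i * cols + j];
        }
    }
    return total;
}

/* After interchange: the two loops are swapped, so the inner loop
 * walks along a row and touches adjacent memory locations. */
long sum_row_walk(const int *a, size_t rows, size_t cols) {
    assert(a != NULL || rows * cols == 0);
    long total = 0;
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            total += a[i * cols + j];
        }
    }
    return total;
}
```

The two versions compute the same sum; only the order of memory accesses changes. The interchange is legal here because each iteration is independent of the others apart from the integer accumulation, which can be reordered freely.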
RWT: Which is harder to implement and why?
Gary: Given my recent discussions with other compiler engineers, there is probably a difference of opinion here, but my opinion is that branchy optimizations are harder, probably because it is harder to draw conclusions about how much a particular optimization will help. The effects of instruction cache pollution (our term for when non-executed code ends up in the instruction cache, ex: the else clause of a normally true if statement) are hard to generalize across many cases. It’s too app and system dependent, which makes it hard to know whether a particular optimization idea will really make any difference.
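The "else clause of a normally true if statement" case can be sketched by hand in C. This is an illustration under assumptions, not the Intel compiler's mechanism: the `UNLIKELY` macro wraps `__builtin_expect`, a GCC and Clang extension, and falls back to a no-op elsewhere; the function names are hypothetical. A PGO build would aim to make the same layout decision from measured branch counts rather than from a programmer's hint.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: __builtin_expect is available only on GCC-compatible
 * compilers; on anything else the hint degrades to the bare test. */
#if defined(__GNUC__)
#define UNLIKELY(cond) __builtin_expect(!!(cond), 0)
#else
#define UNLIKELY(cond) (cond)
#endif

/* Rarely executed path, written as a separate function so that its
 * instructions need not sit in the middle of the hot loop. */
static void note_bad_record(int *error_count) {
    *error_count += 1;
}

/* Hot loop: negative records are assumed to be rare, so the common
 * case is the plain addition. */
int sum_valid_records(const int *records, size_t n, int *error_count) {
    assert(error_count != NULL);
    assert(records != NULL || n == 0);
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(records[i] < 0)) {
            note_bad_record(error_count);
        } else {
            total += records[i];
        }
    }
    return total;
}
```

Whether this helps in practice is exactly the uncertainty described above: the hint is only correct if negative records really are rare for the workload in question.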
RWT: How can PGO be made easier for developers etc.?
Gary: I was hoping you could answer this one for me. :-)
RWT: We’ll work on that one for a bit…
RWT: There have been several complaints about the number of bugs in the Intel compilers. What is being done about it?
Gary: We are constantly working to improve the quality of our compiler. We have a web-based interface where users can submit bugs. These bugs go to the support group, and are then passed to a group we call “technical consulting engineers”, who try to replicate the bug. If it’s simple they will fix it. If not then it goes to the developers. We are improving processes for robustness all the time.
RWT: In the future will the compiler add temporary variables to reduce dependencies?
Gary: I thought we did that. Yes, we are trying to do that in the general case.
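One common form of adding temporaries to reduce dependencies is splitting a single accumulator into several. The C sketch below does this by hand with hypothetical function names; it is meant to show the kind of transformation being discussed, not what the Intel compiler actually generates.

```c
#include <assert.h>
#include <stddef.h>

/* Single accumulator: every addition depends on the result of the
 * previous one, forming one long dependency chain. */
long sum_single_chain(const int *v, size_t n) {
    assert(v != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}

/* Two temporaries: even and odd elements accumulate separately, so
 * the two chains of additions are independent of each other and can
 * overlap in the processor. */
long sum_two_temporaries(const int *v, size_t n) {
    assert(v != NULL || n == 0);
    long even = 0;
    long odd = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        even += v[i];
        odd += v[i + 1];
    }
    if (i < n) {
        even += v[i]; /* leftover element when n is odd */
    }
    return even + odd;
}
```

For integers the two versions give the same result as long as the sums do not overflow. For floating point the reordering can change rounding, which is one reason a compiler may only do this in restricted cases.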
RWT: What language are Intel’s compilers written in?
Gary: They are written in C and C++.