By: Simon Farnsworth (simon.delete@this.farnz.org.uk), June 7, 2022 2:47 am
Room: Moderated Discussions
Doug S (foo.delete@this.bar.bar) on June 6, 2022 10:44 am wrote:
> rwessel (rwessel.delete@this.yahoo.com) on June 6, 2022 5:26 am wrote:
> > Peter Lewis (peter.delete@this.notyahoo.com) on June 6, 2022 3:55 am wrote:
> > > > There's PGO of course, but that really only works for interpreted or jitted languages.
> > >
> > > The Intel C/C++ Compiler has Profile-Guided Optimization (PGO).
> > > Have you had some bad experience with PGO for compiled languages?
> >
> >
> > The fundamental problem with PGO is that it's (probably fundamentally) too hard to use the
> > vast majority of the time, at least with languages compiled in the traditional way.
> >
> > Unless you have simple to generate (and maintain!*) training datasets, have a very small area
> > where you can separately apply PGO (and can make the training sets small enough to make them
> > maintainable), or you can afford to invest in a quite large infrastructure to use and maintain
> > PGO (because, say, you have an absurd number of machines on which you're going to run this code
> > - consider Google), PGO is just too hard to use, and so is useless 99% of the time.
> >
> > As Peter pointed out, JIT'd languages can take advantage of PGO as well.
> >
> > *Code with limited lifespan, or code you somehow know isn't going to
> > change in the future, can reduce the maintenance requirements here.
>
>
> Has anyone ever tried having the CPU's branch predictor collect info
> the OS can use to 'update' binaries with branch prediction info?
>
> Having data generated by the end user tweak their binaries' branch prediction seems like a
> better solution than having the developer try to come up with training data for the PGO phase. I know
> PGO does more than just predict branches, but in this case I'm limiting it to the topic at hand.
>
> I'm not sure how it would be implemented - you probably wouldn't actually update the binary
> itself (it may be on read only storage or shared by others) so there would need to be some
> sort of auxiliary data file (maybe stored somewhere like /var/lib or in the user's home directory
> / profile?) that would be used to tweak things when the executable is loaded.
>
> Just an idle thought here, I haven't really considered it for more than the few minutes that
> it took to write this post, so I could be missing some really big gotchas with this idea!
As background, the AutoFDO paper (https://research.google/pubs/pub45290/) and the open-source artifacts from that research at https://gcc.gnu.org/wiki/AutoFDO (for GCC) and https://github.com/google/autofdo (for LLVM) show that you can use the hardware's performance counters (sampled via the Linux perf tool) to create profiles for PGO, without an instrumented build or hand-maintained training datasets.
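As a rough sketch of that workflow (assuming the `create_gcov` converter from the google/autofdo repository, a perf-capable x86 host with LBR support, and placeholder names `myprog` / `myprog.gcov`):

```shell
# Sample the real workload with last-branch-record (LBR) data;
# no instrumented build is needed, just the production binary.
perf record -b -e cycles:u -- ./myprog typical-input

# Convert the perf.data samples into a GCC-readable profile.
create_gcov --binary=./myprog --profile=perf.data \
            --gcov=myprog.gcov --gcov_version=1

# Recompile, letting GCC use the sampled profile for layout,
# inlining, and branch-weight decisions.
gcc -O2 -fauto-profile=myprog.gcov -o myprog.opt myprog.c
```

For LLVM the equivalent converter is `create_llvm_prof` and the Clang flag is `-fprofile-sample-use=`; either way, the "training data" is just samples from whatever the end user actually runs.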
From there, you can go to BOLT (https://research.facebook.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/), which rewrites existing binaries (with debug information) using profile data, rather than feeding the profile backwards through the toolchain to the compiler. BOLT can perform a few optimizations the compiler misses, but most of its benefit comes from rearranging the binary's code layout to be much more cache-friendly, including making it much more TLB-friendly.
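A minimal BOLT pass over the same binary might look like this (a sketch assuming an llvm-bolt build with its `perf2bolt` converter; binary names are placeholders, and exact optimization flags vary between BOLT versions):

```shell
# Sample with LBR again; BOLT wants branch data from a real workload.
perf record -e cycles:u -j any,u -o perf.data -- ./myprog typical-input

# Aggregate the raw samples into BOLT's profile format.
perf2bolt -p perf.data -o myprog.fdata ./myprog

# Rewrite the binary in place of a recompile: reorder basic blocks
# and functions for i-cache/i-TLB locality, splitting out cold code.
llvm-bolt ./myprog -o myprog.bolt --data=myprog.fdata \
    --reorder-blocks=ext-tsp --reorder-functions=hfsort \
    --split-functions
```

This is exactly the "update the binary from end-user profile data" idea upthread, done offline: the original stays untouched and the optimized copy (`myprog.bolt`) is what gets deployed.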