By: rwessel (rwessel.delete@this.yahoo.com), June 7, 2022 7:46 am
Room: Moderated Discussions
Mark Roulo (nothanks.delete@this.xxx.com) on June 7, 2022 6:40 am wrote:
> rwessel (rwessel.delete@this.yahoo.com) on June 6, 2022 2:59 pm wrote:
> > Mark Roulo (nothanks.delete@this.xxx.com) on June 6, 2022 2:10 pm wrote:
> > > rwessel (rwessel.delete@this.yahoo.com) on June 6, 2022 11:57 am wrote:
> > > > Doug S (foo.delete@this.bar.bar) on June 6, 2022 10:44 am wrote:
> > > > > rwessel (rwessel.delete@this.yahoo.com) on June 6, 2022 5:26 am wrote:
> > > > > > Peter Lewis (peter.delete@this.notyahoo.com) on June 6, 2022 3:55 am wrote:
> > > > > > > > There's PGO of course, but that really only works for interpreted or jitted languages.
> > > > > > >
> > > > > > > The Intel C/C++ Compiler has Profile-Guided Optimization (PGO).
> > > > > > > Have you had some bad experience with PGO for compiled languages?
> > > > > >
> > > > > >
> > > > > > The fundamental problem with PGO is that it's (probably fundamentally) too hard to use the
> > > > > > vast majority of the time, at least with languages compiled in the traditional way.
> > > > > >
> > > > > > Unless you have simple to generate (and maintain!*) training datasets, have a very small area
> > > > > > where you can separately apply PGO (and can make the training sets small enough to make them
> > > > > > maintainable), or you can afford to invest in a quite large infrastructure to use and maintain
> > > > > > PGO (because, say, you have an absurd number of machines on which you're going to run this code
> > > > > > - consider Google), PGO is just too hard to use, and so is useless 99% of the time.
> > > > > >
> > > > > > As Peter pointed out, JIT'd languages can take advantage of PGO as well.
> > > > > >
> > > > > > *Code with limited lifespan, or code you somehow know isn't going to
> > > > > > change in the future, can reduce the maintenance requirements here.
> > > > >
> > > > >
> > > > > Has anyone ever tried having the CPU's branch predictor collecting info
> > > > > the OS can use to 'update' binaries with branch prediction info?
> > > > >
> > > > > Having data that is generated by the end user tweak their binaries' branch prediction seems like a
> > > > > better solution than having the developer try to come up with training data for the PGO phase. I know
> > > > > PGO does more than just predict branches, but in this case I'm just limiting it to the topic at hand.
> > > > >
> > > > > I'm not sure how it would be implemented - you probably wouldn't actually update the binary
> > > > > itself (it may be on read only storage or shared by others) so there would need to be some
> > > > > sort of auxiliary data file (maybe stored somewhere like /var/lib or in the user's home directory
> > > > > / profile?) that would be used to tweak things when the executable is loaded.
> > > > >
> > > > > Just an idle thought here, I haven't really considered it more than the few minutes it
> > > > > took to write this post, so I could be missing some really big gotchas with this idea!
> > > >
> > > >
> > > > Not quite the same thing, but efforts have been made to save JIT'd code for use the next time.
> > >
> > > Example here: https://docs.oracle.com/cd/E13188_01/jrockit/docs142/userguide/codecach.html
> > >
> > > Java 1.4 is quite old and I don't know if they still do this. I remember
> > > reading at the time that the overhead of *managing* the cache was slower than
> > > just re-JITing. But maybe I remember wrong or maybe things got better.
> >
> >
> > I could certainly see that being the case if the JIT-ter limited itself to relatively small
> > chunks of code at a time. OTOH, I can imagine a system where rather large pieces of code
> > are profiled and rebuilt, and where the compilation overhead would be substantial.
>
> Going from memory I think the problem was that verifying that the JIT'd object code was okay
> to use for the current run took about as long (or maybe more time) as just re-JITing.
>
> This might have been an issue specific to Java where each class gets its own file. Or because
> there weren't any "-O3" type optimizations performed so the JITing was fairly fast.
>
> I can imagine a scheme where the unit of verification was
> the JAR file and the verification was idiot simple:
>
> "This entire block of JIT'd object code is valid only if this set of JARs match these hash values"
>
> In this case, normal use might just turn into a handful of checksum calculations and verifications
> followed by running object code. Which was probably NOT what JRockit did.
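The "idiot simple" JAR-level scheme quoted above could be sketched roughly like this. All class and method names here are invented for illustration; this is not JRockit's (or any real JVM's) actual API:

```java
import java.nio.file.*;
import java.security.MessageDigest;
import java.util.*;

// Sketch of the quoted scheme: cached JIT'd object code is reused only
// if every JAR it was compiled from still matches the hash recorded
// when the cache was written. Names are hypothetical.
public class CodeCacheValidator {

    static String sha256(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Record the hash of each JAR at the time the cache is populated.
    static Map<Path, String> snapshot(List<Path> jars) throws Exception {
        Map<Path, String> manifest = new LinkedHashMap<>();
        for (Path jar : jars) manifest.put(jar, sha256(jar));
        return manifest;
    }

    // On the next run: cached code is valid only if every JAR still
    // hashes to the recorded value.
    static boolean cacheStillValid(Map<Path, String> manifest) throws Exception {
        for (Map.Entry<Path, String> e : manifest.entrySet()) {
            if (!Files.exists(e.getKey())) return false;
            if (!sha256(e.getKey()).equals(e.getValue())) return false;
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        Path jar = Files.createTempFile("app", ".jar");
        Files.write(jar, "original contents".getBytes());

        Map<Path, String> manifest = snapshot(List.of(jar));
        System.out.println("valid before change: " + cacheStillValid(manifest));

        // Simulate the JAR being rebuilt: the cached code must be discarded.
        Files.write(jar, "recompiled contents".getBytes());
        System.out.println("valid after change: " + cacheStillValid(manifest));
    }
}
```

Normal startup then really is just a handful of hash computations followed by running the cached object code; only a mismatch forces a re-JIT.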
>
> The *other* issue that Java could run into is that good JITs would do some code path specific optimization
> that would be backed out if things changed. A good example of this is that JVMs could (and would) inline
> virtual functions if there was only one implementation of that virtual function. If a second implementation
> got loaded (with a new class) then the JIT would un-do the inlining. This could chain, of course ...
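The speculative-inlining bookkeeping described in the quote above could be sketched like this. This is illustrative only (invented names, not HotSpot's or JRockit's actual mechanism): the JIT inlines a virtual call on the assumption that an interface has exactly one loaded implementor, records that assumption, and loading a second implementor invalidates any compiled code that relied on it:

```java
import java.util.*;

// Hypothetical dependency tracking for the "single implementor" inline
// assumption: loading a second implementor of an interface invalidates
// (forces deoptimization of) every compiled method that inlined through it.
public class SingleImplementorAssumptions {

    // Loaded implementors, per interface name.
    private final Map<String, Set<String>> implementors = new HashMap<>();
    // Compiled methods that assumed "this interface has one implementor".
    private final Map<String, Set<String>> dependents = new HashMap<>();
    private final Set<String> invalidated = new HashSet<>();

    void loadImplementor(String iface, String klass) {
        Set<String> impls = implementors.computeIfAbsent(iface, k -> new HashSet<>());
        impls.add(klass);
        // A second implementor breaks every inline that assumed there was one.
        if (impls.size() > 1) {
            invalidated.addAll(dependents.getOrDefault(iface, Set.of()));
        }
    }

    // Called when the JIT inlines iface's sole implementation into method.
    void recordInlineAssumption(String method, String iface) {
        dependents.computeIfAbsent(iface, k -> new HashSet<>()).add(method);
    }

    boolean isInvalidated(String method) { return invalidated.contains(method); }

    public static void main(String[] args) {
        SingleImplementorAssumptions jit = new SingleImplementorAssumptions();
        jit.loadImplementor("Shape", "Circle");
        jit.recordInlineAssumption("render()", "Shape");   // inline Circle's method
        System.out.println(jit.isInvalidated("render()")); // still monomorphic

        jit.loadImplementor("Shape", "Square");            // second implementor loads
        System.out.println(jit.isInvalidated("render()")); // must be deoptimized
    }
}
```

The chaining mentioned above falls out naturally: if `render()` was itself inlined elsewhere, invalidating it would propagate to its callers through the same dependency map.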
>
> Depending on how your Java was written, the Java classes might be loaded based
> on names in an 'ini' file so some of this verification could be tricky.
>
> Similar things could happen for global constants so a branch on a global constant could
> be eliminated because the JVM knew the value of the constant. Ahead-of-time compilation
> does this optimization, too, but the same 'issue' can arise: If this constant changes
> then LOTS of object code needs to be regenerated. Same for the JIT here.
I was actually thinking about a near(er)-AOT compilation case, what's sometimes called specialization. Something like MI code being specialized on the AS/400 (IBM i). Or like what the Mill guys are talking about. That would seem to present a reasonable opportunity to feed profile data back, and re-specialize if it looks profitable. One nice thing about that approach is that the translated code needs to deal with less dynamic change (of course, that's a disadvantage, too), and you don't have the time constraints on the compilation process, and you can spin that process off to the background (another core, or even off-hours).
I don't see why you couldn't do that with Java bytecode, but most of the work there has focused on JIT-ing much smaller chunks of code.
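The re-specialization idea could be sketched, very roughly, as a background check that compares the profile the current translation was built with against what's being observed now, and queues a recompile when they drift apart. All names and the threshold here are invented for illustration, not any real system's mechanism:

```java
import java.util.*;

// Hypothetical re-specialization trigger: recompile a code unit in the
// background when its observed branch profile drifts far enough from
// the profile the current specialized translation was built with.
public class Respecializer {

    // Drift in a branch's taken-rate beyond which a recompile looks
    // profitable (threshold chosen arbitrarily for the sketch).
    static final double DRIFT_THRESHOLD = 0.25;

    static boolean shouldRespecialize(Map<String, Double> compiledProfile,
                                      Map<String, Double> observedProfile) {
        for (Map.Entry<String, Double> e : compiledProfile.entrySet()) {
            double observed = observedProfile.getOrDefault(e.getKey(), e.getValue());
            if (Math.abs(observed - e.getValue()) > DRIFT_THRESHOLD) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Branch taken-rates recorded when the unit was last specialized.
        Map<String, Double> atCompile = Map.of("loopExit", 0.02, "errorPath", 0.01);

        // Profile observed at run time: the error path has become hot,
        // so the old specialization is likely stale.
        Map<String, Double> nowHot = Map.of("loopExit", 0.02, "errorPath", 0.60);

        System.out.println(shouldRespecialize(atCompile, atCompile)); // no drift
        System.out.println(shouldRespecialize(atCompile, nowHot));    // drifted
    }
}
```

Because there's no latency pressure, the actual recompile triggered by this check could run on another core, or off-hours, exactly as described above.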