By: Andrew Clough (someone.delete@this.somewhere.com), June 6, 2022 11:08 am
Room: Moderated Discussions
Doug S (foo.delete@this.bar.bar) on June 6, 2022 10:44 am wrote:
>
> Has anyone ever tried having the CPU's branch predictor collect info
> the OS can use to 'update' binaries with branch prediction info?
>
> Having data generated by the end user tweak their binaries' branch prediction seems like a
> better solution than having the developer try to come up with training data for the PGO phase. I know
> PGO does more than just predict branches, but in this case I'm limiting it to the topic at hand.
>
> I'm not sure how it would be implemented - you probably wouldn't actually update the binary
> itself (it may be on read only storage or shared by others) so there would need to be some
> sort of auxiliary data file (maybe stored somewhere like /var/lib or in the user's home directory
> / profile?) that would be used to tweak things when the executable is loaded.
>
> Just an idle thought here; I haven't really considered it for more than the few minutes that
> it took to write this post, so I could be missing some really big gotchas with this idea!
The Mill guys are apparently thinking of doing that. link
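To make the quoted idea a bit more concrete before getting to the Mill specifics, here is a minimal sketch in C of the auxiliary-file half of it. Everything in it is invented for illustration (the record layout, the /var/lib/branch-hints path scheme, the field names), and the "apply" step is only a comment, since as far as I know no current mainstream ISA gives the OS an architected way to pre-load the branch predictor.

/* Sketch only: a per-binary "branch hint" side file that the loader could
   consult at exec time. The record layout and path scheme are hypothetical. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct branch_hint {
    uint64_t branch_off;  /* offset of the branch within the text segment */
    uint64_t target_off;  /* most commonly observed target                */
    uint8_t  taken_bias;  /* 0..255 saturating "usually taken" counter    */
};

/* Read the auxiliary hint file for a given executable, e.g.
   /var/lib/branch-hints/<hash-of-binary>.hints (invented naming). */
static struct branch_hint *load_hints(const char *path, size_t *count)
{
    *count = 0;
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;
    fseek(f, 0, SEEK_END);
    long bytes = ftell(f);
    fseek(f, 0, SEEK_SET);
    size_t n = (size_t)bytes / sizeof(struct branch_hint);
    struct branch_hint *hints = malloc(n * sizeof *hints);
    if (hints)
        *count = fread(hints, sizeof *hints, n, f);
    fclose(f);
    return hints;
}

int main(void)
{
    size_t n;
    struct branch_hint *hints =
        load_hints("/var/lib/branch-hints/example.hints", &n);

    /* A real loader would now hand these to the CPU or patch hint bytes in
       the mapped image; since no such interface exists on current hardware,
       this just prints what it would apply. */
    for (size_t i = 0; i < n; i++)
        printf("branch +0x%llx -> +0x%llx (bias %u)\n",
               (unsigned long long)hints[i].branch_off,
               (unsigned long long)hints[i].target_off,
               (unsigned)hints[i].taken_bias);
    free(hints);
    return 0;
}

The collection side could come from existing performance-monitoring hardware (LBR-style branch sampling), which is roughly what AutoFDO-style tooling already does offline at compile time. On the Mill, the corresponding structure is the exit table: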
The exit table is seeded from a saved exit table stored in a dedicated load module section. This predefined exit table is initially produced by the compiler, which can create perfect entries for all unconditional branches and often has a pretty good idea of the conditional ones as well. This can be improved upon with profilers. Most conveniently, the hardware provides facilities to dump its exit table to memory, and the system runtime can use this to update the exit table section in the load module with real-world usage data. Properly automated, this improves the program's future performance every time a user runs it; done by hand, it lets a skilled developer squeeze out the last drop of performance.
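As I read that, the update path is essentially a merge: take the entries the hardware dumped during this run and fold them back into the seed table in the load module, so the next run starts from real usage. A toy version of that step, with an entry layout and field names that are my own guesses rather than anything the Mill documents:

/* Sketch of folding a hardware exit-table dump back into the seeded table.
   The entry layout and keying on the exit's code address are assumptions. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct exit_entry {
    uint64_t exit_addr;     /* address of the exiting branch/call point */
    uint64_t predicted_to;  /* predicted successor address              */
    uint32_t hit_count;     /* how often this prediction was confirmed  */
};

/* Merge a freshly dumped table into the seed table that will be written
   back into the load module's prediction section. Hotter, observed entries
   win; unseen seeds are kept so cold paths still have a prediction. */
static void merge_dump(struct exit_entry *seed, size_t seed_n,
                       const struct exit_entry *dump, size_t dump_n)
{
    for (size_t i = 0; i < dump_n; i++) {
        for (size_t j = 0; j < seed_n; j++) {
            if (seed[j].exit_addr == dump[i].exit_addr) {
                if (dump[i].hit_count >= seed[j].hit_count)
                    seed[j] = dump[i];   /* real usage overrides the guess */
                break;
            }
        }
    }
}

int main(void)
{
    struct exit_entry seed[] = {
        { 0x1000, 0x1040, 1 },   /* compiler's guess                  */
        { 0x2000, 0x2100, 1 },
    };
    const struct exit_entry dump[] = {
        { 0x1000, 0x1080, 500 }, /* hardware saw a different hot path */
    };

    merge_dump(seed, 2, dump, 1);
    printf("0x%llx now predicts 0x%llx\n",
           (unsigned long long)seed[0].exit_addr,
           (unsigned long long)seed[0].predicted_to);
    return 0;
}

A real implementation would presumably store offsets rather than absolute addresses (relocation/ASLR) and avoid rewriting the section on every single run, but the basic bookkeeping is that simple.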
The pre-seeding of the exit table happens in batches, so that prediction actually gains performance rather than doing a memory lookup on every miss. Whenever a call target isn't in the exit table, all exit table predictions connected to that call are loaded at once from the load module, so execution doesn't inch through the code with a load on every branch. There is a prediction load and a code load on the first few calls (often the constructors and main()), and after that the program runs at almost normal speed, with the run-time history updates providing the last few percent.
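The batching is the part that makes it pay off: a miss on a call target pulls in every exit prediction for the callee in one go, instead of taking a separate load on each branch inside it. A toy model of that policy, again using my own data structures rather than the Mill's:

/* Toy model of batched pre-seeding: on the first call into a function,
   install all of that function's exit predictions at once, so later
   branches inside it never stall on a one-at-a-time load. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_RESIDENT 64

struct func_preds {
    uint64_t func_addr;   /* call target                          */
    uint64_t exits[4];    /* predicted exits for its basic blocks */
    size_t   n_exits;
};

static struct func_preds resident[MAX_RESIDENT];
static size_t n_resident = 0;

static bool is_resident(uint64_t func_addr)
{
    for (size_t i = 0; i < n_resident; i++)
        if (resident[i].func_addr == func_addr)
            return true;
    return false;
}

/* Stand-in for reading the load module's prediction section. */
static struct func_preds load_batch_from_module(uint64_t func_addr)
{
    struct func_preds p = { func_addr, { func_addr + 0x20, func_addr + 0x48 }, 2 };
    return p;
}

static void on_call(uint64_t func_addr)
{
    if (is_resident(func_addr) || n_resident == MAX_RESIDENT)
        return;                  /* fast path: already seeded (no eviction in this toy) */
    /* Miss: one bulk load covers every exit in the callee. */
    resident[n_resident++] = load_batch_from_module(func_addr);
    printf("seeded %zu predictions for 0x%llx in one batch\n",
           resident[n_resident - 1].n_exits, (unsigned long long)func_addr);
}

int main(void)
{
    on_call(0x4000);   /* e.g. main(): one prediction load, then run */
    on_call(0x4000);   /* second call: no further loads              */
    return 0;
}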