By: Brendan (btrotter.delete@this.gmail.com), May 25, 2022 6:16 pm
Room: Moderated Discussions
Hi,
Jukka Larja (roskakori2006.delete@this.gmail.com) on May 25, 2022 6:24 am wrote:
> Brendan (btrotter.delete@this.gmail.com) on May 24, 2022 5:09 pm wrote:
>
> > a) even if ISAs are exactly the same there could be up to 10% performance/efficiency improvement because
> > lots of optimizations (instruction selection and scheduling, which instructions are fused or not, prefetch
> > scheduling distance, whether branch prediction has aliasing issues with "too many branches too close",
> > which cache size for cache blocking optimizations, ...) depend on micro-arch (and P cores and E cores use
> > very different micro-arch, and ARM's "big" cores and "little" cores use very different micro-arch)
>
> How much CPU model optimized code do you think is running on PCs? I tried to find such parameters
> for Visual Studio, but failed. Doesn't seem to be something developers often do.
Honestly; with the growing number of software developers sacrificing the quality of their end product to reduce development time, I'd guess that the amount of optimized code running on PCs is about 5% of what it should be.
A large part of that is due to the majority of software not being performance critical anyway.
Another part of it is that the tools we use are shit. Unless you're using an "install from source" distro (e.g. Gentoo) or writing embedded software (where you know the exact target in advance), you're stuck with native binaries where the only option is run-time dispatch; and run-time dispatch isn't supported well by any compiler, so you end up with a painful pile of hacky nonsense to achieve something your tools don't support.
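To give a concrete idea of what that hacky nonsense tends to look like, here's a minimal sketch (GCC/Clang-style C, hypothetical function names) of the usual manual approach: test the CPU's feature bits once, then dispatch through a function pointer to whichever variant the machine can actually run:

#include <stddef.h>
#include <stdio.h>

/* Portable fallback that works on any x86. */
static float sum_scalar(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Stand-in for a hand-tuned fast path; a real build would use AVX2 intrinsics here. */
static float sum_avx2(const float *a, size_t n)
{
    return sum_scalar(a, n);
}

typedef float (*sum_fn)(const float *, size_t);

/* Pick a variant at run time based on CPUID feature bits. */
static sum_fn resolve_sum(void)
{
    __builtin_cpu_init();                  /* GCC/Clang built-in: populate the feature cache */
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2;
    return sum_scalar;
}

int main(void)
{
    const float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
    sum_fn sum = resolve_sum();
    printf("sum = %f\n", (double)sum(data, 4));
    return 0;
}

The compiler hands you the feature-test built-in and not much else; keeping the variants in sync, resolving them, and making sure nothing ever calls the fast path on the wrong CPU is all left to you.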
Ironically; this is also half of the reason why JIT (which seems like it should suck badly) is able to get within 90% of the performance of native code - native code simply sucks so badly that JIT (which can optimize for the actual target CPU a little) doesn't seem awful in comparison.
Ideally people would install (sanity-checked and pre-optimized) portable byte-code; and the OS would compile it into native code to suit the computer (not just the CPU - things like RAM speed can matter too), including "whole program optimization" (where shared libraries are statically linked); and the OS would automatically re-compile the (cached) native code from the original byte-code when necessary (including when shared libraries are updated, or the byte-code compiler is updated).
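As a rough sketch of just the "re-compile when necessary" part (hypothetical file paths, and only a timestamp-based staleness check - a real OS would also track shared library and compiler versions properly), the loader would need something like:

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* True if file a is newer than file b (or if either can't be examined). */
static bool newer(const char *a, const char *b)
{
    struct stat sa, sb;
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return true;    /* missing or unreadable: treat the cache as stale */
    return sa.st_mtime > sb.st_mtime;
}

/* True if the cached native image should be rebuilt from the byte-code. */
static bool native_is_stale(void)
{
    return newer("/apps/example.bc", "/cache/example.native") ||
           newer("/usr/bin/bytecode-cc", "/cache/example.native");
}

int main(void)
{
    printf(native_is_stale() ? "recompile\n" : "cached native code is up to date\n");
    return 0;
}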
Of course this is mostly orthogonal to (and a distraction from) the "for or against homogeneous CPU support" debate.
- Brendan