By: Adrian (a.delete@this.acm.org), August 3, 2022 2:00 am
Room: Moderated Discussions
Anon (no.delete@this.spam.com) on August 2, 2022 11:46 am wrote:
>
> Getting quality data is much harder than you think, unlike you I have seen a lot of empirical
> data, often contradictory data, the problem is that the data depends on compiler technology,
> the existing ISAs, the workload, etc, improve one side and the other becomes suboptimal.
>
The dependence on compiler technology is the greatest difficulty in obtaining quality data.
Otherwise, the static frequencies of any feature of the instructions or of their encodings would be very easy to obtain from, e.g., the entire collection of packages of a Linux distribution. The dynamic frequencies take much more time to collect and also require test input data, but compiling with options that generate instrumented code makes gathering them possible.
However, such data obtained from compiled programs is only partially useful. There is a vicious circle: compilers avoid certain instructions, certain addressing modes, certain ways of generating immediate constants, etc., because those happen to be implemented poorly on current CPUs.
The collected empirical data then shows that those features are not used by compilers, which is taken as justification for implementing them poorly again in the next CPU designs, or for omitting them entirely when compatibility does not force keeping them.
So what current compilers generate seldom proves what the best ISA would look like.
The most accurate empirical data can be generated only by carefully translating a set of representative programs by hand into optimal sequences of machine instructions and then recording the frequencies of whatever is desired.
Of course, manual translation of an adequate number of programs takes too much time, so this is not a solution either.
Therefore I believe that a compromise must be made when an accurate comparison of different ISAs is intended.
One must select a reasonably large number of short programs that are considered typical of applications, in the spirit of the Livermore Loops. These must be translated by hand for every ISA being compared. Gathering statistics over these would provide one set of empirical data.
The second set of empirical data, from compiled programs, is easier to gather.
For static frequencies that is quite easy to do if you use an OS distribution that compiles everything from source code, e.g. Gentoo Linux or FreeBSD, which you can run in a VM or emulator to build everything for a given target ISA and with given compiler flags.
After compiling, the static frequencies over the entire set of programs can be gathered trivially for any instruction feature, e.g. to count the instructions with the same mnemonic and sort them by frequency:
objdump -d /usr/bin/* | cut -f3 | grep -oE "^[a-z]+" | sort | uniq -c | sort -n
This is just the simplest example. Classifying the immediate constants by size ranges would require more sophisticated parsing of the disassembly output, but that would still not be difficult for a Python/Perl/AWK script.
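For instance, here is a rough sketch of such a script in Python, assuming x86-64 AT&T-syntax output from objdump, where immediates are printed with a leading "$"; for another ISA or syntax the operand pattern would have to be adapted, and a more careful version would also undo the sign extension of negative immediates:

#!/usr/bin/env python3
# Sketch: histogram of immediate-operand sizes in "objdump -d" output.
# Assumes x86-64 AT&T syntax, where immediates look like $0x1f or $-16;
# adapt the IMM pattern for other ISAs or for Intel syntax.
import re
import sys
from collections import Counter

IMM = re.compile(r'\$(-?0x[0-9a-fA-F]+|-?[0-9]+)')
BUCKETS = (8, 16, 32, 64)                # size ranges, in bits

def bucket(value):
    # smallest bucket whose signed range contains the value
    bits = max(value, -value - 1).bit_length() + 1
    for b in BUCKETS:
        if bits <= b:
            return b
    return 128

counts = Counter()
for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    if len(fields) < 3:                  # not a disassembled instruction
        continue
    for imm in IMM.findall(fields[2]):
        counts[bucket(int(imm, 0))] += 1

for b in sorted(counts):
    print(f'fits in {b:3} bits: {counts[b]}')

Fed with the same "objdump -d /usr/bin/*" output as above, it prints how many of the immediates fit in 8, 16, 32 or 64 bits.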
For dynamic frequencies, most of the work would be in generating representative test input data so that the programs can be run to gather the statistics.
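The counting step itself would again be small. As a sketch of only that step, assuming the chosen instrumentation or emulator can be made to emit a text file of executed-instruction addresses with their execution counts (a placeholder format, purely for illustration), combining it with the static disassembly gives the dynamic mnemonic frequencies:

#!/usr/bin/env python3
# Sketch: dynamic mnemonic frequencies from two inputs (formats assumed):
#   disasm.txt - "objdump -d" output of the program
#   trace.txt  - lines of "<hex address> <execution count>", as produced by
#                whatever instrumentation or emulator is actually used
import sys
from collections import Counter

def mnemonics_by_address(disasm_path):
    # map instruction address -> mnemonic, parsed from the disassembly
    table = {}
    with open(disasm_path) as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 3 or not fields[2].strip():
                continue
            addr = int(fields[0].strip().rstrip(':'), 16)
            table[addr] = fields[2].split()[0]
    return table

def dynamic_counts(disasm_path, trace_path):
    table = mnemonics_by_address(disasm_path)
    counts = Counter()
    with open(trace_path) as f:
        for line in f:
            addr, count = line.split()
            mnem = table.get(int(addr, 16))
            if mnem is not None:
                counts[mnem] += int(count)
    return counts

if __name__ == '__main__':
    for mnem, n in dynamic_counts(sys.argv[1], sys.argv[2]).most_common():
        print(f'{n:12} {mnem}')

The same weighting works for any other feature, e.g. immediates or addressing modes, by changing what is stored per address.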
How to weight the results from the two sets of empirical data is unlikely to have an objective answer.