By: Adrian (a.delete@this.acm.org), May 29, 2022 4:33 am
Room: Moderated Discussions
Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 29, 2022 12:51 am wrote:
> Adrian (a.delete@this.acm.org) on May 24, 2022 2:39 pm wrote:
> > I have not tried this on more recent Intel CPUs, but in a measurement on Skylake Server CPUs
> > (with 2 512-bit FMA units) done a few years ago, the ratio between the energies needed to
> > compute some LINPACK benchmark in AVX-512 and in AVX2 (i.e. with 256-bit FMA/LD/ST) modes
> > was around 5/6, so a little more than your maximum estimation, but not much more.
> Interesting, can you share some pointers on how this was measured so I can try it for
> AVX-512 vs scalar? (I suspect that is a much larger difference than AVX2 vs AVX-512.)
The starting point is having a BLAS library that includes optimized variants for the various ISA options, e.g. scalar, 128-bit SSE2, 128-bit AVX, 256-bit AVX, 256-bit AVX-512, 512-bit AVX-512.
There are many such BLAS libraries, both open-source and proprietary.
Then one can choose some linear algebra benchmark, e.g. LINPACK or one based on DGEMM, and compile and link it for all the ISA variants that must be tested.
The problem size should be large enough that the benchmark runs for a long time, e.g. 5 to 10 minutes, so that most of the time is spent at the steady-state power consumption and clock frequencies.
Then you can write a script to run all the ISA-dependent executables, preferably running each of them with various numbers of active threads (and pinning the threads to cores), to obtain more data, e.g. how the clock frequency, the power consumption and the total energy depend on the number of active cores, not only on the ISA used.
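A minimal sketch of such a script, in Python. The executable names are hypothetical, and it assumes that the BLAS library uses OpenMP threading (so that OMP_NUM_THREADS and OMP_PROC_BIND are honored) and that the Linux taskset utility is available. It also assumes that cores 0 to n-1 are distinct physical cores, which must be checked on machines with SMT.

```python
import os
import subprocess
import time

def build_runs(executables, thread_counts):
    """Return one (command, extra_env) pair per executable and thread count."""
    runs = []
    for exe in executables:
        for n in thread_counts:
            # Restrict the process to the first n cores (assumed numbering).
            cores = "0" if n == 1 else f"0-{n - 1}"
            cmd = ["taskset", "-c", cores, exe]
            # Assumption: the benchmark is threaded with OpenMP.
            env = {"OMP_NUM_THREADS": str(n), "OMP_PROC_BIND": "true"}
            runs.append((cmd, env))
    return runs

def run_all(runs):
    """Run each configuration in turn and record its wall-clock time in seconds."""
    results = []
    for cmd, extra_env in runs:
        start = time.monotonic()
        subprocess.run(cmd, env={**os.environ, **extra_env}, check=True)
        results.append((cmd, extra_env, time.monotonic() - start))
    return results

# Example (hypothetical executable names):
# run_all(build_runs(["./linpack_avx2", "./linpack_avx512"], [1, 2, 4, 8]))
```

The run times recorded here are only the skeleton; the power sampling and the reading of the wall-plug meter would be added around each run.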
There are many devices that can be inserted between the wall plug and the PSU cable to measure the power and energy of the complete computer system and display them on an LCD screen. You can read the total energy, the average power and the maximum power after each test run, and then reset the device for the next test. There are also more expensive devices of this kind that can be read directly by a computer.
The power consumed by the CPU cores, their clock frequencies and their temperature can be sampled periodically during the test by the test script, by reading them from the CPU internal sensors with programs like turbostat (a utility shipped with the Linux kernel sources) or many other such programs.
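As an alternative to parsing turbostat output, on Linux with Intel CPUs the package energy counter can also be read directly from the powercap sysfs interface. The following is only a sketch of that swapped-in approach, not what I used in the measurement mentioned above. It assumes that the intel-rapl:0 directory exists and is readable (on recent kernels this usually requires root), and that the counter wraps at most once between two readings.

```python
from pathlib import Path

# Assumed location of the package 0 RAPL domain; it may differ or be absent.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")

def read_counter(name, base=RAPL_DIR):
    """Read one integer value (in microjoules) from the powercap directory."""
    return int((base / name).read_text().strip())

def energy_delta_joules(before_uj, after_uj, max_range_uj):
    """Energy between two counter readings, allowing for a single wraparound."""
    delta = after_uj - before_uj
    if delta < 0:
        # The counter wrapped past its maximum value between the readings.
        delta += max_range_uj
    return delta / 1e6

# Typical use around one sampling interval:
#   max_uj = read_counter("max_energy_range_uj")
#   before = read_counter("energy_uj")
#   ... wait for the sampling period ...
#   after = read_counter("energy_uj")
#   joules = energy_delta_joules(before, after, max_uj)
```

Dividing each energy difference by the sampling period gives the average power during that period.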
The power consumption samples can be integrated over the test time, giving the energy consumed in the CPU alone.
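For instantaneous power samples, the integration can be done with the trapezoidal rule. A minimal sketch, assuming the samples were stored as (time in seconds, power in watts) pairs in increasing time order; the function name and the numbers in the example are made up for illustration.

```python
def integrate_power(samples):
    """Trapezoidal integral of (time_s, power_w) samples; returns joules."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Area of one trapezoid: mean power times elapsed time.
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy

# Illustrative values only, not measured data.
samples = [(0.0, 100.0), (1.0, 120.0), (2.0, 110.0)]
print(integrate_power(samples))  # 225.0
```

If the tool already reports the average power over each sampling interval, which I believe is the case for the turbostat package power column, then summing power times interval length is the more direct form.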
The easiest approach is to measure only the energy consumed by the CPU, as that needs only free software. To measure the energy consumed by the entire computer system, you also need the hardware measurement device, but that is not expensive and it is useful for many other purposes.