By: Jeff S. (fakity.delete@this.fake.com), August 24, 2018 6:22 am
Room: Moderated Discussions
Travis (travis.downs.delete@this.gmail.com) on August 24, 2018 1:01 am wrote:
> With chips that support AVX-512 you have three speed "tiers" the CPU can be running in, let's call them "base",
> "avx2" and "avx-512" speeds, from fastest to slowest. This is described on wikichip and other places.
>
> The tiers aren't actually strictly defined by AVX2 and AVX-512: there are light and
> heavy instructions, so only heavy AVX-512 have to suffer the AVX-512 speed, light AVX-512
> is grouped with heavy AVX-256, and light AVX-256 could be as cheap as scalar.
>
> Earlier energy saving transitions relating to AVX2 kicked in as soon as you execute a single
> instruction, and the assumption seems to be that this is true for this stuff too: e.g., if
> you run one heavy AVX-512 instruction, you'll get downclocked to the slow AVX-512 speed.
>
> This doesn't seem to be the case, at least based on the one Skylake-W 2104 CPU I tested. The three
> tiers exist, but you only drop down if you run "enough" of the heavy instructions. Enough is quite
> a bit: even a long loop of back-to-back serial AVX2 FMAs (which execute in this case at one per cycle
> since they are dependent) doesn't trigger the AVX2 downclock (it runs at the full nominal "scalar"
> speed of 3.2 GHz). Similarly for AVX-512: although any AVX-512 instruction triggers downclocking
> to the middle AVX2 tier, even a stream of 1 FMA every 4 or even 2 cycles doesn't set it down lower.
> The lowest speed is only reached if FMAs come at a rate of more than 1 every 2 cycles.
>
> Note that this rule only applies to the "heavy" instructions. Light instructions do seem to activate
> their speed even if they are very sparse: but since AVX2 light is in the fastest tier, this only
> matters for light AVX-512 instructions, which will put the CPU into the AVX2 speed tier.
>
> So at least with respect to the "heavy" instructions, you
> can use quite a few before hitting the dreaded downclock.
>
> If anyone wants to run my test on a more interesting system than the W-2104, you can find it here:
>
> https://github.com/travisdowns/avx-turbo
>
> By default it will run on 1 core up to the number of cores you have so you can see
> the progression as active cores increase. In particular, you can use it to verify if
> you're getting the published relationship between code type and active core count.
The evidence Alex saw in his early investigations with y-cruncher suggested (IIRC) that warming up fuller widths of the FMA banks was what triggered downclocking, but that the effect had a non-trivial lag, during which execution might have been handled through dual issue of half-width operations. Your new turbo tester may or may not be more accurate in measuring this, but at least keep his hypothesis in mind. He's still out in CA after Hot Chips visiting friends and family (and hopefully not actually working right now...), but you might find some of the discussions he's had on this on Agner's blog or the Mersenne forum until he gets back.
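
To keep the two pictures straight, here is a rough sketch of the rule as the quoted post describes it, written as a small Python lookup. The function name `observed_tier` and its parameters are made up for illustration. The thresholds are only the ones reported for that single W-2104, and the sketch deliberately ignores any lag of the kind Alex's hypothesis involves. Where the post gives no data (heavy AVX2 faster than one per cycle) it returns `None` rather than guessing.

```python
def observed_tier(width_bits, heavy, per_cycle):
    """Steady-state speed tier per the quoted W-2104 observations.

    width_bits: 256 for AVX2, 512 for AVX-512, anything else is scalar.
    heavy:      True for "heavy" instructions such as FMAs.
    per_cycle:  average rate of those instructions per cycle.
    Returns "base", "avx2", "avx-512", or None where the post has no data.
    Hypothetical helper: not part of avx-turbo, and no lag is modeled.
    """
    if per_cycle <= 0:
        return "base"
    if width_bits == 512:
        # Any AVX-512 instruction, even a sparse light one, drops to the
        # middle tier; only heavy ones faster than 1 per 2 cycles go lower.
        if heavy and per_cycle > 0.5:
            return "avx-512"
        return "avx2"
    if width_bits == 256:
        if not heavy:
            return "base"  # light AVX2 stays in the fastest tier
        if per_cycle <= 1.0:
            return "base"  # serial FMAs at 1 per cycle did not downclock
        return None        # higher rates are not reported in the post
    return "base"


if __name__ == "__main__":
    for args in [(512, True, 0.25), (512, True, 1.0), (512, False, 0.01),
                 (256, True, 1.0), (256, True, 2.0)]:
        print(args, observed_tier(*args))
```

If Alex's lag effect is real, the `per_cycle` cutoffs here would depend on how long the stream has been running, which a steady-state table like this cannot express.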