By: Jeff S. (fakity.delete@this.fake.com), August 24, 2018 9:24 am
Room: Moderated Discussions
Travis (travis.downs.delete@this.gmail.com) on August 24, 2018 8:41 am wrote:
> I haven't tried to look at the transition behavior yet - these measurements are all basically
> "fully warmed up" because I do a lot of warmup iterations and then use the median results
> from many trials. I check both the actual CPU frequency and the "implied" frequency which
> is based on the number of operations executed and their known latency and in every case the
> actual and implied frequencies are consistent, so the CPU is running at full speed.
>
> I have no doubt that there is a transition period where things are weird though: there is also such a period
> on Skylake client (and some Haswell chips) when any 256-bit operation is executed for the first time in
> a while, and during this period 256-bit operations execute at about 25% of their usual throughput.
>
> There is a chance though that even in steady state, AVX-512 FMA operations are being executed
> via dual issue and reassembled and that this happens without any loss in performance, including
> no latency effect (after all, this is a 1 AVX2-FMA chip, so the total throughput is the same for
> AVX2 or AVX-512 FMA). That seems unlikely (and in any case, what is the distinction between parallel
> dual issue of 2 256-bit FMAs or single issue of 1 512-bit to the same hardware)?
>
> Note that I always saw some downclocking with AVX-512: I never saw a case where AVX-512 ran
> at the full 3.2 GHz base frequency: it was always at the 2.8 GHz or 2.4 GHz tier. The oddity
> was that it was running at 2.8 in some cases where I'd expect it to have been at 2.4.
We have seen Skylake being smarter (or more protective of performance) than Broadwell about which particular instructions trigger which downclock tier, but I think what you are saying about there always being some frequency hit matches what we've seen. The dual issue of 256-bit operations makes sense, given that the second 512-bit FMA has an extra cycle penalty on its execution port (port 5?), which presumably should be avoided whenever possible.
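
For anyone wanting to repeat the "implied frequency" check from the quote, here is a minimal sketch of the arithmetic, assuming the benchmark is a single dependent chain of operations so that each one costs its full latency. The function names, the 2% tolerance, and the 4-cycle latency used in the example are illustrative assumptions, not values taken from the measurements above.

```python
def implied_frequency_ghz(num_ops, latency_cycles, elapsed_seconds):
    """Frequency implied by a dependent chain of operations.

    Assumes every operation waits for the previous one, so the chain
    takes num_ops * latency_cycles core cycles in total.
    """
    total_cycles = num_ops * latency_cycles
    return total_cycles / elapsed_seconds / 1e9


def frequencies_consistent(actual_ghz, implied_ghz, rel_tol=0.02):
    """True if the measured and implied frequencies agree within rel_tol.

    rel_tol is a hypothetical threshold; choose one that matches the
    noise of your own measurements.
    """
    return abs(actual_ghz - implied_ghz) <= rel_tol * actual_ghz


# Example with assumed numbers: 700 million dependent operations of
# 4 cycles each, finishing in 1.0 s, imply a 2.8 GHz clock.
implied = implied_frequency_ghz(700_000_000, 4, 1.0)
ok = frequencies_consistent(2.8, implied)
```

If the implied value comes out well below the measured clock, the operations are taking more cycles than their nominal latency, which is the sort of thing you would expect to see during a transition period.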