To elaborate

By: Adrian, July 13, 2020 1:45 pm
Room: Moderated Discussions
Anon on July 13, 2020 7:36 am wrote:

> I understand, but I don't think it is relevant alone, there is power too and a complicated
> trade-off, if it was just area one could argue that you could reduce the per core L2 to increase the number
> of cores, maybe power efficiency says you can't. When running AVX-512 code Intel CPUs reduce frequency,
> how many non-AVX512 cores would be possible running at that reduced frequency?

Obviously, when the target is to execute code that cannot benefit from AVX-512, it is more efficient to remove the AVX-512 hardware and add more cores instead.

On the other hand, if the application can use something like AVX-512 and you do not want to use an off-chip accelerator for latency reasons, then it is impossible to beat the AVX-512 solution by adding more cores, despite the reduced clock frequency.

The reason is that for Skylake-like CPUs, when doing 128-bit FP computations, only about 50% of the power consumption goes to the actual computation and the other 50% goes to control or other indirect functionality. With AVX-512, about 80% of the power goes to the actual computation.

Therefore, for a given power limit per socket, a CPU with 512-bit SIMD will always beat any CPU with narrower FP units.
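To make the arithmetic behind that claim explicit, here is a rough sketch in Python. It only uses the approximate 50% and 80% figures from above. The socket power limit of 200 W is a made-up number for illustration, and the model assumes that the energy spent in the arithmetic units per FP operation is about the same at both vector widths, which is a simplification and not a measurement.

```python
# Back-of-the-envelope model: under a fixed socket power limit, FP throughput
# is taken to be proportional to the power that reaches the arithmetic units.
# Assumption (not measured): energy per FP operation inside the arithmetic
# units is the same for 128-bit and 512-bit execution.

SOCKET_POWER_W = 200.0          # hypothetical power limit, for illustration only
COMPUTE_FRACTION_128 = 0.50     # approximate share of power doing computation, 128-bit FP
COMPUTE_FRACTION_512 = 0.80     # approximate share of power doing computation, AVX-512


def compute_power(socket_power_w: float, compute_fraction: float) -> float:
    """Power (in watts) that ends up in the actual computation."""
    return socket_power_w * compute_fraction


narrow_w = compute_power(SOCKET_POWER_W, COMPUTE_FRACTION_128)
wide_w = compute_power(SOCKET_POWER_W, COMPUTE_FRACTION_512)

# Relative throughput at equal socket power; the power limit cancels out,
# so the ratio does not depend on the hypothetical 200 W figure.
throughput_ratio = wide_w / narrow_w

print(f"128-bit: {narrow_w:.0f} W in computation")
print(f"512-bit: {wide_w:.0f} W in computation")
print(f"throughput ratio (512-bit / 128-bit): {throughput_ratio:.2f}")
```

Under these assumptions the ratio is simply 0.80 / 0.50, whatever the power limit is. The lower AVX-512 clock frequency and the number of cores do not appear separately, because both are already consequences of staying within the same power limit.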

That does not necessarily mean that it is good to have maximum vector width on all cores. I believe that an asymmetrical dual-socket system could be faster. For example, one socket could hold an 8-core or 16-core CPU running at maximum clock frequency, which would not be down-clocked when the other socket consumes maximum power, while the other socket could hold e.g. a 64-core CPU with dual 512-bit FMA units.

Lacking such a hybrid computer, I am using the traditional solution of having a workstation with a medium-core-count high-frequency CPU, connected via Ethernet to several computational servers with high-throughput high-core-count lower-frequency CPUs.

