By: Andrey (andrey.semashev.delete@this.gmail.com), May 19, 2022 1:50 pm
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on May 19, 2022 11:23 am wrote:
> Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 18, 2022 10:40 pm wrote:
> > Brett (ggtgp.delete@this.yahoo.com) on May 18, 2022 11:03 am wrote:
> >
> > > High core count CPU’s are already memory starved, so adding AVX512 is pointless.
> > > With 5nm CPU’s you can scratch off the HPC market needing AVX512, as you can’t
> > > feed that many CPU’s much less the doubled bandwidth needs of AVX512 units.
> > The current #1 in HPC has 1 TB/s per socket. AFAIK the bandwidth of SPR-HBM is not
> > yet known but could be similar. How do we feed those without 512-bit vectors?
>
> Wait 2-3 years for the next shrink with twice as many cores,
> making AVX512 pointless again due to lack of bandwidth.
Multithreading and SIMD are orthogonal and complementary; one will not replace the other. For a start, some tasks are inherently easier and more efficient to solve with SIMD, while others are better suited to threads. The two approaches also carry very different costs in terms of software design and developer effort, which may favor one or the other in any given case.
Furthermore, for the same throughput, more cores are always more expensive than SIMD in terms of die area and power consumption. SIMD is little more than extra execution units; a full core is much more than that. To illustrate, SSE4.1 can perform up to 16 arithmetic operations on individual bytes in parallel. When SSE4.1 was new (around the Nehalem era), I don't think there were any 16-core SKUs, even in the server domain. And even the 10-core SKUs had much higher TDP and lower frequencies than the desktop parts that had SSE4.1.
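To put a number on the "16 byte operations at once" claim, here is a minimal sketch with SSE intrinsics (byte-wide addition has actually been available since SSE2; SSE4.1 just fills out the integer operations, but the illustration is the same):

#include <emmintrin.h> /* SSE2 intrinsics, which SSE4.1 builds on */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t a[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
    uint8_t b[16] = { 100, 100, 100, 100, 100, 100, 100, 100,
                      100, 100, 100, 100, 100, 100, 100, 100 };
    uint8_t r[16];

    __m128i va = _mm_loadu_si128((const __m128i*) a);
    __m128i vb = _mm_loadu_si128((const __m128i*) b);

    /* A single instruction adds 16 pairs of bytes in parallel. */
    __m128i vr = _mm_add_epi8(va, vb);

    _mm_storeu_si128((__m128i*) r, vr);

    for (int i = 0; i < 16; ++i)
        printf("%u ", (unsigned) r[i]);
    printf("\n");
    return 0;
}

One 128-bit instruction does the work of 16 scalar additions, and all it costs the chip is a wider execution path, not another front end, scheduler and cache hierarchy.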
> > > Now all 8 cores are down clocked in response and your net performance uplift of AVX512 is negative.
> > You might find these results surprising: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html#summary
>
> Not surprised, for tests that fit in L1 cache register operations are
> measured in picojoules, whereas bandwidth is measured in watts.
>
> The real issue is that big OoOE cores are big and hot and expensive, the same vector work can
> be accomplished by hard coded math blocks at 10% of the die size and 10% of the power.
>
> Rather than go from 8 cores to 16, you can stay at 8 and add 80 hard coded blocks to do otherwise impossible
> tasks like real video compression encoding. This is the path Apple is taking, the better path. ;)
The downside is that your hardware blocks are a waste of silicon if your task is not covered by their fixed function. So when a new video codec comes out, you may as well throw away your CPU. Which is exactly what Apple wants you to do. Not quite what I'd call a "better path".
Yes, fixed function blocks are good, but for a limited set of applications. Primarily for stuff that is ubiquitous and realistically not going away any time soon, like AES, for example. More specialized fixed function blocks may be justified, but only when you know they will actually be used, i.e. when you target the silicon at a specific domain or a specific customer. In all other cases, the hardware blocks must be general enough to solve a whole class of tasks.
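To show what I mean by AES being a reasonable thing to bake into hardware: on x86 that fixed function takes the form of the AES-NI instructions rather than a separate block. A rough sketch of using them (the round keys below are dummy values; a real AES-128 encryption derives them via key expansion, which I've omitted for brevity; compile with -maes):

#include <wmmintrin.h> /* AES-NI intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    __m128i block = _mm_set1_epi8(0x42);   /* 16-byte plaintext block */
    __m128i round_keys[11];
    for (int i = 0; i < 11; ++i)
        round_keys[i] = _mm_set1_epi32(i); /* dummy round keys, for illustration only */

    /* Standard AES-128 encryption sequence: key whitening, 9 full rounds,
       one final round. Each aesenc performs an entire AES round in hardware. */
    __m128i state = _mm_xor_si128(block, round_keys[0]);
    for (int i = 1; i < 10; ++i)
        state = _mm_aesenc_si128(state, round_keys[i]);
    state = _mm_aesenclast_si128(state, round_keys[10]);

    uint8_t out[16];
    _mm_storeu_si128((__m128i*) out, state);
    for (int i = 0; i < 16; ++i)
        printf("%02x", out[i]);
    printf("\n");
    return 0;
}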
> Most software cannot make use of more than 2 cores, games can use 4, and up to 8.
> 16 cores is just stupid for the average user, Microsoft Word will not run faster.
Do you need Word to run faster? Are you limited by its speed?
The main driver for having more cores is not office applications or even games, although games might evolve to become more parallel. The main driver is productivity software, most of which will happily eat all the cores you throw at it. Think code compilation, rendering, video compression and so on.
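As an illustration of how that kind of software scales: the usual pattern is simply one worker per hardware thread, each given an independent chunk of work (a translation unit, a tile, a video slice). A toy sketch with POSIX threads; the CPU count query is Linux/POSIX-specific and the workload is fake, just for illustration:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Placeholder for an independent chunk of work: a translation unit to
   compile, a tile to render, a slice to encode, etc. */
static void* worker(void* arg)
{
    long id = (long) (intptr_t) arg;
    double x = 0.0;
    for (long i = 1; i < 50000000; ++i)
        x += 1.0 / (double) i;
    printf("worker %ld done (%f)\n", id, x);
    return NULL;
}

int main(void)
{
    /* One worker per online CPU. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    pthread_t* threads = malloc((size_t) n * sizeof(pthread_t));
    if (threads == NULL)
        return 1;

    for (long i = 0; i < n; ++i)
        pthread_create(&threads[i], NULL, worker, (void*) (intptr_t) i);
    for (long i = 0; i < n; ++i)
        pthread_join(threads[i], NULL);

    free(threads);
    return 0;
}

make -j$(nproc), a tiled renderer or a video encoder all boil down to this pattern, which is why they keep scaling as core counts grow.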