By: Brett (ggtgp.delete@this.yahoo.com), May 20, 2022 11:52 pm
Room: Moderated Discussions
Andrey (andrey.semashev.delete@this.gmail.com) on May 19, 2022 1:50 pm wrote:
> Brett (ggtgp.delete@this.yahoo.com) on May 19, 2022 11:23 am wrote:
> > Jan Wassenberg (jan.wassenberg.delete@this.gmail.com) on May 18, 2022 10:40 pm wrote:
> > > Brett (ggtgp.delete@this.yahoo.com) on May 18, 2022 11:03 am wrote:
> > >
> > > > High core count CPUs are already memory starved, so adding AVX512 is pointless.
> > > > With 5nm CPUs you can scratch off the HPC market needing AVX512, as you can’t
> > > > feed that many CPUs, much less the doubled bandwidth needs of AVX512 units.
> > > The current #1 in HPC has 1 TB/s per socket. AFAIK the bandwidth of SPR-HBM is not
> > > yet known but could be similar. How do we feed those without 512-bit vectors?
> >
> > Wait 2-3 years for the next shrink with twice as many cores,
> > making AVX512 pointless again due to lack of bandwidth.
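(Back-of-the-envelope, my rough numbers, not from the quoted posts: a core that can issue one 64-byte load per cycle at ~2 GHz can stream about 128 GB/s, so roughly 8 such cores could in principle consume 1 TB/s of socket bandwidth; with 16-byte (128-bit) loads you need about 4x the load instructions, or 4x the cores, to move the same stream. Whether that favors wider vectors or more cores depends on how much of the workload is actually bandwidth bound.)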
>
> Multithreading and SIMD are orthogonal and complementary; one will not replace the other. For
> a start, some tasks are inherently easier and more efficient to solve with SIMD, while others
> are easier with threads. The two approaches also carry different costs in software design and
> developer effort, which may favor one or the other form of parallelism in any given case.
>
> Furthermore, more cores are always more expensive than SIMD for the same throughput, in terms of die space
> and power consumption. SIMD is little more than execution units. A full core is much more than that. To illustrate,
> SSE4.1 is able to perform up to 16 arithmetic operations on individual bytes in parallel. Around the time
> SSE4.1 was new (the Nehalem era), I don't think there were 16-core SKUs, even in the server domain. But even
> 10-core SKUs had much higher TDP and lower frequencies than the desktop parts that had SSE4.1.
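(To make the 16-bytes-per-instruction point concrete, here is a minimal sketch of mine, not from Andrey's post; the intrinsic used, _mm_add_epi8, is technically SSE2, which every SSE4.1-era part also implements:)

#include <emmintrin.h>   /* SSE2 intrinsics: __m128i, _mm_add_epi8 */
#include <stdint.h>

/* Element-wise add of two 16-byte arrays: a single paddb handles all 16 lanes. */
void add16bytes(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);   /* unaligned 16-byte load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));  /* 16 byte adds in one op */
}

(A scalar version needs 16 separate byte adds, which is the core-count argument in miniature.)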
>
> > > > Now all 8 cores are down clocked in response and your net performance uplift of AVX512 is negative.
> > > You might find these results surprising: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html#summary
> > >
> >
> > Not surprised, for tests that fit in L1 cache register operations are
> > measured in picojoules, whereas bandwidth is measured in watts.
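(Rough ballpark figures of mine, not from the thread: a register-width ALU operation costs on the order of a picojoule, while pulling a byte in from DRAM costs tens to hundreds of picojoules once the memory interface is counted, so streaming 100 GB/s can burn watts on memory traffic alone before any computation happens. That is why L1-resident AVX-512 microbenchmarks look so cheap.)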
> >
> > The real issue is that big OoOE cores are big and hot and expensive, the same vector work can
> > be accomplished by hard coded math blocks at 10% of the die size and 10% of the power.
> >
> > Rather than go from 8 cores to 16, you can stay at 8 and
> > add 80 hard coded blocks to do otherwise impossible
> > tasks like real video compression encoding. This is the path Apple is taking, the better path. ;)
>
> The downside is that your hardware blocks are a waste of silicon if your task is not covered
> by their fixed function.
You want 8 extra cores when the first 8 cores are already a waste of silicon for 99% of users, whereas the hardware blocks will get used by far more than 1% of users.
Software video decode is dying, if not already dead, even on Intel CPUs, because of the power cost of running it on that charcoal briquette of a CPU.
> So when a new video codec comes out you may as well throw away your
> CPU. Which is exactly what Apple wants you to do. Not quite what I'd call a "better path".
It takes a decade for a new video codec standard to get finalized, more likely two decades, as standards organizations wait for patents to expire.
Apple is leading in the VR field with its camera depth sensor, and building a VR world that maps onto the real world requires hardware blocks, not CPU power.
Any new breakthrough is going to require hardware blocks; CPUs are too slow and too hot.
Your VR glasses are not going to be powered by a 16-core CPU in a backpack. ;)
Hint: an iPhone is faster than many desktops, and not just because of the awesome 6-wide CPU, but because half of that die is hardware blocks for those new uses you claim you want a slow, hot CPU to handle.
> Yes, fixed function blocks are good - for a limited set of applications. Primarily, for stuff
> that is ubiquitous and realistically not going away any time soon. Like AES, for example. It
> may be justified to have more specialized fixed function blocks, but only when you know they
> will be in use, i.e. when you target the silicon to a specific domain or a specific customer.
> In all other cases, the hardware blocks must be useful to solve a certain class of tasks.
>
> > Most software cannot make use of more than 2 cores; games can use 4, maybe up to 8.
> > 16 cores is just stupid for the average user; Microsoft Word will not run any faster.
>
> Do you need Word to run faster? Are you limited by its speed?
Yes, I would like Word to run faster; spell checking and grammar checking somehow make Word slow. It makes my soul burn when Word falls behind my typing. 5 GHz, and Word still can’t keep up.
> The main driver for having more cores is not office applications or even games. Although games might evolve
> to become more parallel. The main driver is productivity software. Most of that kind of software will happily
> eat all cores you throw at it. Think code compilation, rendering, video compression and so on.