AVX-512 unfriendly to heter-performance cores

By: Brett (ggtgp.delete@this.yahoo.com), July 31, 2022 7:26 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on July 31, 2022 3:20 pm wrote:
> Adrian (a.delete@this.acm.org) on July 30, 2022 9:27 am wrote:
> [snip]
> > AMD has stated clearly that there will be no ISA difference between the compact core and the big core,
> > so it must also support AVX-512. This was obviously meant to contrast with their competition, who
> > always had such differences, e.g. Denverton vs. Skylake Server (launched at the same time).
>
> While such ISA compatibility has significant benefits, I suspect

That just cuts the size of the OoO window in half when using lots of AVX-512 registers, and AVX code tends to be cache friendly meaning you don’t need a big OoO window.

> While a "low performance"
> core would not have to have as much register renaming and would presumably have fewer execution resources, the
> large state still feels burdensome. Some resources could be shared between cores in a Bulldozer-like design.
> Some resources might shared between threads (if multithreading is supported in a low performance core). GPU-style
> wider register access than execution and traditional banking could reduce the port count and thus area, but
> a 8 KiB single-ported register file would still seem to be kind of chunky. With sharing across cores, if AVX-512
> is lightly used by both cores or heavily used by one, the active state could be smaller.
>
> Narrower execution and sharing across threads would tend to reduce
> the need for rename registers by hiding some latency.
>
> (I suppose, if full AVX-512 use is expected to be uncommon for small cores, it might be practical to steal
> storage from L1 cache to provide enough state for correct operation when more than eight or so registers
> are used. Even AVX2's 16 32-byte registers seems large for a core oriented for size. In terms of energy/power,
> power gating may minimize the effect of more architected state when the state is unused.)
>
> Another consideration is whether smaller cores are designed for throughput — where wider execution would
> be desirable — or power/area. SIMD-friendly workloads would benefit less from out-of-order execution and
> other area/power-expensive features, so one could imagine a smaller core providing relatively high performance
> for such workloads but not preforming as well as a great-big-OoO design for "general purpose" workloads.
>
> > For now, it is certain that the compact core will have less cache, but, as you say, it might also have
> > vector units of reduced width. According to leaked benchmarks, Genoa appears to have an identical AVX-512
> > throughput per core with Sapphire Rapids, which implies two 512-bit FMA units, so if not with a reduced
> > width, the compact core might be simplified at least by having only one FMA unit instead of two.
>
> While I do not have a suggestion for an architecture better than AVX-512 in terms of supporting blocking (it
> seems the anti-SIMD/scalable vector architectures orient toward streaming with little value reuse), efficiently
> exploiting data level parallelism, and supporting a diverse set of reasonable microarchitectures, I think AVX-512's
> flat state explosion makes small implementations problematic. I suspect even an architecture aware of data
> reuse would need to consider lifetime, access pattern, and scale, so a software-exposed storage hierarchy would
> probably not be sufficient. I also suspect that routing factors should be considered; while SIMD simplifies
> data and instruction routing (a single instruction is "broadcast to all functional units" and data forwarding/dependencies
> are common for the entire width), there may be opportunities for simplifying data routing. E.g., local conversion
> of array-of-structures to structure-of-arrays format [which is reminiscent of matrix transposition and probably
> part of a broader class of data rearrangements] could reduce work — compared to multiple strided vector loads
> implemented straightforwardly — and might be cacheable.
>
> I suspect the concepts from "cache oblivious" software have some application to scalable hardware interfaces,
> but I also suspect that a latency (and bandwidth) hierarchy is not sufficient. While microarchitecture
> can compensate for architectural limitations, better interfaces still seem to have value.

< Previous Post in ThreadNext Post in Thread >
TopicPosted ByDate
Yitian 710 anonymous22021/10/20 08:57 PM
  Yitian 710 Adrian2021/10/21 12:20 AM
  Yitian 710 Wilco2021/10/21 03:47 AM
    Yitian 710 Rayla2021/10/21 05:52 AM
      Yitian 710 Wilco2021/10/21 11:59 AM
        Yitian 710 anon22021/10/21 05:16 PM
        Yitian 710 Wilco2022/07/16 12:21 PM
          Yitian 710 Anon2022/07/16 08:22 PM
            Yitian 710 Rayla2022/07/17 09:10 AM
              Yitian 710 Anon2022/07/17 12:04 PM
                Yitian 710 Rayla2022/07/17 12:08 PM
                  Yitian 710 Wilco2022/07/17 01:16 PM
                    Yitian 710 Anon2022/07/17 01:32 PM
                      Yitian 710 Wilco2022/07/17 02:22 PM
                        Yitian 710 Anon2022/07/17 02:47 PM
                          Yitian 710 Wilco2022/07/17 03:50 PM
                            Yitian 710 Anon2022/07/17 08:46 PM
                              Yitian 710 Wilco2022/07/18 03:01 AM
                                Yitian 710 Anon2022/07/19 11:21 AM
                                  Yitian 710 Wilco2022/07/19 06:15 PM
                                    Yitian 710 Anon2022/07/21 01:25 AM
                                      Yitian 710 none2022/07/21 01:49 AM
                                        Yitian 710 Anon2022/07/21 03:03 AM
                                          Yitian 710 none2022/07/21 04:34 AM
                                      Yitian 710 James2022/07/21 02:29 AM
                                        Yitian 710 Anon2022/07/21 03:05 AM
                                      Yitian 710 Wilco2022/07/21 04:31 AM
                                        Yitian 710 Anon2022/07/21 05:17 AM
                                          Yitian 710 Wilco2022/07/21 05:33 AM
                                            Yitian 710 Anon2022/07/21 05:50 AM
                                              Yitian 710 Wilco2022/07/21 06:07 AM
                                                Yitian 710 Anon2022/07/21 06:20 AM
                                                  Yitian 710 Wilco2022/07/21 10:02 AM
                                                    Yitian 710 Anon2022/07/21 10:22 AM
                    Yitian 710 Adrian2022/07/17 11:09 PM
                      Yitian 710 Wilco2022/07/18 01:15 AM
                        Yitian 710 Adrian2022/07/18 02:35 AM
          Yitian 710 Adrian2022/07/16 11:19 PM
            Computations on Big IntegersBill G2022/07/25 10:06 PM
              Computations on Big Integersnone2022/07/25 11:35 PM
                x86 MUL 64x64 Eric Fink2022/07/26 01:06 AM
                  x86 MUL 64x64 Adrian2022/07/26 02:27 AM
                  x86 MUL 64x64 none2022/07/26 02:38 AM
                    x86 MUL 64x64 Jörn Engel2022/07/26 10:17 AM
                      x86 MUL 64x64 Linus Torvalds2022/07/27 10:13 AM
                        x86 MUL 64x64 2022/07/28 09:40 AM
                        x86 MUL 64x64 Jörn Engel2022/07/28 10:18 AM
                          More than 3 registers per instruction-.-2022/07/28 07:01 PM
                            More than 3 registers per instructionAnon2022/07/28 10:39 PM
                            More than 3 registers per instructionJörn Engel2022/07/28 10:42 PM
                              More than 3 registers per instruction-.-2022/07/29 04:31 AM
                Computations on Big IntegersBill G2022/07/26 01:40 AM
                  Computations on Big Integersnone2022/07/26 02:17 AM
                    Computations on Big IntegersBill G2022/07/26 03:52 AM
                    Computations on Big Integers---2022/07/26 09:57 AM
                  Computations on Big IntegersAdrian2022/07/26 02:53 AM
                    Computations on Big IntegersBill G2022/07/26 03:39 AM
                      Computations on Big IntegersAdrian2022/07/26 04:21 AM
                    Computations on Big Integers in Apple AMX UnitsBill G2022/07/26 04:28 AM
                      Computations on Big Integers in Apple AMX UnitsAdrian2022/07/26 05:13 AM
                        TypoAdrian2022/07/26 05:20 AM
                          IEEE binary64 is 53 bits rather than 52. (NT)Michael S2022/07/26 05:34 AM
                            IEEE binary64 is 53 bits rather than 52.Adrian2022/07/26 07:32 AM
                              IEEE binary64 is 53 bits rather than 52.Michael S2022/07/26 10:02 AM
                                IEEE binary64 is 53 bits rather than 52.Adrian2022/07/27 06:58 AM
                                  IEEE binary64 is 53 bits rather than 52.none2022/07/27 07:14 AM
                                    IEEE binary64 is 53 bits rather than 52.Adrian2022/07/27 07:55 AM
                                      Thanks a lot for the link to the article! (NT)none2022/07/27 08:09 AM
                          TypozArchJon2022/07/26 09:51 AM
                            TypoMichael S2022/07/26 10:25 AM
                              TypozArchJon2022/07/26 11:52 AM
                                TypoMichael S2022/07/26 01:02 PM
                    Computations on Big IntegersMichael S2022/07/26 05:55 AM
                      Computations on Big IntegersAdrian2022/07/26 07:59 AM
                        IFMA and DivisionBill G2022/07/26 04:25 PM
                          IFMA and Divisionrwessel2022/07/26 08:16 PM
                          IFMA and DivisionAdrian2022/07/27 07:25 AM
                      Computations on Big Integersnone2022/07/27 01:22 AM
                    Big integer multiplication with vector IFMABill G2022/07/29 01:06 AM
                      Big integer multiplication with vector IFMAAdrian2022/07/29 01:35 AM
                        Big integer multiplication with vector IFMA-.-2022/07/29 04:32 AM
                          Big integer multiplication with vector IFMAAdrian2022/07/29 09:47 PM
                            Big integer multiplication with vector IFMAAnon2022/07/30 08:12 AM
                              Big integer multiplication with vector IFMAAdrian2022/07/30 09:27 AM
                                AVX-512 unfriendly to heter-performance coresPaul A. Clayton2022/07/31 03:20 PM
                                  AVX-512 unfriendly to heter-performance coresAnon2022/07/31 03:33 PM
                                    AVX-512 unfriendly to heter-performance coresanonymou52022/07/31 05:03 PM
                                  AVX-512 unfriendly to heter-performance coresBrett2022/07/31 07:26 PM
                                  AVX-512 unfriendly to heter-performance coresAdrian2022/08/01 01:45 AM
                                    Why can't E-cores have narrow/slow AVX-512? (NT)anonymous22022/08/01 03:37 PM
                                      Why can't E-cores have narrow/slow AVX-512?Ivan2022/08/02 12:09 AM
                                        Why can't E-cores have narrow/slow AVX-512?anonymou52022/08/02 10:13 AM
                                        Why can't E-cores have narrow/slow AVX-512?Dummond D. Slow2022/08/02 03:02 PM
                                    AVX-512 unfriendly to heter-performance coresPaul A. Clayton2022/08/02 01:19 PM
                                      AVX-512 unfriendly to heter-performance coresAnon2022/08/02 09:09 PM
                                      AVX-512 unfriendly to heter-performance coresAdrian2022/08/03 12:50 AM
                                        AVX-512 unfriendly to heter-performance coresAnon2022/08/03 09:15 AM
                                          AVX-512 unfriendly to heter-performance cores-.-2022/08/03 08:17 PM
                                            AVX-512 unfriendly to heter-performance coresAnon2022/08/03 09:02 PM
                        IFMA: empty promises from Intel as usualKent R2022/07/29 07:15 PM
                          No hype lasts foreverAnon2022/07/30 08:06 AM
                        Big integer multiplication with vector IFMAme2022/07/30 09:15 AM
                Computations on Big Integers---2022/07/26 09:48 AM
                  Computations on Big Integersnone2022/07/27 01:10 AM
                    Computations on Big Integers---2022/07/28 11:43 AM
                      Computations on Big Integers---2022/07/28 06:44 PM
              Computations on Big Integersdmcq2022/07/26 02:27 PM
                Computations on Big IntegersAdrian2022/07/27 08:15 AM
                  Computations on Big IntegersBrett2022/07/27 11:07 AM
      Yitian 710 Wes Felter2021/10/21 12:51 PM
        Yitian 710 Adrian2021/10/21 01:25 PM
    Yitian 710 Anon2021/10/21 06:08 AM
      Strange definition of the word single. (NT)anon22021/10/21 05:00 PM
        AMD Epyc uses chiplets. This is why "strange"?Mark Roulo2021/10/21 05:08 PM
          AMD Epyc uses chiplets. This is why "strange"?anon22021/10/21 05:34 PM
            Yeah. Blame spec.org, too, though!Mark Roulo2021/10/21 05:58 PM
              Yeah. Blame spec.org, too, though!anon22021/10/21 08:07 PM
                Yeah. Blame spec.org, too, though!Björn Ragnar Björnsson2022/07/17 06:23 AM
              Yeah. Blame spec.org, too, though!Rayla2022/07/17 09:13 AM
                Yeah. Blame spec.org, too, though!Anon2022/07/17 12:01 PM
Reply to this Topic
Name:
Email:
Topic:
Body: No Text
How do you spell tangerine? 🍊