By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), July 31, 2022 3:20 pm
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on July 30, 2022 9:27 am wrote:
[snip]
> AMD has stated clearly that there will be no ISA difference between the compact core and the big core,
> so it must also support AVX-512. This was obviously meant to contrast with their competition, who
> always had such differences, e.g. Denverton vs. Skylake Server (launched at the same time).
While such ISA compatibility has significant benefits, I suspect AVX-512 is expensive to implement in a small core because of the large architected state (32 64-byte registers plus mask registers). While a "low performance" core would not need as much register renaming and would presumably have fewer execution resources, the large state still feels burdensome. Some resources could be shared between cores in a Bulldozer-like design. Some resources might be shared between threads (if multithreading is supported in a low performance core). GPU-style register access that is wider than execution, combined with traditional banking, could reduce the port count and thus area, but an 8 KiB single-ported register file would still seem to be kind of chunky. With sharing across cores, if AVX-512 is lightly used by both cores or heavily used by one, the active state could be smaller.
Narrower execution and sharing across threads would tend to reduce the need for rename registers by hiding some latency.
(I suppose, if full AVX-512 use is expected to be uncommon for small cores, it might be practical to steal storage from L1 cache to provide enough state for correct operation when more than eight or so registers are used. Even AVX2's 16 32-byte registers seem large for a core oriented toward size. In terms of energy/power, power gating may minimize the effect of more architected state when the state is unused.)
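For scale, here is a rough arithmetic sketch of the architected state sizes mentioned above. The helper name `state_bytes` is made up for illustration, and the mask registers are counted as eight 64-bit registers (k0 to k7), which is an assumption about how one chooses to tally them.

```python
# Back-of-envelope architected SIMD state per hardware thread, in bytes.
# Register counts and widths are the 64-bit-mode architectural values.

def state_bytes(num_regs, reg_bytes, num_masks=0, mask_bytes=0):
    """Vector register state plus optional mask register state (hypothetical helper)."""
    return num_regs * reg_bytes + num_masks * mask_bytes

sse = state_bytes(16, 16)            # xmm0-xmm15, 16 bytes each
avx2 = state_bytes(16, 32)           # ymm0-ymm15, 32 bytes each
avx512 = state_bytes(32, 64, 8, 8)   # zmm0-zmm31 plus k0-k7 as 64-bit masks

# Number of 64-byte entries that fit in an 8 KiB register file.
entries_in_8kib = (8 * 1024) // 64

print(sse, avx2, avx512, entries_in_8kib)
```

Under these assumptions the architected AVX-512 state is a little over 2 KiB, four times that of AVX2, and an 8 KiB file would hold 128 64-byte entries. How those entries would be divided among threads, cores, or rename registers is a separate design question.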
Another consideration is whether smaller cores are designed for throughput (where wider execution would be desirable) or for power/area. SIMD-friendly workloads would benefit less from out-of-order execution and other area/power-expensive features, so one could imagine a smaller core providing relatively high performance for such workloads but not performing as well as a great-big-OoO design for "general purpose" workloads.
> For now, it is certain that the compact core will have less cache, but, as you say, it might also have
> vector units of reduced width. According to leaked benchmarks, Genoa appears to have an identical AVX-512
> throughput per core with Sapphire Rapids, which implies two 512-bit FMA units, so if not with a reduced
> width, the compact core might be simplified at least by having only one FMA unit instead of two.
While I do not have a suggestion for an architecture better than AVX-512 in terms of supporting blocking (it seems the anti-SIMD/scalable vector architectures orient toward streaming with little value reuse), efficiently exploiting data-level parallelism, and supporting a diverse set of reasonable microarchitectures, I think AVX-512's flat state explosion makes small implementations problematic. I suspect even an architecture aware of data reuse would need to consider lifetime, access pattern, and scale, so a software-exposed storage hierarchy would probably not be sufficient. I also suspect that routing factors should be considered; while SIMD simplifies data and instruction routing (a single instruction is "broadcast to all functional units" and data forwarding/dependencies are common for the entire width), there may be opportunities for simplifying data routing. E.g., local conversion of array-of-structures to structure-of-arrays format [which is reminiscent of matrix transposition and probably part of a broader class of data rearrangements] could reduce work (compared to multiple strided vector loads implemented straightforwardly) and might be cacheable.
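To make the array-of-structures to structure-of-arrays point concrete, here is a small software-level sketch. It only illustrates that the conversion is a transposition and that it yields the same result as one strided pass per field; it says nothing about how hardware would implement either. The function names and the three-field record layout are invented for the example.

```python
# Array-of-structures: each element is one (x, y, z) record.
aos = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0), (10.0, 11.0, 12.0)]

def aos_to_soa_strided(records):
    """One strided pass over the flattened memory image per field.

    Assumes a non-empty list of equal-length records.
    """
    flat = [value for record in records for value in record]
    width = len(records[0])
    # Field f lives at offsets f, f + width, f + 2*width, ...
    return [flat[f::width] for f in range(width)]

def aos_to_soa_transpose(records):
    """Treat the records as matrix rows and transpose them in a single pass."""
    return [list(column) for column in zip(*records)]

xs, ys, zs = aos_to_soa_transpose(aos)
print(xs, ys, zs)
```

The strided version touches the whole record stream once per field, while the transpose reads each record once, which is the kind of work reduction suggested above.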
I suspect the concepts from "cache oblivious" software have some application to scalable hardware interfaces, but I also suspect that a latency (and bandwidth) hierarchy is not sufficient. While microarchitecture can compensate for architectural limitations, better interfaces still seem to have value.