By: Adrian (a.delete@this.acm.org), August 1, 2022 1:45 am
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on July 31, 2022 3:20 pm wrote:
>
> While such ISA compatibility has significant benefits, I suspect AVX-512 is expensive to implement in a small
> core because of the large architected state (32 64-byte registers plus mask registers). While a "low performance"
> core would not have to have as much register renaming and would presumably have fewer execution resources, the
> large state still feels burdensome. Some resources could be shared between cores in a Bulldozer-like design.
> Some resources might shared between threads (if multithreading is supported in a low performance core). GPU-style
> wider register access than execution and traditional banking could reduce the port count and thus area, but
> a 8 KiB single-ported register file would still seem to be kind of chunky. With sharing across cores, if AVX-512
> is lightly used by both cores or heavily used by one, the active state could be smaller.
>
> Narrower execution and sharing across threads would tend to reduce
> the need for rename registers by hiding some latency.
While 2 kB is not a small size for the architecturally-visible registers, I cannot believe that this size can cause any serious implementation difficulties in small cores.
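To make the 2 kB figure concrete, here is a minimal sketch of the arithmetic. The register counts and widths are the architectural ones (32 ZMM registers of 64 bytes, 8 mask registers of up to 64 bits); the function name and the 512 B comparison value are only illustrative.

```python
# Architectural (not renamed) AVX-512 state per hardware thread.
ZMM_COUNT = 32    # zmm0..zmm31
ZMM_BYTES = 64    # 512 bits each
MASK_COUNT = 8    # k0..k7
MASK_BYTES = 8    # masks are up to 64 bits wide


def avx512_arch_state_bytes():
    """Return (vector bytes, mask bytes, total bytes) of architectural state."""
    vector = ZMM_COUNT * ZMM_BYTES
    masks = MASK_COUNT * MASK_BYTES
    return vector, masks, vector + masks


vector, masks, total = avx512_arch_state_bytes()
print(vector, masks, total)           # 2048 64 2112

# Ratio to a 512 B register file, the example size used for old RISC CPUs.
OLD_RISC_FILE_BYTES = 512
print(vector // OLD_RISC_FILE_BYTES)  # 4
```

The mask registers add only 64 bytes, so the vector registers account for almost all of the state.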
Already 40 years ago, there were RISC CPUs whose architectural register file was only a few times smaller, e.g. 512 B, even though they used millions of times fewer transistors per CPU core.
A modern CPU has more register file ports, but AVX-512 does not need more ports than ISAs with narrower registers. This is one of the main advantages of wider registers: increasing the register width increases the throughput without requiring additional register file ports.
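A toy model of the port argument, as a rough sketch only: it assumes every operation reads a fixed number of source registers and that read ports are the only limit, which ignores everything else in a real core. The port and operand counts below are made-up example values.

```python
def bytes_per_cycle(read_ports, operands_per_op, width_bytes):
    """Bytes of results produced per cycle when read ports are the only limit."""
    ops_per_cycle = read_ports // operands_per_op
    return ops_per_cycle * width_bytes


# Hypothetical core: 4 read ports, 2 source operands per operation.
for width in (16, 32, 64):  # 128-, 256- and 512-bit registers
    print(width, bytes_per_cycle(4, 2, width))
```

With the port count held constant, the throughput in this model scales linearly with the register width.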
Moreover, all the resources related to register renaming and to any other parts of the CPU control depend on the number of registers, i.e. 32, and not on their width.
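A small sketch of why the renaming structures follow the register count rather than the width. It assumes a simple register alias table with one physical-register tag per architectural register; the physical register count of 48 is a hypothetical value for a small core, not a figure from any real design.

```python
def rename_map_bits(arch_regs, phys_regs):
    """Size of a register alias table: one physical tag per architectural register."""
    tag_bits = (phys_regs - 1).bit_length()  # bits needed to name any physical register
    return arch_regs * tag_bits


def physical_file_bytes(phys_regs, width_bytes):
    """Storage of the physical register file itself."""
    return phys_regs * width_bytes


ARCH_REGS = 32
PHYS_REGS = 48  # hypothetical small-core value

for width in (16, 32, 64):
    print(width,
          rename_map_bits(ARCH_REGS, PHYS_REGS),
          physical_file_bytes(PHYS_REGS, width))
```

Only the storage of the register file grows with the width; the map table, and by the same reasoning the dependency-tracking logic, stays the same size.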
32 registers for the FPU, besides the general-purpose registers, have been used in a very large number of CPUs for many decades, most of them being many orders of magnitude less complex than the smallest cores of today, such as Cortex-A510.
So I do not buy the argument that implementing 32 wide registers can cause any difficulties. There are many other features which influence the core complexity much more, and they can be removed from a small core without affecting ISA compatibility and the number and size of the visible registers.
It is now too late to change this, but if Intel had ever believed that the size of the registers matters, it would have been very easy to provide a couple of CPUID bits reporting the implemented register width, e.g. 128/256/512, allowing the software to use the corresponding subset of AVX-512 instructions, which would have been much better than falling back to AVX.
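As a sketch of how software might have used such width bits: everything here is hypothetical, including the idea that the width is handed to the program as a plain integer, the function name, and the kernel table. It only illustrates picking the widest code path that the reported width allows.

```python
SUPPORTED_WIDTHS = (128, 256, 512)  # the widths named above


def pick_kernel_width(implemented_width_bits, kernels):
    """Return the widest kernel width that fits the reported register width.

    `kernels` maps a vector width in bits to some implementation.
    Returns None when no kernel fits.
    """
    if implemented_width_bits not in SUPPORTED_WIDTHS:
        raise ValueError("unexpected register width: %r" % implemented_width_bits)
    candidates = [w for w in kernels if w <= implemented_width_bits]
    return max(candidates) if candidates else None


# Hypothetical kernel table; the strings stand in for real code paths.
kernels = {128: "128-bit path", 256: "256-bit path", 512: "512-bit path"}
print(pick_kernel_width(256, kernels))  # 256
```

The point of the sketch is that all three paths could use the same AVX-512 encodings, masks and 32 registers, with only the vector length differing.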
I think that their actions (e.g. the backporting of some instructions from the AVX-512 encoding to the AVX encoding, and earlier of some from the AVX encoding to the SSE encoding) show that the main problem with implementing AVX-512 is extending the instruction decoders to support another set of encoding formats, not implementing more or wider registers.