By: ⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com), July 28, 2022 9:40 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 27, 2022 10:13 am wrote:
> Jörn Engel (joern.delete@this.purestorage.com) on July 26, 2022 10:17 am wrote:
> >
> > Note that compilers do a surprisingly poor job here, at least until recently.
>
> I think the "source in %rdx" is somewhat unusual (normally %rax is the special register,
> with obviously %cl for shift counts, and %rdx:%rax being special for the old multiply),
> and most x86 compilers end up having been tuned for different register use.
>
> And gcc in particular tends to want to use fixed register pairs even when the instructions
> don't require it, so if you do 128-bit math - which you obviously are doing if you're using
> 'mulx' - gcc often wants to pair up %rax/%rdx, with %rdx being the high word.
>
> So even when the hardware doesn't have any particular register pairing preferences, gcc
> definitely does, and then uses odd stack spills etc as a way to move things around.
>
> I don't know why mulx does that unusual source, but I assume that Intel did some example loops
> and that it ends up working better when you get it right (possibly exactly because other ops
> want to use %rax for its special use - including that regular old-fashioned 'mul').
Just a note/idea that came to my mind while reading your post: if a CPU can execute 2+ register moves per clock and 1 ALU instruction per clock (3+ operations per clock in total), then the operands of all ALU instructions can be implicit/fixed registers _without causing_ a major performance degradation. Some random examples:

MOV ...; ADD; MOV ... // The ADD is always %r5, %flags2 := %r3 + %r4
MOV ...; MUL; MOV ... // The MUL is always %r8_%r3 := %r10 * %r0

where %flags2 is a flag register (this totally hypothetical CPU has multiple flag registers). The MUL instruction has no %flagsN destination register because the widening MUL cannot overflow.
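For comparison, the real x86 counterpart of such a widening multiply is what the quoted mulx discussion is about. A minimal sketch in C (assuming gcc or clang targeting x86-64 with -mbmi2; the compiler may lower the 64x64->128 multiply to mulx, which reads one source implicitly from %rdx, writes the high/low halves to two freely chosen registers, and leaves the flags untouched):

#include <stdint.h>

/* 64x64 -> 128-bit multiply; with BMI2 enabled this may be compiled
 * to a single mulx instruction. */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    *lo = (uint64_t)p;
}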
It is likely that in such an instruction set architecture a single MOV instruction would encode multiple register moves, for example "MOV %r10 := %r3, %r6 := %r13, %r3 := %flags2", where %r3, %r13, %flags2 are read atomically and %r10, %r6, %r3 are _then_ written atomically (that is: the combination of %r10 := %r3 and %r3 := %flags2 is neither a write-after-read hazard nor a read-after-write hazard within the instruction).
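To make the intended semantics concrete, here is a toy C model of this purely hypothetical machine (register names and behaviors are the ones from the examples above, nothing real): the ALU ops use fixed operand registers, and the multi-move MOV reads all of its sources before writing any destination, which is why %r10 := %r3 and %r3 := %flags2 can coexist in one instruction.

#include <stdint.h>
#include <stdio.h>

/* Register file of the hypothetical machine from the examples above. */
enum { R0, R3, R4, R5, R6, R8, R10, R13, FLAGS2, NREGS };
typedef struct { uint64_t r[NREGS]; } machine;

/* Fixed-operand ADD: always %r5, %flags2 := %r3 + %r4
 * (%flags2 here just records the carry out of the 64-bit add). */
static void op_add(machine *m)
{
    uint64_t a = m->r[R3], b = m->r[R4];
    m->r[R5] = a + b;
    m->r[FLAGS2] = (m->r[R5] < a);
}

/* Multi-move MOV: all sources are read first, then all destinations
 * are written, so the components of one MOV cannot hazard against
 * each other. */
static void op_mov3(machine *m, int d0, int s0, int d1, int s1, int d2, int s2)
{
    uint64_t v0 = m->r[s0], v1 = m->r[s1], v2 = m->r[s2]; /* read phase  */
    m->r[d0] = v0; m->r[d1] = v1; m->r[d2] = v2;          /* write phase */
}

int main(void)
{
    machine m = {0};

    /* MOV %r3 := 7, %r4 := UINT64_MAX (operands for the fixed ADD) */
    m.r[R3] = 7; m.r[R4] = UINT64_MAX;
    op_add(&m);                      /* %r5 = 6, %flags2 = 1 (carry) */

    /* MOV %r10 := %r3, %r6 := %r13, %r3 := %flags2 */
    m.r[R13] = 9;
    op_mov3(&m, R10, R3, R6, R13, R3, FLAGS2);

    /* prints: r5=6 flags2=1 r10=7 r6=9 r3=1 */
    printf("r5=%llu flags2=%llu r10=%llu r6=%llu r3=%llu\n",
           (unsigned long long)m.r[R5], (unsigned long long)m.r[FLAGS2],
           (unsigned long long)m.r[R10], (unsigned long long)m.r[R6],
           (unsigned long long)m.r[R3]);
    return 0;
}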
(Because you are usually overly critical of ideas that do not match your worldview, I am forced to note that the above paragraphs are just an idea. If the idea happens to mismatch your worldview, there is no need to start criticizing how "fundamentally bad" it is, should you decide to write a response to my post. Thanks.)
-atom