Article: Knights Landing CPU Speculation
By: Sylvain Collange (full.name.delete.delete@this.this.gmail.com), November 24, 2013 9:28 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on November 24, 2013 7:06 am wrote:
> Integer instructions are the smaller part of the problem. The bigger part is memory
> instructions.
> In my experience, for a typical linear algebra algorithm with 32 sw-visible registers
> it's pretty hard to reduce the number of memory accesses per FMA below 0.7-0.8.
> And in that regard linear algebra is easier than most.
Agreed.
> I am not sure that Fermi/Kepler reference is relevant in discussion
> of KNL. I am sorry that I did it myself in a previous post.
Apart from the instruction set and the SIMT/SIMD differences, aren't they similar architectures targeting the same market?
> Certainly not for KNC-style core, where load and OP are separate pipeline operations.
> For a Bonnell-style core, with its CISC (or, if you want, TI TMS320C30/C40-style) load+op pipeline - maybe.
> But the resulting core wouldn't resemble Bonnell/Saltwell, even less so Silvermont.
I thought KNC handled load+op in the same pipeline. If so, I do not understand the distinction from Bonnell.
> > I am much more skeptical about out-of-order execution of a fully mask-predicated instruction set...
>
> You mean, too many register inputs per uOP?
> I haven't looked at AVX-512 in sufficient detail. How many register inputs will be needed per FMA?
More than the number of inputs, the problem is partial dependencies that prevent proper register renaming. Consider pseudo-code such as:
1: add r1{k1}, r0, r2
2: sub r1{k2}, r0, r3
If masks k1 and k2 have any chance of overlapping, then the result of instruction 2 depends on the result of instruction 1 (needs to read-modify-write r1).
A straightforward OoO implementation would assume a data-dependency and schedule instruction 2 once the result of instruction 1 is available. This is much worse than running in-order and merging during write-back.
Of course, in most cases k1 and k2 won't overlap (this is typically an if-then-else that went through if-conversion).
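To make the read-modify-write concrete, here is a minimal Python sketch of merge-masking semantics for the two instructions above. The 8-lane width, the helper name `masked_op` and the register values are my own assumptions for illustration, not anything from the AVX-512 spec. It only models the architectural result, not a pipeline.

```python
LANES = 8  # hypothetical vector width, chosen only to keep the example short


def masked_op(op, dest, mask, a, b):
    """Merge-masking: lanes whose mask bit is set receive op(a, b);
    all other lanes keep the previous contents of dest."""
    return [op(x, y) if m else d for d, m, x, y in zip(dest, mask, a, b)]


def run(r1, r0, r2, r3, k1, k2):
    # 1: add r1{k1}, r0, r2
    r1 = masked_op(lambda x, y: x + y, r1, k1, r0, r2)
    # 2: sub r1{k2}, r0, r3  (reads the r1 produced by instruction 1)
    r1 = masked_op(lambda x, y: x - y, r1, k2, r0, r3)
    return r1


r0 = [10] * LANES
r2 = [1] * LANES
r3 = [2] * LANES
old_r1 = [0] * LANES

# If-converted if-then-else: the two masks are complementary.
k_then = [1, 1, 1, 1, 0, 0, 0, 0]
k_else = [0, 0, 0, 0, 1, 1, 1, 1]
disjoint = run(old_r1, r0, r2, r3, k_then, k_else)

# Overlapping masks: instruction 2 overwrites lanes written by instruction 1.
k_all = [1] * LANES
overlapping = run(old_r1, r0, r2, r3, k_then, k_all)
```

In both cases instruction 2 has to see the r1 left by instruction 1 to produce the full register, and the scheduler cannot tell the two cases apart without knowing the mask values.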
This is similar to the problem of mixing AVX and SSE instructions, or 16-bit and 32-bit x86 code. Legacy instructions that update only the lower part of a register create partial dependencies. At least on Ivy Bridge, the price for forgetting a vzeroupper is a ~50-cycle penalty...
Except it gets much worse with masked instructions, since the register subset that an instruction updates is unknown at schedule time, instead of being nicely encoded in the instruction...
I am really curious to know how Skylake (or KNL, if it is OoO) intends to handle this.
I guess the easiest way would be to crack every predicated instruction into 2 µops:
add tmp, r0, r2
merge r1, k1, r1, tmp
sub tmp, r0, r3
merge r1, k2, r1, tmp
But this is not so cheap, and merges still need to be done in-order.
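A rough Python sketch of that cracking, under the assumption that renaming gives each `tmp` its own physical register (called `tmp_a` and `tmp_b` here, names of my own). The helper functions and the 4-lane width are hypothetical too. It shows that the two compute µops do not depend on r1 or on each other, while the two merges form a chain through r1.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def merge(old, mask, new):
    """Select new where the mask bit is set, otherwise keep old."""
    return [n if m else o for o, m, n in zip(old, mask, new)]


def cracked(r1, r0, r2, r3, k1, k2):
    # Compute µops: no dependency on r1 or on each other,
    # so an OoO core could issue them in either order.
    tmp_b = sub(r0, r3)          # sub tmp, r0, r3
    tmp_a = add(r0, r2)          # add tmp, r0, r2
    # Merge µops: each reads the previous r1, so they stay in program order.
    r1 = merge(r1, k1, tmp_a)    # merge r1, k1, r1, tmp
    r1 = merge(r1, k2, tmp_b)    # merge r1, k2, r1, tmp
    return r1


r0 = [10, 10, 10, 10]
r2 = [1, 1, 1, 1]
r3 = [2, 2, 2, 2]
k1 = [1, 1, 0, 0]
k2 = [0, 1, 1, 0]  # overlaps k1 in lane 1
cracked_result = cracked([0, 0, 0, 0], r0, r2, r3, k1, k2)
```

Lane 1 ends up with the sub result because the second merge runs last, which is why the merges cannot be reordered even though the arithmetic can.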