3-TFlops-DGEMM

Article: Knights Landing CPU Speculation
By: Sylvain Collange (full.name.delete.delete@this.this.gmail.com), November 24, 2013 9:28 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on November 24, 2013 7:06 am wrote:
> Integer instructions are the smaller part of the problem. The bigger part are memory
> instructions.
> In my experience, for a typical linear algebra algorithm with 32 sw visible registers
> it's pretty hard to reduce the number of memory accesses per FMA below 0.7-0.8.
> And in that regard linear algebra is easier than most.

Agreed.

> I am not sure that Fermi/Kepler reference is relevant in discussion
> of KNL. I am sorry that I did it myself in a previous post.

Besides the instruction set and SIMT/SIMD differences, aren't they close architectures targeted at the same market?

> Certainly not for a KNC-style core, where load and OP are separate pipeline operations.
> For a Bonnell-style core, with its CISC (or, if you want, TI TMS320C30/C40 -style) load+op pipeline - maybe.
> But the resulting core wouldn't resemble Bonnell/Saltwell, even less so Silvermont.

I thought KNC handled load+op in the same pipeline. If so, I do not understand the distinction from Bonnell.

> > I am much more skeptical about out-of-order execution of a fully mask-predicated instruction set...
>
> You mean, too many register inputs per uOP?
> I didn't look at AVX-512 in sufficient detail. How many register inputs will be needed per FMA?

More than the number of inputs, the problem is partial dependencies that prevent proper register renaming. Consider pseudo-code such as:
1: add r1{k1}, r0, r2
2: sub r1{k2}, r0, r3

If masks k1 and k2 have any chance of overlapping, then the result of instruction 2 depends on the result of instruction 1 (needs to read-modify-write r1).
A straightforward OoO implementation would assume a data-dependency and schedule instruction 2 once the result of instruction 1 is available. This is much worse than running in-order and merging during write-back.
Of course, in most cases k1 and k2 won't overlap (this is typically an if-then-else that went through if-conversion).
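
To make the dependency concrete, here is a small Python sketch of merge-masking. It is only a lane-wise model, not real AVX-512 semantics: the 8-lane width, the `masked_op` and `run` helpers and the register values are all made up for illustration.

```python
from operator import add, sub

LANES = 8  # hypothetical vector width, chosen only for illustration


def masked_op(dst, mask, a, b, op):
    """Merge-masked lane-wise op: lanes whose mask bit is set receive
    op(a, b); all other lanes keep the old value of dst."""
    return [op(a[i], b[i]) if (mask >> i) & 1 else dst[i]
            for i in range(LANES)]


def run(k1, k2, swap=False):
    """Execute the two-instruction sequence on made-up register contents.
    With swap=True the two instructions are executed in reverse order."""
    r0 = [10] * LANES
    r1 = [0] * LANES
    r2 = list(range(LANES))
    r3 = list(range(LANES))

    def insn1(old_r1):  # 1: add r1{k1}, r0, r2
        return masked_op(old_r1, k1, r0, r2, add)

    def insn2(old_r1):  # 2: sub r1{k2}, r0, r3
        return masked_op(old_r1, k2, r0, r3, sub)

    first, second = (insn2, insn1) if swap else (insn1, insn2)
    # The second instruction takes the first one's result as its old r1:
    # this is the read-modify-write dependency.
    return second(first(r1))


# Disjoint masks (the if-converted if-then-else case): order is irrelevant.
print(run(0b00001111, 0b11110000) == run(0b00001111, 0b11110000, swap=True))
# Overlapping masks: program order matters in the shared lanes.
print(run(0b00001111, 0b00111100) == run(0b00001111, 0b00111100, swap=True))
```

Either way the second instruction has to read the r1 produced by the first one. With disjoint masks the final value does not depend on the order, but the hardware cannot know that without looking at the mask values.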

This is similar to the problem of mixing AVX and SSE instructions, or 16-bit and 32-bit x86 code. Legacy instructions that update only the lower part of a register create partial dependencies. At least on Ivy Bridge, the price for forgetting a vzeroupper is a ~50-cycle penalty...
Except it gets much worse with masked instructions, since the register subset that an instruction updates is unknown at schedule time, instead of being nicely encoded in the instruction...

I am really curious to know how Skylake (or KNL, if it is OoO) intends to handle this.
I guess the easiest way would be to crack every predicated instruction into 2 µops:
add tmp, r0, r2
merge r1, k1, r1, tmp
sub tmp, r0, r3
merge r1, k2, r1, tmp
But this is not so cheap, and merges still need to be done in-order.
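
A rough Python model of this cracking, under the same kind of assumptions as any toy model: the 8-lane width, the `alu`, `merge`, `cracked` and `reference` helpers and the register values are hypothetical, and the two temporaries `tmp_a`/`tmp_b` stand in for what renaming would make of `tmp`. The point is only that the ALU µops do not read r1 and so can run in any order, while the merges are applied in program order.

```python
from operator import add, sub

LANES = 8  # hypothetical vector width, chosen only for illustration


def alu(a, b, op):
    """Unmasked lane-wise ALU µop: does not read the destination register."""
    return [op(x, y) for x, y in zip(a, b)]


def merge(old, mask, new):
    """Merge µop: take 'new' in lanes whose mask bit is set, 'old' elsewhere."""
    return [new[i] if (mask >> i) & 1 else old[i] for i in range(LANES)]


def cracked(r0, r1, r2, r3, k1, k2):
    # ALU µops have no dependency on r1, so they may execute out of order
    # (here the sub is deliberately computed before the add).
    tmp_b = alu(r0, r3, sub)   # sub tmp, r0, r3
    tmp_a = alu(r0, r2, add)   # add tmp, r0, r2
    # Merge µops still form a chain through r1 and run in program order.
    r1 = merge(r1, k1, tmp_a)  # merge r1, k1, r1, tmp
    r1 = merge(r1, k2, tmp_b)  # merge r1, k2, r1, tmp
    return r1


def reference(r0, r1, r2, r3, k1, k2):
    """In-order execution of the two original masked instructions."""
    r1 = [r0[i] + r2[i] if (k1 >> i) & 1 else r1[i] for i in range(LANES)]
    r1 = [r0[i] - r3[i] if (k2 >> i) & 1 else r1[i] for i in range(LANES)]
    return r1


regs = ([10] * LANES, [0] * LANES, list(range(LANES)), list(range(LANES)))
# Overlapping masks, to check that the in-order merges preserve semantics.
print(cracked(*regs, 0b00001111, 0b00111100) ==
      reference(*regs, 0b00001111, 0b00111100))
```

The dependency chain does not disappear, it just shrinks to the merge µops, which is why the cost is an extra µop per instruction rather than a serialized ALU operation.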