Article: Knights Landing CPU Speculation
By: Michael S (already5chosen.delete@this.yahoo.com), November 24, 2013 6:06 am
Room: Moderated Discussions
Sylvain Collange (firstname.lastname.delete@this.gmail.com) on November 24, 2013 3:37 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on November 21, 2013 6:48 am wrote:
> > 2 FPUs on 2-issue core? That's silly. 2-issue is barely enough to keep one FPU reasonably busy.
> That is certainly true for most scalar workloads, but vector-intensive code can easily saturate a single
> vector unit. An FMA pipeline typically runs SIMD integer instructions in addition to FP instructions.

Integer instructions are the smaller part of the problem. The bigger part is memory instructions.
In my experience, for a typical linear algebra algorithm with 32 software-visible registers, it's pretty hard to reduce the number of memory accesses per FMA below 0.7-0.8.
And in that regard linear algebra is easier than most.
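
A rough sketch of how such a ratio is counted, for three textbook kernels (dot product, axpy, and a row-blocked matrix-vector product). The function names and the counting model are assumptions made for illustration: one memory operation per operand that is not held in a register, no cache effects, no loop overhead. It is not meant to reproduce the 0.7-0.8 figure, only to show where the memory operations come from.

```python
from fractions import Fraction

def dot_mem_per_fma():
    # acc += x[i] * y[i]
    # Both operands come from memory; the accumulator stays in a register.
    return Fraction(2, 1)

def axpy_mem_per_fma():
    # y[i] += a * x[i]
    # Load x[i], load y[i], store y[i]; the scalar a stays in a register.
    return Fraction(3, 1)

def gemv_mem_per_fma(rows_blocked):
    # y[r] += A[r][j] * x[j], with rows_blocked accumulators y[r] in registers.
    # One load of A[r][j] per FMA, plus one load of x[j] shared by the block.
    return 1 + Fraction(1, rows_blocked)

for name, ratio in [("dot", dot_mem_per_fma()),
                    ("axpy", axpy_mem_per_fma()),
                    ("gemv, 4 rows blocked", gemv_mem_per_fma(4))]:
    print(name, float(ratio))
```

In this model, all three kernels need at least one memory operation per FMA, whatever the register count.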

> In SPMD-style code such as OpenCL, every variable is a vector unless the compiler can prove it holds the same
> value for all threads of a warp. Even assuming an omniscient compiler, scalar instructions only represent
> about 30% of the instruction mix, and less with aggressive unrolling. Thus vector performance matters.
> Fermi and Kepler already have 2 FMAs for each scheduler, and can
> sustain the peak issue rate on a 100% FMA instruction mix.

I am not sure the Fermi/Kepler reference is relevant in a discussion of KNL. I am sorry that I brought it up myself in a previous post.

> A 2-issue core with dual-FMA is the most sensible option in my opinion.

Certainly not for a KNC-style core, where load and OP are separate pipeline operations.
For a Bonnell-style core, with its CISC (or, if you want, TI TMS320C30/C40-style) load+op pipeline, maybe.
But the resulting core wouldn't resemble Bonnell/Saltwell, even less so Silvermont.
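
A back-of-the-envelope sketch of the issue-slot argument, under simplifying assumptions that are not taken from any real pipeline: every instruction takes one issue slot, a load that is not folded into an FMA needs a slot of its own, and a load+op FMA can fold at most one memory operand. The function name and parameters are hypothetical; 0.75 is just the midpoint of the 0.7-0.8 range mentioned earlier.

```python
def fma_per_cycle(issue_width, mem_per_fma, load_op_fused):
    """Upper bound on FMAs issued per cycle in a simple slot-counting model."""
    if load_op_fused:
        # Each FMA absorbs at most one memory operand; any excess needs its own slot.
        separate_mem_ops = max(0.0, mem_per_fma - 1.0)
    else:
        # Every memory access is a separate instruction competing for issue slots.
        separate_mem_ops = mem_per_fma
    return issue_width / (1.0 + separate_mem_ops)

# 2-issue core, 0.75 memory accesses per FMA
print(fma_per_cycle(2, 0.75, load_op_fused=False))  # separate loads
print(fma_per_cycle(2, 0.75, load_op_fused=True))   # load+op
```

With separate loads the bound is 2/1.75, about 1.14 FMAs per cycle, so a second FMA unit would sit mostly idle. With load+op the bound is 2 per cycle in this model, which is the only case where a dual-FMA 2-issue core could be kept busy.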

> I am much more skeptical about out-of-order execution of a fully mask-predicated instruction set...

You mean too many register inputs per uOP?
I haven't looked at AVX-512 in sufficient detail. How many register inputs are needed per FMA?

