By: Travis Downs, December 13, 2018 8:39 am
anon on December 13, 2018 8:06 am wrote:
> -.- on December 13, 2018 3:37 am wrote:
> > Seni on December 12, 2018 1:58 pm wrote:
> > > -several signs point to moderately lower clock frequency than Skylake family. The renamer is one, but
> > > also the enlarged caches (L1, uop cache and tlb), the dual store port LSU, and a few other areas.
> >
> > I'm a little curious about the AVX-512 units on Sunny Cove.
> > Skylake has 2x 512b units, where ports 0 and 1 combine to form one unit. To accommodate
> > this, Skylake made supported vector operations on these ports largely the same. Also,
> > instructions execute on port 0, only borrowing the vector unit of port 1.
> >
> > Sunny Cove adds vector shuffle to port 1, which means that port 0 and 1 now support different
> > vector instructions. Curious to note that this is port 1 and not 0, which would be the logical
> > place to put it if it had 2x 512b units. Also, I suspect that permutes (assuming 'shuffle' includes
> > cross-lane permutes) are probably best done on units which have the full width, rather than trying
> > to split the instruction into halves and dealing with the complexity of such.
> >
> > So perhaps this means...
> >
> > - port 1 remains 256b, and the shuffle unit can only do 256b shuffles, meaning
> > the CPU can do either 2x 256b (port 1+5) or 1x 512b (port 5 only) per clock?
> > - port 1 is extended to 512b, suggesting that port 0 is also, so the CPU has 3x 512b ports and can do
> > 3x 512b FMA per cycle? (assuming non-'gimped' AVX-512 CPUs like the single port AVX-512 Skylakes)
> > - some other scheme, like 256b port 1 but a 512b shuffle unit (not
> > clued enough on CPU design to know the feasibility of this)?
> >
> > What do you think?
> > 3x 512b units seems the most likely to me - gives more FLOPS, which Intel likes, but the power usage...
> Keep in mind that the ports do not necessarily need to support the
> same uops for 256b and 512b. Port 5 doesn't do 256b FMA iirc.
> So two options are equally possible and plausible: (a) 256b shuffle on port 1 and no 512b
> shuffle on port 0; or (b) 256b shuffle on port 1 and 512b shuffle on port 0 (which reuses the same
> unit, and is on port 1 for 256b simply because div is already on port 0), with no 256b shuffle on port 0.
> 3x 512b shuffle seems unlikely and throwing in FMA as well is
> even more unlikely since the two are not exactly related.

If I understood the suggested possibility correctly, it was three 512b units but only two 512b shuffle units (on p1 and p5).

> Either way two shuffle ports are nice to have. Even if they are not 512b it's
> still nice getting the same throughput with 256b without having to use AVX512
> or getting a massive boost when you actually want independent 256b shuffles.

Definitely. It's been a decade of writing code that trades shuffles for extra blends (or whatever else works) to get around p5 shuffle pressure. A second shuffle port would give a nice boost to a lot of that code (after a rewrite, though!).
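
To put rough numbers on the p5 pressure point, here is a toy port-pressure bound in Python. It is only a sketch under stated assumptions: the "Skylake-like" map (256b shuffles on p5 only, immediate blends on p0/p1/p5) matches the commonly documented behaviour, the "two-shuffle" map is the speculated Sunny Cove layout from this thread, and the two kernels and their uop counts are made up for illustration. It ignores latency, the front end and everything else that is not port contention.

```python
from fractions import Fraction
from itertools import combinations


def min_cycles(uops, ports):
    """Lower bound on cycles per iteration from port pressure alone.

    uops:  {uop_kind: count per loop iteration}
    ports: {uop_kind: set of ports that can execute that kind}

    For any set S of ports, the uops that can ONLY run on ports in S need
    at least (their count) / |S| cycles. The bound is the max over all S.
    """
    all_ports = sorted(set().union(*ports.values()))
    best = Fraction(0)
    for r in range(1, len(all_ports) + 1):
        for subset in combinations(all_ports, r):
            s = set(subset)
            confined = sum(n for kind, n in uops.items() if ports[kind] <= s)
            best = max(best, Fraction(confined, len(s)))
    return best


# Assumed port maps (256b vector ops only).
SKYLAKE_LIKE = {"shuffle": {5}, "blend": {0, 1, 5}}
TWO_SHUFFLE = {"shuffle": {1, 5}, "blend": {0, 1, 5}}  # speculated layout

# Hypothetical kernels: the "natural" version, and the workaround that
# trades 2 shuffles for 4 extra blends to relieve p5.
SHUFFLE_HEAVY = {"shuffle": 4, "blend": 2}
BLEND_HEAVY = {"shuffle": 2, "blend": 6}

if __name__ == "__main__":
    for name, machine in (("skylake-like", SKYLAKE_LIKE),
                          ("two-shuffle", TWO_SHUFFLE)):
        for kname, kernel in (("shuffle-heavy", SHUFFLE_HEAVY),
                              ("blend-heavy", BLEND_HEAVY)):
            print(name, kname, float(min_cycles(kernel, machine)))
```

With these made-up counts the blend-heavy workaround wins on the single-shuffle-port map (8/3 vs 4 cycles per iteration), while on the two-shuffle map the original shuffle-heavy version is the faster one (2 vs 8/3), which is why the rewrite would be needed to see the benefit.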

