By: Andrei F (andrei.delete@this.anandtech.com),
Room: Moderated Discussions
Z (noreply.delete@this.127.0.0.1) on November 11, 2020 8:32 am wrote:
> Andrei F (andrei.delete@this.anandtech.com) on November 11, 2020 8:13 am wrote:
> > Dummond D. Slow (mental.delete@this.protozoa.us) on November 11, 2020 6:08 am wrote:
> > > Hmm unless that is supposed to mean that it does 2x the ops Intel and AMD can do. But I doubt Intel and AMD
> > > only do 2x 256bit versus Apple's 4x128. Not completely sure they only have 2x throughput but it's possible.
> > >
> >
> > That's exactly what's happening, it's 2x256 vs 4x128.
>
> Intel can do 2x512 or 3x256. AMD can do 4x256. Although not
> all execution pipes can perform every type of operations.
No, Intel can do 1x512 or 2x256 per cycle on consumer Ice Lake and Sunny Cove. The actual SIMD width of the pipelines is 2x256.
AMD did change from 2x256 to a distributed 4x256 on Zen3, yes, you're right, but it's still 2x256 each for ADDs and FMULs respectively, though you can mix them now. I don't actually know what the practical limit here is, as the core still only does 2x256b loads per cycle; it raises the utilisation for mixed workloads, but I'm not sure what the actual throughput boost is.
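To put rough numbers on the above, here is a back-of-the-envelope sketch in Python. It only multiplies pipe count by pipe width and then caps the result with a simple load-port bound; the function names and the `loads_per_op` parameter are made up for illustration, and the model ignores scheduling, latency, and cache effects, so treat it as arithmetic rather than a measurement of any real core.

```python
# Rough peak-throughput arithmetic for the pipe configurations discussed above.
# Everything here is an illustrative assumption, not a model of a real core.

def peak_bits_per_cycle(pipes, width_bits):
    """Peak SIMD bits processed per cycle: number of pipes times pipe width."""
    return pipes * width_bits

# Configurations named in the thread, as (pipes, width in bits).
configs = {
    "2x256": (2, 256),
    "4x128": (4, 128),
    "1x512": (1, 512),
    "4x256": (4, 256),
}

peak = {name: peak_bits_per_cycle(*cfg) for name, cfg in configs.items()}

def sustained_ops_per_cycle(exec_pipes, load_ports, loads_per_op):
    """Upper bound on vector ops per cycle when each op needs `loads_per_op`
    loads from memory (hypothetical parameter) and the core can issue
    `load_ports` vector loads per cycle."""
    if loads_per_op == 0:
        return float(exec_pipes)  # register-only work is bound by the pipes alone
    return min(exec_pipes, load_ports / loads_per_op)

if __name__ == "__main__":
    for name, bits in peak.items():
        print(f"{name}: {bits} bits/cycle peak")
    # Four 256-bit pipes fed by two 256-bit loads per cycle, one load per op:
    print(sustained_ops_per_cycle(4, 2, 1))
```

Under these assumptions 2x256, 4x128 and 1x512 all come out at the same peak, and a 4x256 arrangement only pulls ahead when the work does not need a fresh load for every operation.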


