By: Michael S (already5chosen.delete@this.yahoo.com), September 20, 2021 5:12 am
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on September 20, 2021 2:19 am wrote:
>
> It'd be interesting to see how Supercomputer Fugaku fares that way as they said they were doing special work
> accessing two cache lines at once to deal with their 512 bit reads and writes being split in two like that.
I suspect that what they consider "special work" has been the norm in the realm of "big" x86 cores since the introduction of AMD's Hounds line (a.k.a. Barcelona/Shanghai/Istanbul) back in 2007. Except, of course, that the SIMD data paths on Fujitsu A64FX are 4 times wider than they were on AMD Hounds.
On Fugaku, for 512-bit L1D-resident unaligned SIMD loads I'd expect a throughput penalty of slightly more than 2x relative to the aligned case. Exactly the same as I expect on Intel SKX.
Now, for unaligned SIMD load *latency* with lightly loaded LSUs, I'd expect no penalty on Fugaku vs. 1-3 cycles of penalty on SKX. That's not because Fugaku's unaligned load latency is low, but because its aligned load latency is unusually high.