Haswell SGEMM

By: Michael S (already5chosen.delete@this.yahoo.com), March 16, 2017 11:39 am
Room: Moderated Discussions
Gian-Carlo Pascutto (gcp.delete@this.sjeng.org) on March 14, 2017 1:27 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on March 12, 2017 3:24 am wrote:
> > 1939 appears correct.
> > 1650 is hard to believe, but theoretically possible.
> > 1654 - impossible. tput should be 0.5c
> There are strange things going on with Ryzen FPU/AVX performance.
> In my small BLAS workload (M=100 N=300 K=1000), it outperforms
> Haswell. Despite there being no optimized kernel.
> http://computer-go.org/pipermail/computer-go/2017-March/009951.html
> --

I dug out an oldish MKL and ran SGEMM (in fact cblas_sgemm(); I don't have a Fortran compiler for the "real" SGEMM and have no desire to learn how to call Fortran from C++) with the parameters specified above (and with Order=CblasRowMajor, TransA=CblasNoTrans, TransB=CblasNoTrans, alpha=1, beta=0, lda=K, ldb=N, ldc=N) on IvyBridge and Haswell.
I haven't yet validated that I did everything right.
Preliminary results (single core):
IvyBridge - 12.6 FLOPs/Hz
Haswell - 15.1 FLOPs/Hz
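
For anyone who wants to check numbers like these, here is a minimal sketch of the bookkeeping, not the code behind the figures above. It assumes the usual convention of counting 2*M*N*K FLOPs per SGEMM call (60 million for M=100, N=300, K=1000) and dividing by elapsed time times core clock. The names sgemm_ref, sgemm_flops and flops_per_hz are hypothetical helpers; sgemm_ref is a naive row-major loop with the same layout as the call described (lda=K, ldb=N, ldc=N), meant only for validating a library's output.

```cpp
#include <cassert>
#include <cstddef>

// Naive row-major reference, no transposes: C = alpha*A*B + beta*C.
// A is M x K (lda >= K), B is K x N (ldb >= N), C is M x N (ldc >= N).
// Hypothetical helper for checking a library result, not a fast kernel.
void sgemm_ref(std::size_t M, std::size_t N, std::size_t K,
               float alpha, const float* A, std::size_t lda,
               const float* B, std::size_t ldb,
               float beta, float* C, std::size_t ldc) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * lda + k] * B[k * ldb + j];
            // With beta == 0 the old contents of C are ignored, as in BLAS.
            C[i * ldc + j] = (beta == 0.0f)
                                 ? alpha * acc
                                 : alpha * acc + beta * C[i * ldc + j];
        }
    }
}

// One multiply and one add per inner-loop step gives 2*M*N*K FLOPs.
double sgemm_flops(std::size_t M, std::size_t N, std::size_t K) {
    return 2.0 * static_cast<double>(M) * static_cast<double>(N) *
           static_cast<double>(K);
}

// FLOPs per clock cycle: total FLOPs divided by (elapsed seconds * core Hz).
// core_hz has to be the clock the core really ran at during the measurement.
double flops_per_hz(double flops, double seconds, double core_hz) {
    assert(seconds > 0.0 && core_hz > 0.0);
    return flops / (seconds * core_hz);
}
```

The weak point of this kind of measurement is core_hz: with turbo enabled the real clock can differ from the nominal one, so the result is only as good as the frequency that goes in.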

Maybe the Haswell result is that modest because my MKL 11.1 is old... But the release notes say that they optimized for AVX2. I'd assume that means they also optimized for FMA.

So, let's assume that the MKL used by Gian-Carlo Pascutto is no better than mine.
He said that MKL is about 25% faster than OpenBLAS. Then OpenBLAS on Haswell would be ~12 FLOPs/Hz.
If Ryzen also does 12 FLOPs/Hz, then it has pretty good efficiency (75% of theoretical peak) on this workload, but that's nothing unusual.
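
The arithmetic behind that estimate, as a small sketch. It assumes "25% faster" means MKL rate = 1.25 * OpenBLAS rate, and it takes 16 FLOPs/Hz as the Ryzen single-precision peak, which is the value implied by the 75% figure (12 / 0.75). The function names are hypothetical.

```cpp
#include <cassert>

// If the faster library is (1 + speedup_fraction) times the slower one,
// the slower library's rate is faster_rate / (1 + speedup_fraction).
double slower_library_rate(double faster_rate, double speedup_fraction) {
    assert(speedup_fraction > -1.0);
    return faster_rate / (1.0 + speedup_fraction);
}

// Fraction of the theoretical peak that is actually achieved.
double efficiency(double achieved_flops_per_hz, double peak_flops_per_hz) {
    assert(peak_flops_per_hz > 0.0);
    return achieved_flops_per_hz / peak_flops_per_hz;
}

// Inputs taken from the post.
const double kMklHaswell = 15.1;        // measured, FLOPs/Hz
const double kMklOverOpenBlas = 0.25;   // "about 25% faster"
// Assumption: peak implied by "75% of theoretical peak" at 12 FLOPs/Hz.
const double kRyzenPeak = 16.0;

// Estimated OpenBLAS rate on Haswell: 15.1 / 1.25, i.e. roughly 12.
const double kOpenBlasHaswell =
    slower_library_rate(kMklHaswell, kMklOverOpenBlas);

// Ryzen at 12 FLOPs/Hz against a peak of 16.
const double kRyzenEfficiency = efficiency(12.0, kRyzenPeak);
```

If "25% faster" was instead meant as OpenBLAS taking 25% longer or running at 75% of MKL's rate, the estimate shifts a little, but it stays in the 11-12 FLOPs/Hz neighbourhood.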

I wonder how fast I could make HSWL run SGEMM for this data size. On paper, 20-22 FLOPs/Hz appears easy.
Unfortunately, my hobby coding capacity is currently fully consumed by another project.
So it's going to be a job for somebody else. Or for me, but later.
