By: Michael S (already5chosen.delete@this.yahoo.com), March 19, 2017 4:28 am

Room: Moderated Discussions

Per Hesselgren (perhesselgren.delete@this.yahoo.se) on March 19, 2017 3:06 am wrote:

> Michael S (already5chosen.delete@this.yahoo.com) on March 18, 2017 10:21 am wrote:

> > Per Hesselgren (perhesselgren.delete@this.yahoo.se) on March 17, 2017 8:49 am wrote:

> > >

> > > Now I have got a Ryzen 1700 myself so I have some results.

> > > This is the single thread matrix multiply:

> > >

> > > Algorithm Ivy Bridge Excavator Ryzen

> > > ----n 8,09 8,25 5,22

> > > ----v 7,91 7,59 5,58

> > > ----u 7,79 4,56 2,5

> > > ----p 8,06 7,74 5,27

> > > ----t 3,08 6,35 4,94

> > > ----i 1,58 2,5 1,31

> > > ----b 4,19 6,26 3,87

> > > ----m 1,39 3,08 1,15

> > > ----w 2,22 3,66 1,97

> > > ----r 3,09 6,2 4,94

> > >

> > > The times in secs are not so interesting as the clocks are all different.

> > > But if we use the -n algorithm time as 100% index we get:

> > > Algorithm Ivy Bridge Excavator Ryzen

> > > -----n 100 100 100

> > > -----v 98 92 107

> > > -----u 96 55 48

> > > -----p 100 94 101

> > > -----t 38 77 95

> > > -----i 20 30 25

> > > -----b 52 76 74

> > > -----m 17 37 22

> > > -----w 27 44 38

> > > -----r 38 75 95

> >

> > Can you report it in FLOPs/core and FLOPs/(core*Hz) ?

> >

> > Results for Algorithm m will be sufficient, the rest of them are obviously doing something wrong.

> >

> As we have N³ multiplies and N³-N² adds 1 sec means around 2 GFLOPS.

So, the time reported is for a single multiplication of 1000x1000 matrices?

> -m Ivy Bridge=1.44 Excavator=0.65 Ryzen=1,74

> This is all single thread with clocks 3.3, 3.5 and 3.7 GHz

> -m Ivy Bridge=0.44 Excavator=0.19 Ryzen=0.47 GFLOPS/GHz

Thank you.

>

> This is an old compiler so I made some tests with 32-bit GCC for Ryzen:

> -i was the best with 0.78 secs

Still only 0.69 FLOPs/(core*Hz)
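
For anyone who wants to check the arithmetic, here is a minimal sketch of the conversion from run time to GFLOPS and FLOPs/(core*Hz). It assumes N = 1000, which is implied by "1 sec means around 2 GFLOPS" but not stated outright, and it uses the N³ multiplies plus N³-N² adds count from the quoted post. The function names are made up for illustration.

```python
# Sketch only: N = 1000 is an assumption inferred from "1 sec ~ 2 GFLOPS".

def matmul_flops(n):
    """FLOP count for a naive n x n matrix multiply: n^3 muls + (n^3 - n^2) adds."""
    return n**3 + (n**3 - n**2)

def gflops(seconds, n=1000):
    """Achieved GFLOPS for one n x n multiply that took `seconds`."""
    return matmul_flops(n) / seconds / 1e9

def flops_per_cycle(seconds, clock_ghz, n=1000):
    """Achieved FLOPs per core clock cycle, i.e. GFLOPS/GHz."""
    return gflops(seconds, n) / clock_ghz

# (time in secs, clock in GHz) taken from the posts above.
runs = {
    "Ivy Bridge -m": (1.39, 3.3),
    "Excavator -m": (3.08, 3.5),
    "Ryzen -m": (1.15, 3.7),
    "Ryzen -i, 32-bit GCC": (0.78, 3.7),
}

for name, (secs, ghz) in runs.items():
    print(f"{name}: {gflops(secs):.2f} GFLOPS, "
          f"{flops_per_cycle(secs, ghz):.2f} FLOPs/(core*Hz)")
```

With these inputs the -m rows come out at 1.44, 0.65 and 1.74 GFLOPS (0.44, 0.19 and 0.47 per GHz), and the 32-bit GCC run at 0.69 FLOPs/(core*Hz), matching the figures quoted above.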

> For Raspberry Pi (1.2 GHz)

> -r was the best with 5.03 secs followed by

> -t at 5.04 secs

>

All x86 scores are EXTREMELY low. The test appears not to push the IvyB/Excavator/Ryzen FPUs at all. The bottleneck is somewhere else. Most likely, the compiler does not utilize SIMD at all. But even without SIMD and without FMA at FPU level, all these cores should be capable of ~1.5-1.8 FLOPs/(core*Hz).

As for the Raspberry Pi, I don't know if it is pushing the FPU or not.

Is your Raspberry Pi a BCM2837 running in 64-bit mode? ARM Cortex A53?

I didn't find FPU throughput numbers in the A53 TRM. I am sure that other RWT posters (Wilco? none? Exophase?) can tell us.
