By: Maynard Handley (name99.delete@this.name99.org), October 20, 2017 8:20 pm

Room: Moderated Discussions

Maynard Handley (name99.delete@this.name99.org) on October 20, 2017 8:16 pm wrote:

> Maynard Handley (name99.delete@this.name99.org) on October 20, 2017 1:41 pm wrote:

> > dmcq (dmcq.delete@this.fano.co.uk) on October 20, 2017 7:26 am wrote:

> > > Maynard Handley (name99.delete@this.name99.org) on October 20, 2017 1:34 am wrote:

> > > > I mentioned some weeks ago that Wolfram was going to ship a Mathematica Player for iPad,

> > > > and that it would be an interesting performance comparison against x86 of a "serious"

> > > > app. Well the Player has been released and I've spent a few hours playing with it.

> > > >

> >

> > > Thanks very much, that's interesting. As far as cooperation is concerned that is what Linaro

> > > is all about. I see they have recently added a high performance computing group

> > > Linaro: High Performance Computing (HPC)

> > > This, I guess, would be where most of what you see as missing would be done in the

> > > future. But until now I think it has been up to ARM to do all the basics:

> > > Welcome to the Arm HPC Ecosystem

> > > I'm not sure when they started to get serious about all this but their purchase of Allinea

> > > and the Post-K business with Fujitsu certainly indicates they are serious about it. (Hm

> > > - I wonder if Apple are looking at that Scalable Vector Extension facility)

> > >

> >

>

> OK, as promised

> We're running a quad-core Ivy Bridge

> i7-3720QM (2.6GHz base, 3.6GHz turbo, 6MiB L3)

> against an iPad Pro (3 Hurricane cores, 2.35GHz, 3+4MiB cache)

>

> Most of the benchmarks are the sub benchmarks of the Mathematica benchmark suite (though sometimes with some

> detail modified so that I can figure/time the part I care about rather than some other overhead). I've grouped

> them somewhat by common theme and added comments as to things I know (or think) explain the results:

>

> Data fitting:

> x86 0.44s

> A 0.54s

>

> 10^6 Digits of Pi:

> x86 0.35s

> A 4.6s (13x)

> (float multiprecision)

>

> Gamma function (factorials of five numbers near 80 000):

> x86 .042

> A .66 (16x)

> (int multiprecision)

>

> Large integer multiplication

> x86 .12

> A 2.2 (18x)

>

> Discrete Fourier transform (of floats):

> x86 .36 (runs the FT parallel, but not the Do)

> A 2.1 (does NOT run the FT parallel)

> (and parallel kernels not supported, so ParallelDo[] fails)

>

>

> Create a large (4000x4000 element square) random real matrix

> x86: .19

> A .17

>

> Invert a large (1000x1000) random real matrix

> x86 .066 (runs in parallel)

> A .52 (8x) does NOT run in parallel

>

> A.B.A^-1 (large real matrix arithmetic)

> x86 .11

> A 1.5 (13x) once again no parallelism (and no vectors?)

>

> Invert 200x200 integer matrix of 0,1

> x86 1.1

> A 3.8 this one IS parallel on iPad! so still 4:3, and no vectors?

>

> Create millions of random reals

> x86 .51s

> A .42s

>

> Create millions of random bits

> x86 .2s

> A .09s

>

> Create millions of random integers

> x86 .32s

> A .23s

>

> Calculate Sin and Exp of millions of reals

> x86 1s

> A 1.3s (definitely parallelized; shows 4:3 cores…)

>

> Calculate ArcTan of millions of reals

> x86 1s

> A 3.1s (both still parallelized. bad algorithm on iPad?)

>

> Millions of real divisions (~400MB of data, so limited by memory?)

> x86 1.1s Intel has two dividers?

> A .7s (!) Apple has three dividers?

> (both stay the same when divide is changed to multiply, so actually a memory test?)

>

> But if we change the details (now about 3.5MB of data, many more iterations)

> for multiplications:

> x86 1.5s

> A .8s

> for divisions

> x86 4.4s

> A 4.2s

> So Apple does even better — more multiply units? more but slower divide units?

>

> If we do the same sort of thing for integers (fp=N[int1/int2]) we get

> x86 1.8s

> A 1.9s

> (this is mix of int and fp, no parallelism or vectors)

>

> Real matrix powers

> x86 2.6s

> A 2.8s (this one clearly IS parallelized on both CPUs)

>

> Eigenvalues of a large matrix

> x86 .43s parallelized

> A 1.5s not parallelized

>

> Singular value decomposition

> x86 .48s

> A 3.6s (7.5x) not parallelized, (not vectorized?)

>

> Solve a linear system

> x86 .24s

> A 2.6s (11x) as above

>

> Transpose a large (real) matrix (~35MB data)

> x86 .68s

> A .55s better memory system

>

> Numerical integration

> x86 .63s

> A .56s

>

> Sort a million integers

> x86 2.2s

> A 2.2s

>

> Sort a million reals

> x86 2.4s

> A 2.6s

>

> Expanding large polynomials (with REAL coefficients)

> x86 .9s

> A .8s

>

> Symbolic integration

> x86 .91s

> A .54s

>

>

> I'm actually stunned at how well the Apple chip does. I think those who continue to insist that

> this is a toy chip incapable of real work are going to be eating their words, and I'm more convinced

> than ever that Apple has grand dreams ahead of them driven by their CPUs (and GPUs? NPUs?)

>

> Like I said, for single-threaded non-vectorized tasks, Apple basically matches x86,

> in spite of the 3.6/2.35=1.5x frequency boost. For the most memory intensive tasks (large

> transpose, simple calculations over very long vectors) we're seeing a superior memory

> system (and that's before the spectacular A11 improvements in this regard).

>

> It's mildly amusing that, even though Apple has less absolute FP capacity than Intel (3

> 128-wide units vs 2 256-wide units) for code that operates on a scalar at a time, Apple can

> run the three units simultaneously, while Intel can only run two simultaneously. Hence

> we see the divergence for the "now about 3.5MB of data, many more iterations" loops.

>

> Obviously there is, to repeat the point once again, missing algorithmic functionality in the form of

> - any? vector support

> - very little parallelism support

> - terrible multi-precision handling

>

> But the baseline hardware is remarkable!

> Obviously Intel has another 50% or so of reserve capacity in the form of running at 5GHz or so, and

> various minor IPC boosts from Ivy Bridge to Kaby Lake. Meanwhile Apple has (apparently) a ~25% boost

> in the A11 and while there will clearly be some code that will not scale up (anything that's basically

> already #fp units * frequency) there's probably enough stuff that's still throttled by memory or branch

> that it will pick up value from that IPC boost. And while it's probably unreasonable to assume that

> Apple could crank up an A11 to 5GHz, they likely could crank it up to 3GHz at acceptable power levels

> to pick up that remaining 25%, and so match Intel for desktop Mathematica?

>

> Certainly I think it's time to put to rest this on-going nonsense that "GeekBench is all in cache,

> but Intel slays when you go beyond cache". Most of the problems that I benchmarked (except in

> cases where I was specifically trying to test in-cache performance) were larger to substantially

> larger than either device's cache (generally 10s to 100s of MB), and likewise Mathematica's main

> execution loop is a fairly hefty chunk of code, especially for things like the polynomial expansion/simplification,

> and symbolic integration, substantially beyond I-cache size.

No, the formatting still sucks.

OK, well I have better things to do than work around a system that insists on displaying tabs and multiple spaces ONE WAY when you're editing, and then in a completely different way when you're presenting the page, and whose "Code" setting doesn't seem to understand that "HONORING THE DAMN NUMBER OF SPACES IS THE SINGLE MOST IMPORTANT JOB YOU HAVE!!!"

Sorry guys, you have to read it as randomly formatted crap.

Data fitting:

x86 0.44s

A 0.54s

10^6 Digits of Pi:

x86 0.35s

A 4.6s (13x)

(float multiprecision)

Gamma function (factorials of five numbers near 80 000):

x86 0.042s

A 0.66s (16x)

(int multiprecision)

Large integer multiplication

x86 0.12s

A 2.2s (18x)
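As a rough illustration of what this kind of multiprecision workload looks like, here's a stdlib-Python analogue (not the Mathematica benchmark's actual code; Mathematica's bignums are, as far as I know, GMP-backed on x86, which is presumably part of the gap):

```python
# Rough stdlib analogue of a large-integer multiplication workload
# (illustration only, not the Mathematica benchmark's code).
import random
import time

random.seed(99)
a = random.getrandbits(1_000_000)  # roughly 300,000 decimal digits
b = random.getrandbits(1_000_000)

start = time.perf_counter()
product = a * b
elapsed = time.perf_counter() - start

# the product of two n-bit numbers has 2n or 2n-1 bits
print(f"{product.bit_length()} bits in {elapsed:.3f}s")
```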

Discrete Fourier transform (of floats):

x86 0.36s (runs the FT in parallel, but not the Do)

A 2.1s (does NOT run the FT in parallel)

(and parallel kernels not supported, so ParallelDo[] fails)
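To make the operation concrete, here's a naive O(n^2) DFT in stdlib Python. The benchmark itself calls Mathematica's Fourier[], which is an FFT (parallel on x86, apparently not in the iPad build); this sketch is only to show the transform being timed:

```python
# Naive O(n^2) discrete Fourier transform, stdlib only -- just to make
# the benchmarked operation concrete, not to reproduce the benchmark.
import cmath

def dft(xs):
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * m / n)
                for m, x in enumerate(xs))
            for k in range(n)]

# DFT of a unit impulse is flat: every bin equals 1
spectrum = dft([1.0, 0.0, 0.0, 0.0])
```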

Create a large (4000x4000 element square) random real matrix

x86 0.19s

A 0.17s

Invert a large (1000x1000) random real matrix

x86 0.066s (runs in parallel)

A 0.52s (8x) does NOT run in parallel

A.B.A^-1 (large real matrix arithmetic)

x86 0.11s

A 1.5s (13x) once again no parallelism (and no vectors?)

Invert 200x200 integer matrix of 0,1

x86 1.1s

A 3.8s this one IS parallel on iPad! so still 4:3, and no vectors?
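A plausible reason this entry behaves so differently from the real-matrix inversion above: an integer matrix inverse is computed in exact rational arithmetic, not BLAS floating point. A minimal Fraction-based Gauss-Jordan sketch (my assumption about the nature of the workload, not the benchmark's actual code):

```python
# Exact inverse of a square integer matrix via Gauss-Jordan over the
# rationals -- the kind of arithmetic an exact 0/1 matrix inverse
# involves, unlike the floating-point (BLAS) real-matrix cases.
from fractions import Fraction

def invert(m):
    n = len(m)
    # augment each row with the matching row of the identity matrix
    aug = [[Fraction(v) for v in row] +
           [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        # pick a row with a nonzero pivot and swap it into place
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # eliminate this column from every other row
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

inv = invert([[1, 1], [0, 1]])  # inverse is [[1, -1], [0, 1]]
```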

Create millions of random reals

x86 .51s

A .42s

Create millions of random bits

x86 .2s

A .09s

Create millions of random integers

x86 .32s

A .23s

Calculate Sin and Exp of millions of reals

x86 1s

A 1.3s (definitely parallelized; shows 4:3 cores…)

Calculate ArcTan of millions of reals

x86 1s

A 3.1s (both still parallelized. bad algorithm on iPad?)

Millions of real divisions (~400MB of data, so limited by memory?)

x86 1.1s Intel has two dividers?

A 0.7s (!) Apple has three dividers?

(both stay the same when divide is changed to multiply, so actually a memory test?)
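The methodology behind that last parenthetical, sketched in Python: swap the operation and see whether the time moves. If it doesn't, the loop is bound by memory traffic, not by the arithmetic units. (Interpreter overhead swamps the arithmetic in Python, so this illustrates the method, not the measurement.)

```python
# Swap-the-operation test: time the same elementwise loop with divide
# and with multiply. Identical times suggest the loop is memory-bound.
# (In Python the interpreter dominates, so this only shows the idea.)
import time
from array import array

n = 1_000_000
data = array("d", (float(i % 1000 + 1) for i in range(n)))

def timed(op):
    start = time.perf_counter()
    result = array("d", (op(x) for x in data))
    return time.perf_counter() - start, result

t_div, r_div = timed(lambda x: 1.0 / x)
t_mul, r_mul = timed(lambda x: x * 0.001)
print(f"divide: {t_div:.3f}s  multiply: {t_mul:.3f}s")
```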

But if we change the details (now about 3.5MB of data, many more iterations)

for multiplications:

x86 1.5s

A .8s

for divisions

x86 4.4s

A 4.2s

So Apple does even better — more multiply units? more but slower divide units?

If we do the same sort of thing for integers (fp=N[int1/int2]) we get

x86 1.8s

A 1.9s

(this is mix of int and fp, no parallelism or vectors)

Real matrix powers

x86 2.6s

A 2.8s (this one clearly IS parallelized on both CPUs)

Eigenvalues of a large matrix

x86 .43s parallelized

A 1.5s not parallelized

Singular value decomposition

x86 .48s

A 3.6s (7.5x) not parallelized, (not vectorized?)

Solve a linear system

x86 .24s

A 2.6s (11x) as above

Transpose a large (real) matrix (~35MB data)

x86 .68s

A .55s better memory system

Numerical integration

x86 .63s

A .56s

Sort a million integers

x86 2.2s

A 2.2s

Sort a million reals

x86 2.4s

A 2.6s

Expanding large polynomials (with REAL coefficients)

x86 .9s

A .8s

Symbolic integration

x86 .91s

A .54s
