Does this give better formatting?

By: anon (anon.delete@this.anon.com), October 20, 2017 9:37 pm
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on October 20, 2017 8:20 pm wrote:
> Maynard Handley (name99.delete@this.name99.org) on October 20, 2017 8:16 pm wrote:
> > Maynard Handley (name99.delete@this.name99.org) on October 20, 2017 1:41 pm wrote:
> > > dmcq (dmcq.delete@this.fano.co.uk) on October 20, 2017 7:26 am wrote:
> > > > Maynard Handley (name99.delete@this.name99.org) on October 20, 2017 1:34 am wrote:
> > > > > I mentioned some weeks ago that Wolfram was going to ship a Mathematica Player for iPad,
> > > > > and that it would be an interesting performance comparison against x86 of a "serious"
> > > > > app. Well the Player has been released and I've spent a few hours playing with it.
> > > > >
> > >
> > > > Thanks very much, that's interesting. As far as cooperation is concerned that is what Linaro
> > > > is all about. I see they have recently added a high performance computing group
> > > > Linaro: High Performance Computing (HPC)
> > > > This, I guess, is where most of what you see as missing would be done in the
> > > > future. But until now I think it has been up to ARM to do all the basics:
> > > > Welcome to the Arm HPC Ecosystem
> > > > I'm not sure when they started to get serious about all this, but their purchase of Allinea
> > > > and the Post-K business with Fujitsu certainly indicate that they are. (Hm
> > > > - I wonder if Apple are looking at that Scalable Vector Extension facility)
> > > >
> > >
> >
> > OK, as promised
> > We're running a quad-core Ivy Bridge
> > i7-3720QM (2.6GHz base, 3.6GHz turbo, 6MiB L3)
> > against an iPad Pro (3 Hurricane cores, 2.35GHz, 3+4MiB cache)
> >
> > Most of the benchmarks are the sub-benchmarks of the Mathematica
> > benchmark suite (though sometimes with details modified so that I can time just the part I care
> > about rather than some other overhead; see the sketch of a timing harness after the list). I've grouped
> > them somewhat by common theme and added comments on the things I know (or think) explain the results:
> >
> > Data fitting:
> > x86 0.44s
> > A 0.54s
> >
> > 10^6 Digits of Pi:
> > x86 0.35s
> > A 4.6s (13x)
> > (float multiprecision)
> >
> > Gamma function (factorials of five numbers near 80 000):
> > x86 .042s
> > A .66s (16x)
> > (int multiprecision)
> >
> > Large integer multiplication
> > x86 .12s
> > A 2.2s (18x)
> >
> > Discrete Fourier transform (of floats):
> > x86 .36s (runs the FT parallel, but not the Do)
> > A 2.1s (does NOT run the FT parallel)
> > (and parallel kernels not supported, so ParallelDo[] fails)
> >
> >
> > Create a large (4000x4000 element square) random real matrix
> > x86 .19s
> > A .17s
> >
> > Invert a large (1000x1000) random real matrix
> > x86 .066s (runs in parallel)
> > A .52s (8x) does NOT run in parallel
> >
> > A.B.A^-1 (large real matrix arithmetic)
> > x86 .11s
> > A 1.5s (13x) once again no parallelism (and no vectors?)
> >
> > Invert 200x200 integer matrix of 0,1
> > x86 1.1s
> > A 3.8s this one IS parallel on iPad! so still 4:3, and no vectors?
> >
> > Create millions of random reals
> > x86 .51s
> > A .42s
> >
> > Create millions of random bits
> > x86 .2s
> > A .09s
> >
> > Create millions of random integers
> > x86 .32s
> > A .23s
> >
> > Calculate Sin and Exp of millions of reals
> > x86 1s
> > A 1.3s (definitely parallelized; shows 4:3 cores…)
> >
> > Calculate ArcTan of millions of reals
> > x86 1s
> > A 3.1s (both still parallelized. bad algorithm on iPad?)
> >
> > Millions of real divisions (~400MB of data, so limited by memory?)
> > x86 1.1s Intel has two dividers?
> > A .7s (!) Apple has three dividers?
> > (both stay the same when divide is changed to multiply, so it's actually a memory test?)
> >
> > But if we change the details (now about 3.5MB of data, many more iterations)
> > for multiplications:
> > x86 1.5s
> > A .8s
> > for divisions
> > x86 4.4s
> > A 4.2s
> > So Apple does even better — more multiply units? more but slower divide units?
> >
> > If we do the same sort of thing for integers (fp=N[int1/int2]) we get
> > x86 1.8s
> > A 1.9s
> > (this is a mix of int and fp, no parallelism or vectors)
> >
> > Real matrix powers
> > x86 2.6s
> > A 2.8s (this one clearly IS parallelized on both CPUs)
> >
> > Eigenvalues of a large matrix
> > x86 .43s parallelized
> > A 1.5s not parallelized
> >
> > Singular value decomposition
> > x86 .48s
> > A 3.6s (7.5x) not parallelized (not vectorized?)
> >
> > Solve a linear system
> > x86 .24s
> > A 2.6s (11x) as above
> >
> > Transpose a large (real) matrix (~35MB data)
> > x86 .68s
> > A .55s better memory system
> >
> > Numerical integration
> > x86 .63s
> > A .56s
> >
> > Sort a million integers
> > x86 2.2s
> > A 2.2s
> >
> > Sort a million reals
> > x86 2.4s
> > A 2.6s
> >
> > Expanding large polynomials (with REAL coefficients)
> > x86 .9s
> > A .8s
> >
> > Symbolic integration
> > x86 .91s
> > A .54s
> >
> >
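For reference, here is a minimal sketch of the kind of per-item timing harness assumed above (Wolfram Language; the matrix size and the particular expressions are illustrative, not the actual benchmark-suite code):

(* illustrative only: time just the operation of interest, not the setup *)
m = RandomReal[{0, 1}, {1000, 1000}];     (* random real test matrix *)
First[AbsoluteTiming[Inverse[m];]]        (* seconds for the inversion alone *)
First[AbsoluteTiming[N[Pi, 10^6];]]       (* 10^6 digits of Pi, float multiprecision *)
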
> > I'm actually stunned at how well the Apple chip does. I think those who continue to insist that
> > this is a toy chip incapable of real work are going to be eating their words, and I'm more convinced
> > than ever that Apple has grand dreams ahead of them driven by their CPUs (and GPUs? NPUs?)
> >
> > Like I said, for single-threaded non-vectorized tasks, Apple basically matches x86,
> > in spite of Intel's 3.6/2.35 = 1.5x frequency advantage. For the most memory-intensive tasks (large
> > transpose, simple calculations over very long vectors) we're seeing a superior memory
> > system (and that's before the spectacular A11 improvements in this regard).
> >
> > It's mildly amusing that, even though Apple has less absolute FP capacity than Intel (3
> > 128-bit-wide units vs 2 256-bit-wide units), for code that operates on one scalar at a time Apple
> > can run its three units simultaneously, while Intel can only run two. Hence
> > we see the divergence for the "now about 3.5MB of data, many more iterations" loops (sketched below).
> >
> > Obviously there is, to repeat the point once again, missing algorithmic functionality in the form of
> > - any? vector support
> > - very little parallelism support
> > - terrible multi-precision handling
> >
> > But the baseline hardware is remarkable!
> > Obviously Intel has another 50% or so of reserve capacity in the form of running at 5GHz or so, and
> > various minor IPC boosts from Ivy Bridge to Kaby Lake. Meanwhile Apple has (apparently) a ~25% boost
> > in the A11 and while there will clearly be some code that will not scale up (anything that's basically
> > already #fp units * frequency) there's probably enough stuff that's still throttled by memory or branch
> > that it will pick up value from that IPC boost. And while it's probably unreasonable to assume that
> > Apple could crank up an A11 to 5GHz, they likely could crank it up to 3GHz at acceptable power levels
> > to pick up that remaining 25%, and so match Intel for desktop Mathematica?
> >
> > Certainly I think it's time to put to rest this ongoing nonsense that "GeekBench is all in cache,
> > but Intel slays when you go beyond cache". Most of the problems that I benchmarked (except in
> > cases where I was specifically trying to test in-cache performance) were larger to substantially
> > larger than either device's cache (generally 10s to 100s of MB), and likewise Mathematica's main
> > execution loop is a fairly hefty chunk of code, substantially beyond I-cache size, especially
> > for things like the polynomial expansion/simplification and symbolic integration.
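The "3.5MB of data, many more iterations" multiply/divide loops above would look something like this (array length and iteration count are guesses chosen to land around 3.5MB, not the exact values used):

(* ~3.5 MB of machine reals, reused many times so the working set stays in cache *)
a = RandomReal[{1, 2}, 450000];
b = RandomReal[{1, 2}, 450000];
First[AbsoluteTiming[Do[a*b, {100}];]]    (* elementwise multiply: compute-bound once cached *)
First[AbsoluteTiming[Do[a/b, {100}];]]    (* elementwise divide: exposes divider throughput *)
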
>
> No, the formatting still sucks.
> OK, well I have better things to do than work around a system that insists on displaying
> tabs and multiple spaces ONE WAY when you're editing, and then in a completely different
> way when you're presenting the page, and whose "Code" setting doesn't seem to understand
> that "HONORING THE DAMN NUMBER OF SPACES IS THE SINGLE MOST IMPORTANT JOB YOU HAVE!!!"
>
> Sorry guys, you have to read it as randomly formatted crap.
>
> [The benchmark list was repeated here verbatim; see the numbers quoted above.]


Interesting numbers, but can't you run these on a single thread on both x86 and Apple, given that so many of them aren't using the same parallelism?
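If the Player exposes the same system options as desktop Mathematica, something like this should force the numerics onto one thread on both sides (option names taken from desktop Mathematica and unverified on the iPad Player; the MKL option only means anything on x86):

(* pin Mathematica's internal numeric threading (and MKL on x86) to one thread *)
SetSystemOptions["ParallelOptions" -> {"MKLThreadNumber" -> 1,
                                       "ParallelThreadNumber" -> 1}];
SystemOptions["ParallelOptions"]   (* confirm what actually took effect *)

That would make the remaining gaps a purely single-core comparison.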