By: David Kanter (dkanter.delete@this.realworldtech.com), May 17, 2013 8:29 pm
Room: Moderated Discussions
Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on May 17, 2013 12:22 pm wrote:
> David Kanter (dkanter.delete@this.realworldtech.com) on May 17, 2013 8:00 am wrote:
> > Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on May 15, 2013 5:37 pm wrote:
> > > Ashraf Eassa (aeassa.delete@this.gmail.com) on May 15, 2013 11:59 am wrote:
> > > > Hi everybody,
> > > >
> > > > I've been lurking for years, but the time has come when I would really love to pick the brains of
> > > > the experts we have here. From my understanding, Atom is a much narrower design than Krait, Cortex
> > > > A15 and others, and yet, in many benchmarks the older Saltwell core holds its own against even Krait
> > > > in both FPU/INT, and against A15 in Linux integer benchmarks (but it gets decimated in FPU).
> > >
> > > Which Linux benchmarks do you mean? This does not show a single benchmark where dual Atom can keep up
> > > with dual A15. Even Tegra 3 wins 9 out of 11 benchmarks despite its slow single-channel memory system.
> >
> > You realize there are an awful lot more tests out there that Phoronix doesn't run, right?
> >
> > Also, why do we even care about Linux? We care about Android, which is rather distinct from Linux.
>
> Android is just a layer on top of Linux. So yes, we do care about Linux performance using GCC on mainstream
> code. How many tricks ICC uses to get great SPEC results is irrelevant in the Linux/Android world.
Android applications don't really use GCC as I understand it. They are using Dalvik.
So you're comparing two entirely different software stacks and trying to draw conclusions. Moreover, my understanding is that Moorestown actually matches the A9 and A15 on quite a few benchmarks (based on discussions with Anand).
> > > > So, my question is, how do I think about "Silvermont" competitive position against a fairly
> > > > beefy modern ARM design such as the Cortex A15? From a high level perspective, it looks
> > > > like on a per-clock basis it should be no contest - A15 is wider and more aggressive.
> > > > But Intel is claiming that Silvermont is as fast as A15 on a per-clock basis.
> > >
> > > "Intel is claiming" - there is your hint... When Atom originally was announced, it was supposed
> > > to be 5-6 times faster than ARM cores. However when Atom was finally available in phones, it
> > > actually lagged in performance. This is where Atom is today. Is that competitive?
> >
> > When did Intel claim that Atom would be 5-6x faster than ARM cores? And which
> > ARM cores? I'd like to see some proof, because that just sounds crazy.
>
> Here is what Intel claimed at the time. Yep crazy stuff indeed. They compared the fastest Atom against
> a low frequency ARM11, despite much faster versions being available (IIRC 750MHz), as well as 600/800MHz
> Cortex-A8. Remember the "only x86 gives the full web browsing experience" slogans?
It was a stupid comparison, you're right. But I wouldn't say it was incorrect. I'm quite sure that Silverthorne beats the OMAP2 silly. Of course, the OMAP2 has been irrelevant for ages. Designing a car faster than a horse-drawn buggy isn't exactly impressive.
And it was a very specific comparison, which I suspect is true. Intel's marketing is aggressive, but if you question them carefully, you'll ALWAYS get the fine print. Of course, most journalists are not technically savvy enough to do so. And in this case, the fine print said: "We are winning a comparison which is a pointless comparison".
> > > > A couple of questions then:
> > > >
> > > > 1. How can a narrower design pull this off?
> > >
> > > It doesn't. Not without trickery anyway - like comparing a highly clocked CPU against a low
> > > clocked one,
> >
> > That's not trickery, that's life. Intel has better process technology and is able to
> > hit higher clock speeds.
>
> It's trickery when you use a slow CPU on purpose when much faster CPUs are available.
Intel's comparisons vs. A15 were normalized to power and/or specific devices. That's a very fair equalization point. Far more sensible than equal frequency.
> And in terms of frequency ARM has caught up dramatically in recent years. I expect ARM
> to pull ahead in frequency with Tegra 4i, 20nm A15's and the first 64-bit ARMs.
That's not remotely true. The marketing departments at Samsung and Nvidia have caught up. Most of these devices are rated for frequencies that cannot be sustained for more than a few seconds at best, and still result in ridiculous power consumption. The frequency for Atom is also suspect, but frankly Intel lies less about frequency than other companies and has vastly better DVFS implementations.
> > Moreover, there are many A15 implementations that are incredibly
> > power hungry. This shouldn't surprise anyone, since the A15 started out as a server
> > core...but then something happened and ARM tried to shove it into mobiles.
>
> I call BS on that - ARM has never said that A15 is a server-only CPU, it has always been designed for mobile/tablets
> but with added server extensions. Here is Anands first article on A15, even the title is clear.
I have been told by several people that the A15 was started internally as a server design and later repositioned. That being said, other people have told me that isn't true.
> > Clock-normalized comparisons are useful as thinking points, but you really need to consider physical design
> > and process technology. Power and frequency are intrinsically tied to physical design and process, as is
> > area. Certainly there are architectural techniques that can have a big impact (I think the A7 omitting
> > a branch predictor is particularly brilliant in that regard), but process has a bigger influence.
>
> If process was the only thing that mattered then how could Calxeda server nodes
> possibly beat Atom on both performance and power using an old 40nm process?
It's not the only thing that matters, but process is generally a bigger deal than microarchitecture. More to the point, clock-normalized comparisons are stupid. Power normalized actually makes sense. And when you compare power, process REALLY matters. Especially for something like 22nm FinFETs vs. 28nm bulk.
> > > comparing an unreleased CPU against a much older CPU, using different compiler
> > > versions or optimizing for specific benchmarks (SunSpider). It's called "benchmarketing"...
> >
> > > About the only area where Silvermont appears to have an
> > > advantage over A15 is a lower L2 latency. Everything
> > > else is like you said, smaller buffers, narrower, simpler
> > > and less aggressive. Given the memory system advantage
> > > I'd expect it to beat A9 by a good margin (although A9R4 might well be competitive). However based on what
> > > we know you'd have to be extremely optimistic to believe it can get even close to A15 performance.
> >
> > I claim BS already. If A15 is so good, why do partial register
> > stalls cause a massive drop in performance for
> > Neon? Oh right, maybe it's because someone made a stupid architectural decision they fixed in the A57.
>
> Do you have any evidence for that? Partial register stalls are rare
> on ARM, I don't believe they happen in common cases, unlike x86.
I had a long conversation with a friend (employed at a key ARM partner/customer) on this topic. When they turned on NEON, performance dropped significantly, which they traced back to partial register accesses and the aliased register files. It's partly a result of how the NEON pipeline integrates with the CPU and writes results back to registers, and of when partial register writes get merged.
Bottom line: ARM has made plenty of stupid design choices. Their designs are hardly perfect and they are learning.
David