By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), May 19, 2013 5:19 am
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on May 17, 2013 9:38 pm wrote:
> > > power hungry. This shouldn't surprise anyone, since the A15 started out as a server
> > > core...but then something happened and ARM tried to shove it into mobiles.
> > >
> >
> > First of all, what many A15 implementations? There are two in products, Exynos 5250 on Samsung's 32nm and
> > Exynos 5410 on Samsung's 28nm. So far no one has done much
> > work in characterizing perf/W over several intermediate
> > points, all we've seen is peak perf vs peak power. Which paints an incomplete picture.
>
> Intel's comparisons were peak to peak and then also normalized to 1.5W total power for the CPU blocks.
>
> > Second, do you have any support for this claim that A15
> > was designed for servers and repurposed for mobiles?
>
> It's something I've heard from half a dozen people. Although I've also heard contrary evidence (recently).
What about reading ARM's press release when A15 was originally announced, ARM's A15 product pages or any articles about A15? Not a single resource claims that A15 is server-only. I'm not sure who is trying to claim that (your Intel contacts?), but it is just BS, and frankly you should know better than to fall for it.
> > The idea that a 32-bit core was designed specifically for servers seems odd to me, especially when coupled
> > with ARM's own forecasts for server market share during this time period which were pretty modest.
> >
> > You could make a stronger argument that it was designed for tablets over phones.
> > A15 can easily achieve better perf/W than Bobcat (and probably Jaguar).
>
> It's quite possible that the design point was tablets+servers rather than phones.
Still incorrect - tablets aren't mentioned anywhere in any A15 announcement. Originally the design point was 1-1.5GHz single/dual cores for phones, 1.5-2GHz dual core for home entertainment, and 1.5-2.5GHz quad/octal core servers. That's still on the A15 product pages.
Obviously quad-core smartphones running close to 2GHz in 2013 are a bit more than ARM originally expected...
> > > Certainly there are architectural techniques that can have a big impact (I think the A7 omitting
> > > a branch predictor is particularly brilliant in that regard), but process has a bigger influence.
> >
> > I don't know where you get your information; you should
> > read the A7 TRM. A7 has branch prediction consisting
> > of: a four entry BTIC (which avoids the pipeline bubble
> > and at least will hit for loops), an eight entry BTB
> > (they call it BTAC) used exclusively for indirect branches
> > (something A8 and A9 lack), an 8-bit history (256
> > entry GHB), and an 8-entry return stack. The buffers are very small even compared to A8 but they somewhat
> > accommodate for this by having more specialized logic in place. The small GHB will reduce branch prediction
> > accuracy all around but the impact of the small BTIC will depend on how much of a miss penalty that has; if
> > it's something like two cycles that's not that bad but if it's more like four that could hurt.
>
> You're right I overstated that. I talked with the lead A7 architect about this, and he mentioned the
> smaller branch prediction structures as one of the key techniques that enabled them to reduce size.
The reason why A7 doesn't need a BTB for direct branches is that it can compute the destination of a branch during fetch. However, a 256-entry GHB seems a bit small (that size was last used on ARM11!) despite the shallow pipeline and low branch mispredict penalty. Compared to 32 or 64KB total I+D caches, even quadrupling the GHB to 2048 bits would seem trivial.
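To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes 2 bits per GHB entry, which is what the 2048-bit figure for a quadrupled 256-entry table implies; the function names are made up for illustration and the cache figure counts data bits only (no tags).

```python
def ghb_bits(entries, bits_per_entry=2):
    """Storage for a history table, assuming 2-bit counters per entry."""
    return entries * bits_per_entry


def fraction_of_cache(bits, cache_kib):
    """Share of a cache's data array (in KiB, tags ignored) that `bits` would take."""
    return bits / (cache_kib * 1024 * 8)


current = ghb_bits(256)         # the 256-entry GHB
quadrupled = ghb_bits(4 * 256)  # four times as many entries

for cache_kib in (32, 64):
    share = fraction_of_cache(quadrupled, cache_kib)
    print(f"{quadrupled} bits vs {cache_kib}KB of cache: {share:.2%}")
```

Under those assumptions the quadrupled table is 256 bytes, i.e. under 1% of even the 32KB case.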
> > > I claim BS already. If A15 is so good, why do partial register
> > > stalls cause a massive drop in performance for
> > > Neon? Oh right, maybe it's because someone made a stupid architectural decision they fixed in the A57.
> >
> > I haven't heard about this and I'll have to do some tests to see what the impact is. I do know I alias
> > 64-bit and 128-bit registers pretty heavily and I've at least never witnessed NEON performance worse
> > per clock than it is on A8 (and A9, haven't tested) where that architectural decision fits well.
>
> How tightly intermingled are your 64b and 128b accesses? It's also possible the problem is with writes
> to one of the 4 data elements within a neon register. I didn't exactly get code samples here.
Neon processes things in multiples of 64 bits so I've never heard about stalls from mixing double and quad registers. However 32-bit writes are a bad idea if you read the register as 64 bits afterwards:
vmov s0,r0
vmov s1,r1
vadd d0,d0,d1
The vmov d0,r0,r1 instruction was added in ARMv5TE so that this case would not cause stalls. Badly written code could certainly cause stalls on most ARMs, but the idea is to avoid writing such code...
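As an illustration of why the single instruction is a drop-in replacement, here is a small model of the register aliasing (s0 is the low half of d0, s1 the high half). It only models the architectural result, not timing; whether the two-write form actually stalls depends on the core. The helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def write_s(d, lane, value):
    """Model a 32-bit write into one half of a 64-bit D register.

    lane 0 is the low half (s0 within d0), lane 1 the high half (s1).
    """
    shift = 32 * lane
    cleared = d & ~(MASK32 << shift) & MASK64
    return cleared | ((value & MASK32) << shift)


def vmov_d(lo, hi):
    """Model vmov d0,r0,r1: one full-width write, r0 low and r1 high."""
    return ((hi & MASK32) << 32) | (lo & MASK32)


r0, r1 = 0x11111111, 0x22222222

# Two partial writes: a later 64-bit read of d0 has two producers.
d0_partial = write_s(write_s(0, 0, r0), 1, r1)

# One full write: a later 64-bit read of d0 has a single producer.
d0_full = vmov_d(r0, r1)

print(hex(d0_partial), hex(d0_full))
```

Both forms leave the same value in d0; the difference is only in how many writes a following 64-bit read has to wait for and merge.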
> > If it is true is it really worse than x86's partial register issues with 8-bit accesses,
> > which ding various uarchs (but not all of them)? You sound eager to single ARM out.
>
> Yes I am. My point is that ARM's microarchitectures are not perfect, something Wilco doesn't seem to comprehend.
Don't put words in my mouth; I have never said that. The simple fact is A15 is the most advanced low power CPU available right now. That doesn't mean it is perfect - as a compiler writer I know better than most people that every CPU has its peculiarities, and A15 is no different. But using a single (unconfirmed) example of a stall to claim that all Neon code is slower on A15 is ridiculous. So far I've been pleasantly surprised by how fast Neon executes on A15; for example, it is the first ARM core that does a 128-bit read per cycle, even if unaligned.
> It's quite damning that almost every single ARM customer has chosen to avoid the A15 for phones.
I remember you claiming we would never see quad-core phones when Tegra 3 was announced, and look where we are now - even low-end phones are mostly quad-core... I bet your A15 prediction will be just as wrong.
> We already know Intel is quite capable of producing rotten CPU designs (see Atom), but oddly
> enough those still seem to have pretty good performance relative to something like the A9.
Certainly not on single-threaded performance. The only redeeming feature of Atom is Hyperthreading, which allows it to give the performance equivalent of 1.5-2 cores without HT. And it had better, given its enormous die size.
Wilco