By: David Kanter (dkanter.delete@this.realworldtech.com), May 17, 2013 8:38 pm
Room: Moderated Discussions
> > power hungry. This shouldn't surprise anyone, since the A15 started out as a server
> > core...but then something happened and ARM tried to shove it into mobiles.
> >
>
> First of all, how many A15 implementations are there? There are two in products, Exynos 5250 on Samsung's 32nm and
> Exynos 5410 on Samsung's 28nm. So far no one has done much work characterizing perf/W over several intermediate
> points; all we've seen is peak perf vs. peak power, which paints an incomplete picture.
Intel's comparisons were peak to peak and then also normalized to 1.5W total power for the CPU blocks.
> Second, do you have any support for this claim that A15 was designed for servers and repurposed for mobiles?
It's something I've heard from half a dozen people, although I've also heard contrary evidence (recently).
> The idea that a 32-bit core was designed specifically for servers seems odd to me, especially when coupled
> with ARM's own forecasts for server market share during this time period which were pretty modest.
>
> You could make a stronger argument that it was designed for tablets over phones.
> A15 can easily achieve better perf/W than Bobcat (and probably Jaguar).
It's quite possible that the design point was tablets+servers rather than phones.
> > Certainly there are architectural techniques that can have a big impact (I think the A7 omitting
> > a branch predictor is particularly brilliant in that regard), but process has a bigger influence.
>
> I don't know where you get your information; you should read the A7 TRM. A7 has branch prediction consisting
> of: a four-entry BTIC (which avoids the pipeline bubble and will at least hit for loops), an eight-entry BTB
> (they call it a BTAC) used exclusively for indirect branches (something A8 and A9 lack), an 8-bit global history
> (256-entry GHB), and an 8-entry return stack. The buffers are very small even compared to A8, but they somewhat
> compensate for this by having more specialized logic in place. The small GHB will reduce branch prediction
> accuracy all around, but the impact of the small BTIC will depend on how large its miss penalty is; if
> it's something like two cycles that's not so bad, but if it's more like four that could hurt.
You're right, I overstated that. I talked with the lead A7 architect about this, and he mentioned the smaller branch prediction structures as one of the key techniques that enabled them to reduce the core's size.
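For a sense of scale, here is a generic global-history predictor sketch in C. It is not ARM's actual indexing scheme (which isn't documented); it's the textbook gshare-style arrangement, shown only to illustrate how little state an 8-bit history plus a 256-entry GHB of 2-bit counters really is (64 bytes of counters plus one byte of history):

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative only: a gshare-style predictor with the same nominal
       sizes as the A7's GHB. Not ARM's actual indexing scheme. */
    static uint8_t counters[256];  /* 2-bit saturating counters (stored in bytes) */
    static uint8_t history;        /* 8-bit global taken/not-taken history        */

    static bool predict(uint32_t pc)
    {
        uint8_t idx = (uint8_t)(pc >> 2) ^ history;  /* hash branch PC with history  */
        return counters[idx] >= 2;                   /* predict taken if counter >= 2 */
    }

    static void update(uint32_t pc, bool taken)
    {
        uint8_t idx = (uint8_t)(pc >> 2) ^ history;
        if (taken  && counters[idx] < 3) counters[idx]++;
        if (!taken && counters[idx] > 0) counters[idx]--;
        history = (uint8_t)((history << 1) | (taken ? 1u : 0u));
    }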
> > I claim BS already. If A15 is so good, why do partial register stalls cause a massive
> > drop in performance for NEON? Oh right, maybe it's because someone made a stupid
> > architectural decision they fixed in the A57.
>
> I haven't heard about this and I'll have to do some tests to see what the impact is. I do know I alias
> 64-bit and 128-bit registers pretty heavily and I've at least never witnessed NEON performance worse
> per clock than it is on A8 (and A9, haven't tested) where that architectural decision fits well.
How tightly intermingled are your 64b and 128b accesses? It's also possible the problem is with writes to one of the 4 data elements within a NEON register. I didn't exactly get code samples here.
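To be concrete about the kind of pattern I'm asking about, here's a guess at the shape of the code (I didn't get samples, so this is purely illustrative; the function is made up):

    #include <arm_neon.h>

    /* NEON Q registers alias pairs of D registers (q0 = d0:d1), and lane
       inserts write only 32 bits of a register, so tightly mixing widths or
       doing per-lane writes is where a partial-register penalty would be
       expected to show up, if there is one. Illustrative sketch only. */
    float32x4_t mixed_width_example(float32x4_t q, float32x2_t d, float s)
    {
        /* 64-bit (D) write into the low half of a 128-bit (Q) value, then a
           full-width Q read: a classic narrow-write/wide-read sequence. */
        float32x2_t lo = vadd_f32(vget_low_f32(q), d);
        float32x4_t merged = vcombine_f32(lo, vget_high_f32(q));

        /* Write a single 32-bit lane, then consume the whole Q register. */
        merged = vsetq_lane_f32(s, merged, 1);
        return vmulq_f32(merged, merged);
    }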
> I've heard that a single 128-entry PRF is used for both integer and NEON/VFP registers, making it faster
> to go from NEON to ARM registers, which was a big glass jaw in A8. IMO this alone doesn't sound like
> something worth optimizing for at much expense to anything else, but I don't know the real tradeoffs
> involved. But if it's true, and ARM and NEON registers have no problem aliasing to each other, then
> I don't see why 64-bit and 128-bit NEON registers would have problems aliasing instead.
>
> If it is true, is it really worse than x86's partial register issues with 8-bit accesses,
> which ding various uarchs (but not all of them)? You sound eager to single ARM out.
Yes I am. My point is that ARM's microarchitectures are not perfect, something Wilco doesn't seem to comprehend. It's quite damning that almost every single ARM customer has chosen to avoid the A15 for phones.
We already know Intel is quite capable of producing rotten CPU designs (see Atom), but oddly enough those still seem to have pretty good performance relative to something like the A9.
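For reference, the x86 partial-register case mentioned above is the 8-bit write followed by a full-width read. A minimal sketch of the pattern in GCC inline asm (x86 only, illustrative of the pattern rather than a claim about any particular core):

    #include <stdint.h>

    /* An 8-bit write to AL followed by a full-width read of EAX forces a
       merge of old and new bits, which stalls or costs extra uops on some
       x86 microarchitectures but not others. Compilers generally know to
       avoid emitting this; shown here only to illustrate the pattern. */
    uint32_t partial_reg_merge(uint32_t x)
    {
        __asm__("movb $0x5a, %%al\n\t"  /* partial write: only AL changes         */
                "addl %%eax, %%eax"     /* full-width read of EAX: requires merge */
                : "+a"(x)               /* x lives in EAX, read and written       */
                :
                : "cc");
        return x;
    }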
DK