By: Exophase (exophase.delete@this.gmail.com), May 17, 2013 8:52 am
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on May 17, 2013 8:00 am wrote:
> That's not trickery, that's life. Intel has better process technology and is able to
> hit higher clock speeds. Moreover, there are many A15 implementations that are incredibly
> power hungry. This shouldn't surprise anyone, since the A15 started out as a server
> core...but then something happened and ARM tried to shove it into mobiles.
>
First of all, what many A15 implementations? There are two in products: Exynos 5250 on Samsung's 32nm and Exynos 5410 on Samsung's 28nm. So far no one has done much work characterizing perf/W over several intermediate points; all we've seen is peak perf vs peak power, which paints an incomplete picture.
Second, do you have any support for this claim that A15 was designed for servers and repurposed for mobiles? The idea that a 32-bit core was designed specifically for servers seems odd to me, especially when coupled with ARM's own forecasts for server market share during this time period which were pretty modest.
You could make a stronger argument that it was designed for tablets over phones. A15 can easily achieve better perf/W than Bobcat (and probably Jaguar).
> Certainly there are architectural techniques that can have a big impact (I think the A7 omitting
> a branch predictor is particularly brilliant in that regard), but process has a bigger influence.
I don't know where you get your information; you should read the A7 TRM. A7 has branch prediction consisting of: a four-entry BTIC (which avoids the pipeline bubble and will at least hit for loops), an eight-entry BTB (they call it a BTAC) used exclusively for indirect branches (something A8 and A9 lack), an 8-bit global history feeding a 256-entry GHB, and an eight-entry return stack. The buffers are very small even compared to A8, but they compensate somewhat by having more specialized logic in place. The small GHB will reduce branch prediction accuracy all around, while the impact of the small BTIC will depend on how big a miss penalty it carries; if it's something like two cycles that's not too bad, but if it's more like four that could hurt.
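To make concrete what an 8-bit history / 256-entry GHB buys you, here's a toy model I threw together. The gshare-style indexing (history XORed with the branch address) is my assumption, not something the TRM specifies, and the counter initialization is arbitrary:

```python
# Toy model of a 256-entry GHB of 2-bit saturating counters indexed by an
# 8-bit global branch history, in the spirit of the A7's small predictor.
# The exact indexing scheme is assumed (gshare-style), not taken from the TRM.

class TinyGHB:
    def __init__(self, history_bits=8):
        self.mask = (1 << history_bits) - 1
        self.history = 0
        # 2**8 = 256 two-bit counters, initialized weakly not-taken.
        self.counters = [1] * (1 << history_bits)

    def index(self, pc):
        # gshare-style: fold the branch address into the global history.
        return (self.history ^ pc) & self.mask

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask

# A short loop branch (taken 7 times, then falls through) is learned quickly,
# since the 8-bit history uniquely identifies each position in the pattern:
bp = TinyGHB()
hits = 0
for trial in range(100):
    for i in range(8):
        taken = i < 7
        hits += bp.predict(0x40) == taken
        bp.update(0x40, taken)
print(hits)  # accuracy climbs well above 90% once the pattern is learned
```

The flip side, which is the point about the small GHB: with only 256 counters, any workload touching more than a handful of branches with long history patterns starts aliasing destructively.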
> I claim BS already. If A15 is so good, why do partial register stalls cause a massive drop in performance for
> Neon? Oh right, maybe it's because someone made a stupid architectural decision they fixed in the A57.
I haven't heard about this and will have to run some tests to see what the impact is. I do know that I alias 64-bit and 128-bit registers pretty heavily, and I've never witnessed NEON performance worse per clock than it is on A8 (A9 I haven't tested), where that architectural decision fits well.
I've heard that a single 128-entry PRF is used for both integer and NEON/VFP registers, making it faster to move from NEON to ARM registers, which was a big glass jaw in A8. IMO that alone doesn't sound worth optimizing for at much expense to everything else, but I don't know the real tradeoffs involved. But if it's true, and ARM and NEON registers have no problem aliasing to each other, then I don't see why 64-bit and 128-bit NEON registers would have problems aliasing.
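For anyone not familiar with the aliasing in question: in ARMv7 NEON, each quadword register qN overlays the doubleword pair d(2N) and d(2N+1), so writing d1 dirties the top half of q0. The mapping itself is architectural; the helper names here are just mine:

```python
# ARMv7 NEON register file viewed as 16 quadwords (q0..q15), where each
# qN aliases the doubleword pair d(2N), d(2N+1). This overlap is exactly
# what a renamed PRF has to track to avoid partial-register merges.

def q_to_d(n):
    """Doubleword registers overlaid by qN."""
    assert 0 <= n < 16
    return (f"d{2 * n}", f"d{2 * n + 1}")

def d_to_q(n):
    """Quadword register that dN lives in."""
    assert 0 <= n < 32
    return f"q{n // 2}"

print(q_to_d(0))   # ('d0', 'd1') -- writing d1 modifies the upper half of q0
print(d_to_q(31))  # 'q15'
```

A write to d1 followed by a 128-bit read of q0 is where a stall could bite: the rename logic either merges the halves or carries a dependency on both producers.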
If it is true, is it really worse than x86's partial register issues with 8-bit accesses, which ding various uarchs (but not all of them)? You sound eager to single ARM out.
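The x86 case I mean is the AL/AH/AX/EAX overlap: a byte write only touches part of the architectural register, so a later full-width read needs a merge, which on some uarchs means a stall or an extra merge uop. A quick sketch of the semantics (just the architectural behavior, not any particular core's renaming):

```python
# x86 partial registers: AL and AH are the low and high bytes of AX, which
# is the low half of EAX. A byte write must preserve the rest of the
# register, so reading EAX afterward depends on a merge of old and new bits.

def write_al(eax, byte):
    """Write the low byte (AL), preserving the rest of EAX."""
    return (eax & ~0xFF) | (byte & 0xFF)

def write_ah(eax, byte):
    """Write the second byte (AH), preserving the rest of EAX."""
    return (eax & ~0xFF00) | ((byte & 0xFF) << 8)

eax = 0xDEADBEEF
eax = write_al(eax, 0x42)
print(hex(eax))  # 0xdeadbe42 -- bits 8..31 had to be merged in, not replaced
```

Whether that merge is free, a stall, or an inserted uop is exactly the per-uarch variation I'm referring to.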