Article: AMD's Mobile Strategy
By: Exophase (exophase.delete@this.gmail.com), December 22, 2011 9:27 pm
Room: Moderated Discussions
David Kanter (dkanter@realworldtech.com) on 12/22/11 wrote:
---------------------------
>I think it's tough to discuss only a subset of the vector extensions. It's true
>that AVX might be irrelevant for phones, today. But I doubt it will stay that way
>forever and you might see direct comparisons in tablets.
It's true in the kind of tablets most people are interested in buying (not those running CPUs with > 10W TDP). It's not a foregone conclusion that AVX will ever hit these CPUs, and even if it does, the point is pretty moot: we don't know what it'll be competing against when it finally arrives. I'd probably give it more consideration if Atom even sported SSE4.x, which first appeared years ago.
>>If we want to keep it to just load/store behavior: NEON in ARMv7a can load or store
>>1-4 64-bit registers. Loads can perform de-interleaving and you can load single
>>elements (lanes in NEON parlance) instead of entire >vectors.
>
>Do you mask out the lanes?
You can load to single lanes or store single lanes. You can't load or store a variable set of lanes against a dynamic mask, which would be a very nice feature and one I hope ARM adopts. Given how poor main RAM performance is on a lot of platforms, you can spend a lot of extra time doing an RMW to do a partial update.
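To make the RMW cost concrete, here is a minimal portable C sketch (not real NEON intrinsics; the struct is just a stand-in for a 128-bit register) of what updating only some lanes in memory forces you to do when the ISA has no store-under-mask:

```c
#include <string.h>

/* Stand-in for a 128-bit vector register: four 32-bit lanes. */
typedef struct { float lane[4]; } vec4;

/* Emulate a masked store: write only the lanes selected by 'mask'.
 * Without hardware masked stores this costs a full load, a blend of
 * the selected lanes, and a full store - the RMW described above. */
static void masked_store_rmw(float *mem, const vec4 *src, unsigned mask)
{
    vec4 tmp;
    memcpy(&tmp, mem, sizeof tmp);      /* R: load the whole vector  */
    for (int i = 0; i < 4; i++)         /* M: blend selected lanes   */
        if (mask & (1u << i))
            tmp.lane[i] = src->lane[i];
    memcpy(mem, &tmp, sizeof tmp);      /* W: store the whole vector */
}
```

The full-width load is where the pain comes from: on a platform with slow main RAM, a partial update that should have been one store turns into a round trip through memory.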
>You're still using a single address right? It looks like they probably handle cache line crossing loads, which is nice.
Yes, there's no gather/scatter support. There are 32-index byte shuffles ala XOP, although without all the extra features. And on Cortex-A8/A9 they're pretty slow at that.
You can specify aligned or unaligned for vector memory accesses. Accesses specified as aligned that aren't will fault. On Cortex-A8/A9 unaligned NEON accesses are handled by splitting them into two partial accesses. Which shows the hardware CAN support partial stores, or at least missing chunks from either end if not necessarily both. I don't think the penalties for crossing cache lines go beyond what you'd pay for two distinct accesses.
One annoying gotcha I found on these processors is that NEON has no store-to-load forwarding (the main integer unit does). Normally this isn't a huge deal if you're careful, but it hits you unknowingly if you perform an unaligned store followed by an unaligned load to the address immediately after it, thanks to the partial accesses overlapping. If you're performing an RMW loop you pay for it pretty dearly, if you can't keep the loads ahead of the stores.
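The overlap is easy to miss because the two accesses don't touch the same bytes. A small C sketch of the mechanism, under the assumption that an unaligned 16-byte access gets split into partial accesses covering each 16-byte-aligned chunk it straddles: a store to [a, a+16) and a later load from [a+16, a+32) then both touch the middle chunk, and with no store-to-load forwarding the load has to wait for the store to reach the dcache.

```c
#include <stdbool.h>
#include <stdint.h>

/* Do the partial accesses of a 16-byte store at store_addr and a
 * 16-byte load at load_addr touch any common 16-byte-aligned chunk?
 * (Sketch of the hazard; chunk size is an assumption.) */
static bool partial_accesses_overlap(uintptr_t store_addr, uintptr_t load_addr)
{
    uintptr_t s_first = store_addr & ~(uintptr_t)15;        /* first chunk the store touches */
    uintptr_t s_last  = (store_addr + 15) & ~(uintptr_t)15; /* last chunk the store touches  */
    uintptr_t l_first = load_addr & ~(uintptr_t)15;
    uintptr_t l_last  = (load_addr + 15) & ~(uintptr_t)15;
    return s_first <= l_last && l_first <= s_last;          /* chunk ranges intersect */
}
```

An aligned store at 0 and load at 16 touch disjoint chunks, but shift both by a few bytes and the store's second partial access lands in the same chunk as the load's first one.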
>I seem to recall from talking with one of the architects at ARM that they really
>got rid of all >2 cycle load/stores. I think the main motivation was simplifying
>exceptions, pipeline control and consistency.
>
>That's probably OK because ISTR they now have 128b registers to deal with double precision.
It already has 128-bit registers. And while floating point ops are 2x32-bit per cycle on A8/A9 the more common integer ops are 128-bit/cycle. Loads and stores are too, although I've hit some other bottlenecks with them (but on A8 the dcache interface for NEON is certifiably not 64-bit. Don't know about A9.)
So you can perform 2 128-bit loads/stores (hence 4 64-bit) in 2 cycles. But it has to be aligned, or it's three cycles. And some of the more complex operations with them cost a fourth cycle.
For some reason I thought the v8 overview didn't make things clear, but I looked at it again. 4-vector loads and stores are still present, and have been extended to allow 64 or 128-bit registers. So unless the v8 supporting processor the architect was referring to has a 256-bit or multiple 128-bit dcache interfaces it's going to take more than two cycles to do those instructions.
Other than that it looks like they're mostly the same - somewhat bizarrely including the alternating register options. Possibly motivated by compatibility, although with everything else they'll be breaking with NEON it's hard to see that being a huge factor.
They also stick to the highly limiting memory addressing modes. Strictly [reg] with a sizeof increment or an increment by another reg. That's it. It's pretty understandable since they really don't have encoding space to play with, but it's still a pretty big pain (fortunately Cortex-A8 tends to be at its best when doing ARM and NEON code simultaneously, so it doesn't tend to hurt much to have ARM generating addresses - no idea about Cortex-A9).
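For readers who don't have the encodings memorized, those three forms can be modeled in C like this (a sketch, not intrinsics; the struct is a stand-in for a 128-bit Q register):

```c
#include <stddef.h>
#include <stdint.h>

/* NEON's only address forms:
 *   [Rn]      - plain base register
 *   [Rn]!     - base, then post-increment by the transfer size
 *   [Rn], Rm  - base, then post-increment by another register
 * No base+immediate offset and no scaled index, so anything fancier
 * needs the scalar pipeline to compute addresses. */
typedef struct { uint8_t b[16]; } q_reg;

/* [Rn]! : load 16 bytes, return the base advanced by the transfer size. */
static const uint8_t *ld1_post_size(q_reg *dst, const uint8_t *rn)
{
    for (int i = 0; i < 16; i++) dst->b[i] = rn[i];
    return rn + sizeof(q_reg);
}

/* [Rn], Rm : load 16 bytes, return the base advanced by register Rm. */
static const uint8_t *ld1_post_reg(q_reg *dst, const uint8_t *rn, ptrdiff_t rm)
{
    for (int i = 0; i < 16; i++) dst->b[i] = rn[i];
    return rn + rm;
}
```

The post-increment-by-register form at least lets you walk a 2D array with an arbitrary row stride, but for anything like base+offset addressing you're back to scalar adds.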
>
>It's all a matter of degrees. AVX will eventually hit low power chips, it's a question
>of when. Also, you will see VFP/Neon chips that are hitting higher power levels.
>So I think the comparison is quite informative.
>
>I'd liken it to Bobcat vs. Atom. They are really different designs, aimed at different
>markets. They overlap in some areas, thus there is room for comparisons. But we
>all know that Atom doesn't hit 20W and Bobcat doesn't go under 4.5W. So they each have 'unique' areas.
>
But with Atom and Bobcat Intel actually is selling Atoms with much higher TDP than the lowest power Bobcats. Bobcat goes a little higher and Atom goes much lower but there's a lot of overlap, and more significantly, those lower power Atoms aren't really being used anywhere right now. So it's a pretty meaningful comparison.
On the flip side, we have no idea just how long it'll take until we see convergence between power usage of AVX CPUs and NEON CPUs. But I definitely think it'll come with IB or maybe Haswell first, not Atom.
>Brands don't matter much, price points do. There are SNBs available for around $100. I'd expect that to drop with IVB.
>
That's far, far too high a price point to enter this discussion. I'm going to wait and see on IB; the current rumor is the same price as SB at the higher end (which seems like a good move).
>Besides, with Jen-Hsun Huang talking about how Kal-El is faster than Core2...it's
>not crazy to compare it against Sandy Bridge : P
>
Let's keep nVidia's fraudulent marketing out of this ;)
>It's unrealistic to expect ALL of Intel's products to have AVX 1 year after the
>first SNB. There are still folks who want less expensive parts.
>
>David
This isn't about expecting all of Intel's products to have AVX, it's about expecting Intel's CPUs with AVX to have it enabled. Intel's actions are not only unreasonable but completely unprecedented. Coppermine-128 had SSE, P4 Celerons had SSE2, Prescott Celerons had SSE3, etc., even though these were low-end, poorly performing chips - and they had them before the much better competition from AMD did (or maybe that had something to do with it?).
This generation Intel went crazy with market segmentation feature fusing. I understand how it works and why it works in general and I'm not arguing against performing any feature removal or underclocking or whatever to make other parts look better and sell for more. But some of the decisions seem pulled from a hat. Fusing off instruction sets that are not special purpose is a bad idea because it discourages adoption which is needed to make the instruction set useful.
Processors like Pentium and even Celeron are higher margin than processors like Atom - or at least Atom's going to have to become lower margin to survive. So long as Intel sees AVX as something only worth paying the highest margin for, I can't see them putting it in their lowest margin chips. Won't people wonder why Atoms have it if Pentiums don't?