By: Michael S (already5chosen.delete@this.yahoo.com), July 13, 2013 12:13 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 13, 2013 12:08 pm wrote:
> That was very informative of you. However, you seem to forget that we've
> actually gone through this before, in the MMX->XMM transition with SSE2.
>
> So we actually know how people tried to save transistors and effort before, when the 64->128
> bit expansion happened. It was AMD back then, but they did exactly the simple half-wide approach.
> And it didn't work out all that well. They had exactly the issues I brought up.
>
No, both AMD and Intel: the P-III, all P4 variants, and all three generations of the Pentium M.
And it worked. Less so for double precision, because 2x SIMD is often not enough to bother with. But for single precision it worked pretty well. And those were "fat" chips with a 3-wide front end, 3-wide retirement and decent renaming. In a lean chip like Silvermont, the gain will be bigger.
> And we had that same "unaligned loads are slow" issue, which was a disaster
> too, and eventually fixed - by basically doubling the memory access path.
>
AVX helps with this issue as well.
Without AVX, an unaligned load+op either traps with an exception or you have to use a non-standard, AMD-only control register. Or something like that; I don't remember the full details.
With AVX, an unaligned load+op is perfectly legal. Fast or not depends on the hardware, but the unaligned-exception crap no longer stands in the way. And I don't see a reason why, on a 128-bit implementation, a 128-bit-aligned 256-bit-wide access would be any slower than two separate 128-bit accesses. I also know, from the optimization reference manual, that it is *not* slower on Jaguar. On the other hand, I do see reasons why it could be faster, especially outside of inner loops.
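To make the load+op point concrete, here is a minimal sketch in C intrinsics (mine, not from the thread; the remarks about generated code assume typical compiler behavior). Legacy SSE arithmetic instructions fault if their memory operand is not 16-byte aligned, so an unaligned source needs a separate movups-style load; the VEX-encoded (AVX) forms accept unaligned memory operands directly, so the same source line can compile to a single load+op instruction once AVX code generation is enabled.

  #include <immintrin.h>

  /* Add 4 floats from a possibly unaligned pointer into an accumulator.
   * Compiled for plain SSE, the unaligned load cannot be folded into
   * addps (its memory operand must be 16-byte aligned), so you get
   * movups + addps. Compiled with AVX enabled, the compiler may fold
   * it into one vaddps with an unaligned memory operand, since
   * VEX-encoded load+op does not fault on misalignment. */
  static inline __m128 acc4(__m128 acc, const float *p)
  {
      return _mm_add_ps(acc, _mm_loadu_ps(p)); /* unaligned load + add */
  }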
Even in the physically unaligned case, i.e. when a 256-bit access hits a location that is not 128-bit aligned, relatively trivial hardware can do it at the cost of three basic 128-bit accesses. When you do the same in 128-bit pieces, it will cost you four basic accesses.
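To spell out the 3-versus-4 arithmetic (a small sketch of mine, assuming the load unit fetches naturally aligned 16-byte chunks; the helper name is made up): one misaligned 32-byte access straddles three aligned 16-byte chunks, while splitting it into two misaligned 16-byte accesses touches two chunks each, four in total.

  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical helper: how many naturally aligned 16-byte chunks an
   * access of `size` bytes starting at `addr` touches. */
  static unsigned chunks16(uintptr_t addr, unsigned size)
  {
      uintptr_t first = addr & ~(uintptr_t)15;              /* first aligned chunk */
      uintptr_t last  = (addr + size - 1) & ~(uintptr_t)15; /* last aligned chunk  */
      return (unsigned)((last - first) / 16 + 1);
  }

  int main(void)
  {
      uintptr_t a = 0x1008;  /* not 16-byte aligned */
      printf("one 32-byte access  : %u chunks\n", chunks16(a, 32));      /* 3 */
      printf("two 16-byte accesses: %u chunks\n",
             chunks16(a, 16) + chunks16(a + 16, 16));                    /* 4 */
      return 0;
  }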
> So we've seen the the whole SSE->AVX thing before, just in
> the guise of MMX->SSE. The issues aren't that different..
>
> But at the same time there's a big difference: the upsides
> have shrunk. SSE is "good enough" for most things.
>
Wait, and you will see how Silvermont can't keep its 128-bit-wide execution units even half busy, and how Jaguar, which also has 128-bit-wide execution units, runs circles around Silvermont while happily executing AVX code paths.
> Five years from now? Who knows? Maybe people will clamor for AVX. And with another shrink or two, the
> costs are smaller too. Right now I'm not seeing it. You (or anybody else, for that matter) haven't brought
> up any realistic case that would be relevant on mobile or microservers that would warrant AVX.
>
> So why are you so convinced AVX makes sense?
>
> Linus
Not for the reasons I stated above.
But because it makes programmers' lives (mine, in particular) easier.
I don't write applications for the "horizontal" market. I write for a "vertical" one.
So, even today, when the installed base is predominantly non-AVX, I can easily ignore the non-AVX case. In those rare cases where we do not supply computers to the customer together with the software, we can simply tell them to buy up-to-date stuff.
So far so good, but Silvermont changes the rules - it *is* up to date, but it does not run AVX. So either we have to support (and test) a non-AVX variant of the software for many more years than we want, or we have to ban Silvermont. Guess what: 90% chance that we'll choose the latter.
And it's a pity, because Silvermont *is* promising in some situations, and in some not-uncommon cases it will have (or would have, with AVX) sufficient performance even for demanding applications like ours. And the tablet form factor does open up new possibilities.
You say: then buy Jaguar, if you like it so much? But most likely Jaguar simply won't be available [in tablets] from tier-1 and tier-2 OEMs. As it looks now, Jaguar is going to end up as a netbook-only chip.