By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), July 12, 2013 10:31 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 12, 2013 9:16 am wrote:
>
> But I still don't think that AVX makes sense in a small chip. The engineering trade-offs just aren't there.
Side note: the biggest part of "tons of resources" is likely the internal buses and the memory subsystem in particular.
Making the actual execution units wider is probably not too painful. It's more transistors (and in a mobile chip, more power), but being intelligent about clock gating etc is something Intel already has to do, so in the end the wider execution units probably wouldn't hurt much except when used, and then they'd quite possibly help more than they'd hurt.
And making the execution units full-width probably helps simplify things that matter more, and that you can't turn off easily: the pipeline control units etc. So on the whole I suspect execution units are fairly cheap, and almost certainly preferable over adding pipeline complexity for running AVX ops twice through half-wide units etc.
But the thing that kills wide vectors in mobile is that you need to be able to feed those units to actually get any of the upside. That means really wide internal buses, and in particular it means a very beefy memory subsystem. Doing full 256-bit paths inside the core itself is already likely somewhat painful, but to the memory subsystem? That's worse.
And the thing is, for many vector loads, you don't just want the aligned case. You might be ok with just "unit alignment" (i.e. 4 bytes for vectors of singles), but for some loads you really want byte alignment. Just look at the history of "load aligned" vs proper unaligned loads. So you'd really want 2x the 256-bit buses to the caches (particularly for reads), so that unaligned accesses run well and the aligned case runs really well.
Sure, there are all those idiotic benchmarks that just check the latency of single ops, and don't actually need any memory accesses, because they just feed the same register back over and over. But apart from bad benchmarks, nobody cares about those. You need to feed the beast.
Do you think it's coincidence that Haswell has those studly memory units? Do you think it's just random that Haswell does two 256-bit loads per cycle from its L1 cache? Nope. It's what you need to do to actually take advantage of AVX.
Now, I'm a huge fan of good memory units, but I do not believe you can reasonably afford (at least yet) the kind of units that Haswell has in a mobile part.
And without the ability to feed AVX, what's the point, really? XMM is a much better fit, and gets you most of the low-hanging vectorization fruit. Certainly gets you most of the existing code-base.
Linus