By: Mark Roulo (nothanks.delete@this.xxx.com), July 11, 2013 3:54 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 11, 2013 3:12 pm wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on July 11, 2013 2:43 pm wrote:
> >
> > Technically, there is nothing esoteric about AVX
>
> I'd argue that AVX is esoteric for one simple reason: it assumes that vectors
> are so important that you want to waste tons of resources on them.
>
> And don't try to make out like it's not tons of resources. Yes, you can make a fairly bad implementation
> of AVX (ie not true 256-wide units), but even that is nasty for resource tracking and the pipeline, and it
> wouldn't actually generally be any faster than XMM, and so the people who want AVX would complain anyway.
>
> So it's almost certainly better to just not fake it, since
> there isn't all that big an installed base for it anyway.
What do you consider "tons of resources"?
The 512-bit wide vector unit in Larrabee consumed about 1/3 of the area of the processor core. The 256-bit wide vector unit in Haswell has to run at considerably higher clocks, but if we use the Larrabee figure as a starting point we get this:
Xeon Phi:
*) Die size: ~500 mm^2
*) Cores: ~64
*) Max die area per core: 500/64 ~= 8 mm^2
*) Vector unit per core: 1/3 of 8 ~= 3 mm^2
So ... the AVX vector unit is maybe 3 mm^2 per core. Double that to be pessimistic and we get 6 mm^2 per core.
A quad-core Haswell would thus blow 24 mm^2 on the AVX2 units. But roughly half of that would be needed for SSE4.x anyway, so the marginal cost of AVX2 is about 12 mm^2 in a quad-core chip.
Haswell quad-cores come in at around 200 mm^2 (depending on the IGP). So you are looking at less than 10% of the chip's die area.
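To put that arithmetic in one place, here is a minimal back-of-envelope sketch in C. Every input is just the rough figure quoted above, not a measured die area:

/* Back-of-envelope sketch of the area estimate above.
 * All inputs are the rough figures from this post, not measured die areas. */
#include <stdio.h>

int main(void) {
    double per_core      = 500.0 / 64.0;       /* ~8 mm^2 of Xeon Phi die per core, at most */
    double vec_unit      = 3.0;                /* ~1/3 of that, rounded up a bit */
    double pessimistic   = 2.0 * vec_unit;     /* double it to be pessimistic: 6 mm^2 */
    double quad_core     = 4.0 * pessimistic;  /* 24 mm^2 across a quad-core */
    double marginal_avx2 = quad_core / 2.0;    /* ~12 mm^2 once the SSE4.x half is paid for */
    double haswell_die   = 200.0;              /* ~quad-core Haswell die size */

    printf("die per Xeon Phi core: ~%.0f mm^2\n", per_core);
    printf("marginal AVX2 cost:    ~%.0f mm^2 = %.0f%% of a ~%.0f mm^2 Haswell die\n",
           marginal_avx2, 100.0 * marginal_avx2 / haswell_die, haswell_die);
    return 0;
}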
I'd venture that this is not great ... but what else should Intel spend the resources on?
*) The same 10% of area dedicated to scalar performance probably buys you a 1% or smaller increase in visible performance.
*) You could shrink the die and produce more chips per wafer, but Intel's problem right now is that the fabs aren't full, and with PC sales declining that doesn't look likely to change in the near term.
*) Go from 4 cores to 5? I'd venture that most 4-core chips already have cores idling almost all the time.
This is the same sort of reasoning that, I think, led to TSX: the vast majority of code won't care, but the folks who do care will be willing to pay quite a premium for chips that have it.
The HPC crowd will also pay a premium for chips with good numeric performance, so the extra 10% of die area spent on AVX2 may well be worth it if:
a) the fabs aren't full anyway, and
b) the HPC crowd will actually pay that premium for this feature.
For what it is worth, some of the stuff I work on will probably see a large speedup with AVX2 ... we're benchmarking now. And we tend to buy chips for dual-socket systems in the $600-$1,000 range.
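To give a flavor of the kind of kernel that benefits (a hypothetical illustration only, not our actual code): a single-precision dot product written with 256-bit AVX + FMA intrinsics, eight floats per iteration, assuming a Haswell-class target built with something like -mavx2 -mfma:

/* Hypothetical illustration only -- not the workload mentioned above.
 * Single-precision dot product using 256-bit AVX + FMA intrinsics
 * (both present on Haswell). Build with e.g. -mavx2 -mfma. */
#include <immintrin.h>
#include <stddef.h>

float dot_avx(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {                     /* 8 floats per iteration */
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);          /* acc += va * vb */
    }
    float partial[8];
    _mm256_storeu_ps(partial, acc);                  /* horizontal sum of the 8 lanes */
    float sum = 0.0f;
    for (int j = 0; j < 8; j++)
        sum += partial[j];
    for (; i < n; i++)                               /* scalar tail */
        sum += a[i] * b[i];
    return sum;
}

The same loop written with 128-bit XMM registers touches only four floats per iteration, which is where the hoped-for speedup comes from, memory bandwidth permitting.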
Now ... does that mean that Intel should ship AVX (or AVX2) in a low-power chip? Maybe not. But I don't think the conclusion that AVX is esoteric holds, given Intel's lack of other useful ways to spend the die area. The HPC crowd will pay a premium for numeric performance. What else would be useful to add to the mainstream chips?