By: Joel (joel.hruska.delete@this.gmail.com), April 20, 2012 2:59 pm
Room: Moderated Discussions
Linus Torvalds (torvalds@linux-foundation.org) on 4/20/12 wrote:
---------------------------
>bakaneko (nyan@hyan.wan) on 4/20/12 wrote:
>>
>>Or the much simpler explanation that the FPU doubled in the
>>amount of registers and better opcodes which made it
>>faster...
>
>Yeah, looking some more at the particular benchmarks, it
>looks like the ones that improved in a big way are all
>things that might just be AVX or FMA or something.
>
>So maybe it's not so much a fragile uarch, more of just
>specialized benchmarks (and a sign that gcc-4.7 does a
>reasonable job of vectorization, perhaps)
>
>Linus
I think fragile is the better word. Bulldozer's problem is that it does less work per clock cycle and takes a performance hit from sharing front-end logic between the two cores of a module (as compared to a hypothetical 'true' dual-core). Its cache latencies are also significantly higher than Thuban's, so cache misses hurt Bulldozer more, and it shares more of its cache between cores than Thuban did.
Taken together, this makes it very hard for BD to shine naturally, and it shifts the burden of optimization onto the compiler. If you don't have everything precisely aligned, performance goes south very quickly. Even when you *do* have everything precisely aligned, what you get isn't that impressive.



