By: Megol (golem960.delete@this.gmail.com), April 21, 2012 1:57 am
Room: Moderated Discussions
Kira (kirsc@aeterna.ru) on 4/20/12 wrote:
---------------------------
>Joel (joel.hruska@gmail.com) on 4/20/12 wrote:
>---------------------------
>>Linus Torvalds (torvalds@linux-foundation.org) on 4/20/12 wrote:
>>---------------------------
>>>bakaneko (nyan@hyan.wan) on 4/20/12 wrote:
>>>>
>>>>Or the much simpler explanation that the FPU doubled in the
>>>>amount of registers and better opcodes which made it
>>>>faster...
>>>
>>>Yeah, looking some more at the particular benchmarks, it
>>>looks like the ones that improved in a big way are all
>>>things that might just be AVX or FMA or something.
>>>
>>>So maybe it's not so much a fragile uarch, more of just
>>>specialized benchmarks (and a sign that gcc-4.7 does a
>>>reasonable job of vectorization, perhaps)
>>>
>>>Linus
>>
>>I think fragile is the better word. Bulldozer's problem is that it does less work
>>per clock cycle and takes a performance hit by combining front-end logic (as compared
>>to a hypothetical 'true' dual-core). Its cache latencies are significantly higher
>>than Thuban's as well -- cache misses hurt Bulldozer more, and it shares more cache data than Thuban did before it.
>>
>>All of this together creates a situation where it's very hard for BD to naturally
>>shine. It shifts the problem of optimization to the compiler. The problem with Bulldozer
>>is that if you don't have everything precisely aligned, performance turns south
>>very quickly. Even when you *do* have everything precisely aligned, what you get isn't that impressive.
>
>What was the purpose of using a shared decoder even supposed to be? Is the size/power
>overhead of a pair of 4-wide decoders really that large in a modern desktop/server CPU?
>
Yes. AMD's decoders are more capable than Intel's, but that also means they require more resources.
The fact that Intel chose to implement a µop cache instead of enhancing its decoders should be an indication, IMHO.
>Perhaps a single beefy 4-issue or 6-issue core with SMT would have been a smarter move.
I think AMD already considered and simulated such a design; in fact, beefy wide execution + SMT sounds like one of the K9 designs.



