By: David Kanter (dkanter.delete@this.realworldtech.com), April 20, 2012 5:32 pm
Room: Moderated Discussions
Joel (joel.hruska@gmail.com) on 4/20/12 wrote:
---------------------------
>EduardoS (no@spam.com) on 4/20/12 wrote:
>---------------------------
>>Kira (kirsc@aeterna.ru) on 4/20/12 wrote:
>>---------------------------
>>>What was the purpose of using a shared decoder even supposed to be? Is the size/power
>>>overhead of a pair of 4-wide decoders really that large in a modern desktop/server CPU?
>>
>>If you look, it's the biggest shared block after L2, at almost twice the size of the shared FPU,
>>
>>>Perhaps a single beefy 4-issue or 6-issue core with SMT would have been a smarter move.
>>
>>That would sacrifice clock speed; for workloads with low instruction-level parallelism,
>>a higher clock speed is preferred over a wider core.
>>
>>But apparently the target chosen was wrong, and SB busted (except in a few workloads) the old rule "average IPC < 1".
>>
>
>EduardoS,
>
>The first step in understanding Bulldozer is realizing that the chip doesn't make much sense. ;)
>AMD's stated reason for sharing so much of the front end was to reduce die space
>and offer most of the benefit of a traditional dual-core part in a fraction of the
>die space. This made a lot of sense at the time, particularly since Intel was already
>leading them by 12-18 months when it came to moving to new nodes.
>
>Both my own tests and those done elsewhere have indicated that sharing the front-end
>as it does "cost" Bulldozer between 10-20% of its theoretical performance. In and
>of itself, that's not bad -- compared to Thuban, they saved more than 10-20% die space (assuming all else equal).
It's very hard to measure that without using any performance analysis tools. I don't disagree that it's a significant hit in performance, but quantifying that is challenging.
>The problem is, all else *isn't* equal. AMD stuffed Bulldozer with cache (an eight-core
>BD has something like 16MB of cache compared to 10MB of cache for a 6-core Thuban).
Having lots of cache is good for server workloads.
>That blows the die-size savings apart...which might still be ok, if the caches were
>fast. They aren't. In fact, they're painfully slow. Because every L1 write is duplicated
>in L2, L1 write latency is effectively pinned to L2 write latency.
You are correct that the caches are painfully slow, but that's not the reason why. Frankly, I don't understand why the L1 is 4 cycles instead of 3. I REALLY don't understand why the L2 cache is so slow (20 cycles, really??), because size alone doesn't account for it. 12-14 cycles sounds much more reasonable.
The L3 cache is also quite slow, in part because of the slow L2 and in part because it runs asynchronously to the cores. If you look at those two factors together and assume a 14 cycle L2, you can probably cut the L3 latency down by ~10 cycles.
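To put rough numbers on how much those latencies could matter, here is a minimal sketch of an average load latency calculation. The 4-cycle L1 and the 20-cycle versus 14-cycle L2 figures come from the discussion above. The hit rates and the "beyond L2" latency (60 cycles, dropping by the ~10 cycles suggested above to 50) are purely hypothetical placeholders, not measurements, and the function name is made up for illustration.

```python
def average_load_latency(l1_hit_rate, l2_hit_rate,
                         l1_cycles, l2_cycles, beyond_l2_cycles):
    """Average load-to-use latency in cycles for a simple three-bucket model.

    Each *_cycles argument is the total latency of a load served at that level.
    l2_hit_rate is the local hit rate, i.e. the fraction of L1 misses that hit in L2.
    """
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return (l1_hit_rate * l1_cycles
            + l1_miss * l2_hit_rate * l2_cycles
            + l1_miss * l2_miss * beyond_l2_cycles)


# Hypothetical hit rates, chosen only to make the arithmetic concrete.
L1_HIT, L2_HIT = 0.95, 0.80

# 4-cycle L1 and 20-cycle L2 as discussed; 60 cycles beyond L2 is a placeholder.
shipped = average_load_latency(L1_HIT, L2_HIT, 4, 20, 60)

# Same L1, a 14-cycle L2, and ~10 cycles less for anything past the L2.
faster = average_load_latency(L1_HIT, L2_HIT, 4, 14, 50)

print(f"shipped: {shipped:.2f} cycles, hypothetical: {faster:.2f} cycles")
```

With these made-up hit rates the average moves by only a fraction of a cycle, because most loads hit the L1. The gap widens quickly for workloads with lower L1 hit rates, which is where the slow L2 and L3 would be expected to hurt most.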
The fact that the L1 is write-through is totally irrelevant to latency and should actually improve things, because AMD got rid of ECC on the L1. The stores go directly to the 4KB write combining cache and are then written back to the L2 on a deferred basis.
http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=9
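The deferred write-back idea can be sketched with a toy model. This is not a description of the real hardware: the 4KB capacity comes from the text above, while the 64-byte line size, the LRU replacement policy, and the class and method names are all assumptions made for illustration. The point is only that stores which land in the same line are combined, so the L2 sees far fewer writes than the core issues stores.

```python
from collections import OrderedDict

LINE_BYTES = 64        # assumed line size
WCC_BYTES = 4 * 1024   # 4KB write combining cache, as stated in the post


class WriteCombiningCache:
    """Toy write combining buffer: coalesces stores per line, writes to L2 lazily."""

    def __init__(self, capacity_bytes=WCC_BYTES, line_bytes=LINE_BYTES):
        self.line_bytes = line_bytes
        self.max_lines = capacity_bytes // line_bytes
        self.lines = OrderedDict()  # line number -> count of stores combined into it
        self.l2_writes = 0          # number of line write-backs sent to the L2

    def store(self, addr):
        """Record a store; only an eviction causes a write-back to the L2."""
        line = addr // self.line_bytes
        if line in self.lines:
            self.lines.move_to_end(line)        # mark as most recently used
        else:
            if len(self.lines) >= self.max_lines:
                self.lines.popitem(last=False)  # evict the LRU line (assumed policy)
                self.l2_writes += 1
            self.lines[line] = 0
        self.lines[line] += 1

    def drain(self):
        """Write back every line still held in the buffer."""
        self.l2_writes += len(self.lines)
        self.lines.clear()


wcc = WriteCombiningCache()
for i in range(1024):      # 1024 sequential 8-byte stores covering 8KB
    wcc.store(i * 8)
wcc.drain()
print(f"1024 stores -> {wcc.l2_writes} line write-backs to L2")
```

In this model the store itself never waits on the L2, which is the sense in which a write-through L1 does not have to pin store latency to L2 latency.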
>I still believe BD's biggest problem is its cache latencies, but the chip as it
>shipped last year is a badly flawed piece of work.
I think the cache hierarchy overall is definitely one of the biggest culprits. Hopefully they will fix things in the future.
DK