By: anon (anon.delete@this.anon.com), April 22, 2012 4:17 pm
Room: Moderated Discussions
EduardoS (no@spam.com) on 4/22/12 wrote:
---------------------------
>anon (anon@anon.com) on 4/22/12 wrote:
>---------------------------
>>I don't understand what you mean. I have seen lots of BD's benchmarks, yes. I have
>>also seen the statistics for its caches, and they are not good. End of story.
>
>Statistics like latency and bandwidth or statistics like real world benchmarks?
Both.
>
>>Oh, I was looking at 2.8GHz version.
>
>Surprise: on 32nm SOI, beyond 1.2V or so, getting a higher clock requires a much higher voltage.
>
>>Totally different power constraints and architecture, so again I don't know what this is proving.
>
>No power constraints, just timing. Timing is what makes a cache latency 4 cycles
>instead of 3. If at the same voltage BD hits a 33% higher clock than Llano, then 4 cycles
>on BD takes the same time as 3 cycles on Llano: same path length, or whatever. Not
>exactly the same, since BD will require one extra buffer, but that doesn't matter. Either
>you are a complete moron or you are making a big effort not to understand.
I understand latency and absolute latency. These are not the only factors.
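For anyone following along, the cycles-versus-time point being argued here can be sketched in a few lines. The base clock below is a made-up round number chosen only to make the arithmetic easy; it is not a measured frequency for Llano or BD.

```python
def latency_ns(cycles, clock_ghz):
    """Absolute latency in nanoseconds for a latency given in cycles."""
    return cycles / clock_ghz

# Hypothetical clocks, for illustration only (not real Llano/BD figures).
base_clock_ghz = 3.0
fast_clock_ghz = base_clock_ghz * 4 / 3  # the "33% higher clock" from the quote

three_cycles_slow_clock = latency_ns(3, base_clock_ghz)
four_cycles_fast_clock = latency_ns(4, fast_clock_ghz)

print(three_cycles_slow_clock, four_cycles_fast_clock)
```

Under these assumed clocks both latencies come out to the same absolute time, which is the part I am not disputing. It says nothing about cache size, associativity, or L2 latency.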
>
>>What I mean is that I don't know how much sense it makes to compare Llano with
>>its parent. And also even if you do know about AMD's 32nm process, then it doesn't
>>affect that BD is not a good architecture for that process. The real world is really
>>the only way to ultimately evaluate it.
>
>You can't say if the uarch is good or bad without knowing if the process is good
>or bad. Hell, I bet an 8086 on 32nm will outperform a SB on 1 micron. If you talk
>about uarch while ignoring the limitations of the process, you are a moron.
>
>>Wrong. POWER7 is manufactured on 45nm, and has 2 cycle load to use 32KB 8 way L1
>>cache, and clocks up to nearly the same as BD. It uses more power of course, but it does vastly more work per clock.
>
>And has a simple memory pipeline...
That explains it, does it?
>
>Now please, use your brain and tell me, why the same doesn't apply to SB?
The same what? It has 2x larger caches and 2x more ways, at the same latency.
>
>>Err, their L1 is tiny. This is my whole point, which is why BD's cache design is
>>not a good one. What is so hard to grasp about this?
>
>So your point is that BD cache sucks because it is different...
No, my point is that it sucks because it has a very small (and not very fast) L1 cache, and a very slow L2 cache. Stop putting words into my mouth.
>
>What an argument...
>
>Power7 32kB for 4 threads, 8kB per thread
>UltraSPARC T4 16kB for 8 threads, 2kB per thread
>Sandy Bridge 32kB for 2 threads, 16kB per thread
POWER7, 8 cycle load-to-use L2 cache.
SB, 12 cycle load to use L2 cache.
And also, in single threaded performance, the picture obviously changes significantly.
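The per-thread numbers in the quoted list, and how they change with one thread per core, can be reproduced with a rough sketch like this. The sizes and thread counts for POWER7, T4 and SB are the ones quoted above; the 16kB single-thread entry for BD is my assumption, taken from the "8 times" and "2 times" ratios in the quote.

```python
# L1 data cache size in kB and hardware threads per core.
# The BD entry is an assumption inferred from the ratios in the quoted post.
l1d = {
    "POWER7": (32, 4),
    "UltraSPARC T4": (16, 8),
    "Sandy Bridge": (32, 2),
    "BD": (16, 1),
}

# Share of L1D per thread with every hardware thread busy.
per_thread_kb = {name: kb / threads for name, (kb, threads) in l1d.items()}

# L1D available to a single thread running alone on the core.
single_thread_kb = {name: kb for name, (kb, _) in l1d.items()}

for name in l1d:
    print(name, per_thread_kb[name], single_thread_kb[name])
```

With all threads busy BD looks fine on this metric; with one thread per core POWER7 and SB have twice the L1D, and that is before counting the L2 latencies above.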
>
>Of course, with only one thread Power7 and SB will have more cache available, but
>single-threaded performance was hardly the priority; they
Yes, but that does not make it a good core. You can't just ignore the fact that single thread performance sucks.
Real world code, even a lot of parallel code, requires both.
>have such an amount because
>there is more than one thread per core. On L1 cache per core BD is no worse than
>any of the other designs, in fact up to 8 times higher than one of them. With only
>one thread it is smaller, but just 2 times.
>
>The only chip with more L1 data cache per thread is Greyhound, but it is also the
>chip with the most obsolete memory pipeline. Have you ever thought about how improving
>the memory pipeline (and here I'm talking about everything from the load/store units
>to the memory controller) may reduce the impact of an L1 miss*?
>
>So your argument about the L1 is not just weak, it is completely wrong.
What argument about the L1? My point is that L1 is small, and L2 is slow. The end.
>
>* Pre-emptive response: No, the big, 20 cycle L2 doesn't mean a weaker memory
>pipeline, and the fact that it is different from the others doesn't mean it is worse. The load/store
>units have improved a lot, and everything else was sized based on their capabilities
>(which were not present on Greyhound), not the other way around.
>
>Side note: When AMD was fighting Core 2 Duo it was common to see people on forums
>"suggesting" that AMD should drop the L3 cache and have a bigger L2 like Intel. When
>Intel released Nehalem, people started suggesting AMD should have a smaller and faster
>L2 instead of that big, slow and obsolete 512kB L2 cache. Funny, don't you think?
Your reply is full of rhetorical questions, putting words in my mouth, calling me a moron, and strawman arguments. It's amazing.