Article: Parallelism at HotPar 2010
By: Mark Roulo (nothanks.delete@this.xxx.com), August 6, 2010 7:53 am
Room: Moderated Discussions
Richard Cownie (tich@pobox.com) on 8/4/10 wrote:
---------------------------
>Gabriele Svelto (gabriele.svelto@gmail.com) on 8/4/10 wrote:
>---------------------------
>>That's debatable, Nehalem doesn't seem to offer much improvement in per-core performance
>>over Core 2 (in my experience at least)
>
>I daresay there are particular examples for which that
>is true. But my own experience with a big app is
>completely the opposite: just a couple of weeks ago
>I ran the exact same executable on a Core2 Xeon 2.93GHz
>and a Nehalem Xeon 2.93GHz, and got 1.51x speedup.
>And these are still 45nm parts without the TurboBoost
>trick.
>
>That's a pure single-threaded app, so there's no benefit
>from the hyperthreading.
>
>It seems like a really big win. And I get the impression
>that most people see Nehalem that way.
>
>You're welcome to have your opinion, based on your own
>experience. But I don't think it matches what most
>people have measured.
>
I think that Intel was trying to do several things with/for the Nehalem core:
1) They wanted to increase scalar performance for single-threaded loads.
2) They wanted to be better at throughput loads.
3) They wanted to be able to scale the individual chips all the way from laptops to servers (but not worry about the 1-5 watt target).
Nehalem was the result.
They achieved (1) for most loads by adding the on-chip memory controller. This cuts the latency of going to main memory, which seems to matter a lot for most integer loads (not all, of course), especially loads with lots of pointer chasing (like Java, maybe). They did some micro-architectural optimizations, too, but I'm pretty sure that the big win was the on-chip memory controller.
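To make "pointer chasing" concrete, here is a minimal sketch of the access pattern I have in mind. The names build_chain and chase are made up for illustration, and Python's interpreter overhead hides the actual memory latency, so treat this as a picture of the dependency structure rather than a benchmark. The point is that each step needs the value loaded by the previous step before it knows where to go next, so the loads can't be overlapped and every cache miss pays the full trip to main memory.

```python
import random

def build_chain(n, seed=0):
    """Build a 'next index' table that forms one cycle through all n slots.

    Uses Sattolo's algorithm, which yields a permutation with a single
    cycle, so a walk from any slot visits every slot before returning.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)          # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, start=0):
    """Follow the chain from start until it comes back; return hop count."""
    hops = 0
    i = nxt[start]
    hops += 1
    while i != start:
        i = nxt[i]                    # next address depends on the value just loaded
        hops += 1
    return hops

if __name__ == "__main__":
    chain = build_chain(1_000_000)
    print(chase(chain))               # number of dependent loads in one lap
```

In a compiled language with a table much bigger than the caches, the time per hop in a loop like chase is roughly the main-memory latency, which is exactly the thing an on-chip memory controller shortens.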
They went after (2) with the on-chip memory controller as well (now multiple chips don't choke on the FSB) and also increased the per-chip bandwidth. Loads that my company used to run were strangled by the FSB; now performance scales much more linearly.
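A back-of-the-envelope way to see the FSB effect is a toy model like the one below. The function names and all the numbers are invented for illustration (they are not measurements of any real chip), and the model ignores things like remote-socket memory accesses. It only captures the basic idea: with one shared bus, total throughput stops growing once the bus is saturated, while with a memory controller per socket the ceiling rises as sockets are added.

```python
def shared_bus_throughput(cores, per_core_demand, bus_bandwidth):
    """Total memory throughput when every core shares one bus (FSB-style)."""
    return min(cores * per_core_demand, bus_bandwidth)

def per_socket_throughput(sockets, cores_per_socket, per_core_demand, socket_bandwidth):
    """Total throughput when each socket has its own memory controller."""
    per_socket = min(cores_per_socket * per_core_demand, socket_bandwidth)
    return sockets * per_socket

if __name__ == "__main__":
    # Hypothetical figures in GB/s, chosen only to show the shape of the curves.
    for cores in (1, 2, 4, 8):
        print(cores, shared_bus_throughput(cores, 2.0, 10.0))
    for sockets in (1, 2):
        print(sockets, per_socket_throughput(sockets, 4, 2.0, 10.0))
```

With these made-up numbers the shared bus flattens out at 10 GB/s by 8 cores, while two sockets with their own controllers deliver 16 GB/s for the same 8 cores.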
Finally, the underlying building blocks were supposed to be easily "composable": 2, 4, 6 or 8 cores, big L3 or small L3, Hyperthreading enabled or disabled. That let them add whacking-large caches to make server chips. I actually expected a bigger range of L3/core sizes than we've seen. This composability enables (3), a full range of chips (with, I'm guessing, minimal re-qualification).
On balance, I am quite impressed with the effort. The single-threaded performance went up for most loads compared to Core2 chips (based on SPECint, I think; I checked a while ago, but might be misremembering). Obviously not for all loads, but for many/most. Like mine, when running only one thread at a time. :-)
The highly threaded loads also benefited quite a bit.
Could Intel have focused more on single-threaded performance? Sure. But it isn't like single-threaded performance was ignored with Nehalem, either.
What is interesting is to watch AMD with the upcoming Bulldozer core(s). This *really* looks like a chip biased in favor of throughput computing. I'm not sure that this is a good idea for the x86 market, but I'm also not convinced that AMD can duke it out with Intel on single-threaded performance, either.
-Mark Roulo