Article: Parallelism at HotPar 2010
By: Michael S (already5chosen.delete@this.yahoo.com), August 7, 2010 10:12 am
Room: Moderated Discussions
Carlie Coats (coats@baronams.com) on 8/7/10 wrote:
---------------------------
>>David Kanter (dkanter@realworldtech.com) on 8/3/10 wrote:
>>---------------------------
>>> I get the sense that they made a deliberate decision to optimize
>>> for throughput at the cost of per-core performance (whereas
>>> Intel has not made that choice for mainstream parts).
>
>Gabriele Svelto (gabriele.svelto@gmail.com) on 8/4/10 wrote:
>> That's debatable, Nehalem doesn't seem to offer much improvement
>> in per-core performance over Core 2 (in my experience at least)
>> and where it does it's marginal and partly due to its ability
>> to dynamically "out-clock" Core 2 on single/lightly threaded
>> workloads. Some of the choices Intel made in Nehalem were clearly
>> detrimental for that type of workloads (like split L2 caches)
>> but offered significant advantages for throughput.
>
>Richard Cownie (tich@pobox.com) 8/4/10 wrote
>> I daresay there are particular examples for which that
>> is true. But my own experience with a big app is
>> completely the opposite: just a couple of weeks ago
>> I ran the exact same executable on a Core2 Xeon 2.93GHz
>> and a Nehalem Xeon 2.93GHz, and got 1.51x speedup.
>> And these are still 45nm parts without the TurboBoost
>> trick.
>>
>> That's a pure single-threaded app, so there's no benefit
>> from the hyperthreading.
>
>For multi-threaded apps, we have a benchmark on a North America
>scale MM5 meteorology model.
>
>           Xeon 5460   Xeon 5570   speedup for 5570
>
>16-core    6621 sec    2727 sec    2.428
>
>32-core    5722 sec    1859 sec    3.078
>
>We're clearly climbing the scaling-curve for this benchmark,
>but still the results are quite suggestive. And clearly
>the Nehalem is not only faster, it also scales better.
>
>FWIW.
BTW, why X5460?
Of all the Harpertown Xeons, the X5460 appears to be the least suited for HPC clusters, since it combines a high 120W TDP with a relatively modest 1333 MT/s system bus. If you go for 120W, you could as well buy the 1600 MT/s X5472. But the low-power L5430 seems the most attractive. In a memory-bandwidth-starved environment it would give nearly the same performance as the X5460 in less than half the power envelope.
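
For anyone who wants to check the arithmetic behind the quoted MM5 numbers, here is a minimal Python sketch. The times are copied from the table quoted above. The names `times`, `speedup` and `scaling_summary` are made up for illustration, and "efficiency" here simply means the 16-to-32-core gain divided by the ideal factor of 2, which is my assumption about how one would want to measure "scales better".

```python
# Wall-clock times in seconds, copied from the quoted MM5 benchmark table.
# Keys are core counts; the dictionary layout is only for illustration.
times = {
    "Xeon 5460": {16: 6621, 32: 5722},
    "Xeon 5570": {16: 2727, 32: 1859},
}


def speedup(slow_seconds, fast_seconds):
    """Return how many times faster the second run is than the first."""
    return slow_seconds / fast_seconds


def scaling_summary(table):
    """For each CPU, compute the gain from doubling 16 -> 32 cores.

    'efficiency' is that gain divided by the ideal factor of 2
    (an assumed definition, not something stated in the thread).
    """
    summary = {}
    for cpu, runs in table.items():
        gain = speedup(runs[16], runs[32])
        summary[cpu] = {"gain_16_to_32": gain, "efficiency": gain / 2}
    return summary


if __name__ == "__main__":
    # Cross-CPU speedup at each core count (matches the table's last column).
    for cores in (16, 32):
        s = speedup(times["Xeon 5460"][cores], times["Xeon 5570"][cores])
        print(f"{cores}-core: 5570 is {s:.3f}x faster than 5460")

    # How much each CPU gains from doubling the core count.
    for cpu, row in scaling_summary(times).items():
        print(f"{cpu}: 16->32 gain {row['gain_16_to_32']:.2f}x, "
              f"efficiency {row['efficiency']:.0%}")
```

On these numbers the 5460 gains little from the second 16 cores while the 5570 gains noticeably more, which fits a bandwidth-starved picture for the older part.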