Article: Parallelism at HotPar 2010
By: Michael S (already5chosen.delete@this.yahoo.com), August 7, 2010 11:12 am
Room: Moderated Discussions
Carlie Coats (coats@baronams.com) on 8/7/10 wrote:
---------------------------
>>David Kanter (dkanter@realworldtech.com) on 8/3/10 wrote:
>>---------------------------
>>> I get the sense that they made a deliberate decision to optimize
>>> for throughput at the cost of per-core performance (whereas
>>> Intel has not made that choice for mainstream parts).
>
>Gabriele Svelto (gabriele.svelto@gmail.com) on 8/4/10 wrote:
>> That's debatable, Nehalem doesn't seem to offer much improvement
>> in per-core performance over Core 2 (in my experience at least)
>> and where it does it's marginal and partly due to its ability
>> to dynamically "out-clock" Core 2 on single/lightly threaded
>> workloads. Some of the choices Intel made in Nehalem were clearly
>> detrimental for that type of workloads (like split L2 caches)
>> but offered significant advantages for throughput.
>
>Richard Cownie (tich@pobox.com) on 8/4/10 wrote:
>> I daresay there are particular examples for which that
>> is true. But my own experience with a big app is
>> completely the opposite: just a couple of weeks ago
>> I ran the exact same executable on a Core2 Xeon 2.93GHz
>> and a Nehalem Xeon 2.93GHz, and got 1.51x speedup.
>> And these are still 45nm parts without the TurboBoost
>> trick.
>>
>> That's a pure single-threaded app, so there's no benefit
>> from the hyperthreading.
>
>For multi-threaded apps, we have a benchmark on a North America
>scale MM5 meteorology model.
>
>            Xeon 5460   Xeon 5570   speedup for 5570
>
>16-core     6621 sec    2727 sec    2.428
>
>32-core     5722 sec    1859 sec    3.078
>
>We're clearly climbing the scaling-curve for this benchmark,
>but still the results are quite suggestive. And clearly
>the Nehalem is not only faster, it also scales better.
>
>FWIW.
Neither Harpertown (Penryn core) nor Gainestown (Nehalem core) "scales" natively above an 8-core system. So comparing scaling from 16 cores to 32 cores says much more about the InfiniBand interconnects used in the respective clusters than about the CPUs or memory.
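To make the comparison concrete, here is a minimal sketch (assuming the quoted runtimes pair up as listed) that computes the per-configuration speedup and the 16-to-32-core scaling factor for each CPU from the numbers in the post:

```python
# Runtimes in seconds from the MM5 benchmark quoted above
# (pairing of columns to CPUs is taken from the post).
times = {
    "Xeon 5460 (Harpertown)": {16: 6621, 32: 5722},
    "Xeon 5570 (Gainestown)": {16: 2727, 32: 1859},
}

# Speedup of the 5570 over the 5460 at each core count
for cores in (16, 32):
    speedup = times["Xeon 5460 (Harpertown)"][cores] / times["Xeon 5570 (Gainestown)"][cores]
    print(f"{cores}-core speedup for 5570: {speedup:.3f}")

# Scaling from 16 to 32 cores for each CPU; a perfect doubling would give 2.0
for cpu, t in times.items():
    scaling = t[16] / t[32]
    print(f"{cpu}: 16->32 scaling {scaling:.3f} (efficiency {scaling / 2.0:.0%})")
```

The scaling factors (roughly 1.16x for Harpertown vs. 1.47x for Gainestown when doubling from 16 to 32 cores) are what the "scales better" claim rests on, and since both systems span multiple nodes at those sizes, the interconnect is in the measurement path.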