Article: Parallelism at HotPar 2010
By: Michael S (already5chosen.delete@this.yahoo.com), August 5, 2010 3:21 pm
Room: Moderated Discussions
Richard Cownie (tich@pobox.com) on 8/5/10 wrote:
---------------------------
>
>We needed big memory (256GB). Beyond that, I'm in a sizable
>corporation where the detailed purchasing choices are
>taken elsewhere, in a separate organization :-(
>It doesn't surprise me that we've got a
>Core-based machine that kinda sucks. But then we've also
>got a Nehalem-based machine that kinda sucks, because
>the state of the art is 32nm Westmere/Gulftown at up
>to 3.6GHz. So in that sense the playing field is level.
>
Actually, I'd say that in 2007-2008 the E7220 was a pretty smart buy for a single-threaded application that wants a lot of memory. It can go cheaply to 64 GB and without extraordinary expense to 128 GB, has uniform latency over all its memory space, and you don't have to plug in more than one CPU.
It was possible to have the same amount of memory with Opteron, but it would take plugging in 4 relatively expensive CPUs and (remember, Shanghai didn't exist yet) dealing with small cache and non-uniform latency. Or, alternatively, finding a relatively rare dual-socket Opteron board with 16 DIMM sockets and filling it up with slow 266MHz DDR1 memory. Or, maybe, with Barcelona you could use 400MHz DDR2 and still have 4 DIMMs per channel? I've already forgotten the details.
>Anyhow, I think another possible explanation of Intel's
>choices is that Core2 architecture was optimized for
>45nm and at most 4 cores on a die;
In fact, Core2 has either 6 cores per die (Dunnington) or 2 cores per die (the rest).
> whereas Nehalem was
>optimized for 32nm and up to 6 cores on a die. And
>that leads to different choices about cache sizes and
>latencies and all kinds of other stuff.
I prefer the same theory as Gabriele - Nehalem is optimized first and foremost for throughput. Improvements in single-thread performance come either as by-products of enhancements in system architecture (IMC, smart power management with turbo-boost) or from enhancements in micro-architecture that are orthogonal to latency-vs-throughput trade-offs (fast rep movsd, fast unaligned SIMD loads/stores, better loop detector). On the other hand, the major changes in cache hierarchy are IMHO hurting single-threaded performance much more often than they are helping.
Another piece of evidence is the absence of high-end dual-cores in the Nehalem product line. It seems very probable that at 32 nm Intel is capable of producing very fast dual-core chips, maybe around 4.5 GHz for the top bin. However, they decided to keep the fastest dual-cores at 3.6 GHz, i.e. just 4% above quad-core and just 8% above hexa-core.
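To put rough numbers on that, here is a small sketch of the arithmetic. It uses only the figures quoted above (3.6 GHz, 4%, 8%, and the 4.5 GHz guess); the quad-core and hexa-core clocks are back-derived from the percentages rather than taken from a spec sheet, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the clock figures quoted above.
# Inputs from the post: 3.6 GHz dual-core, "4% above quad-core",
# "8% above hexa-core", and a guessed 4.5 GHz dual-core top bin.
# The quad/hexa clocks are derived from those percentages, not looked up.

def implied_clock(faster_ghz, pct_above):
    """Clock of the slower part, given the faster part is pct_above percent higher."""
    return faster_ghz / (1.0 + pct_above / 100.0)

def headroom_pct(faster_ghz, slower_ghz):
    """By how many percent faster_ghz exceeds slower_ghz."""
    return (faster_ghz / slower_ghz - 1.0) * 100.0

DUAL_ACTUAL_GHZ = 3.6  # fastest dual-core, per the post
DUAL_GUESS_GHZ = 4.5   # speculative top bin, per the post

quad_ghz = implied_clock(DUAL_ACTUAL_GHZ, 4)
hexa_ghz = implied_clock(DUAL_ACTUAL_GHZ, 8)

print(f"implied quad-core top bin: {quad_ghz:.2f} GHz")
print(f"implied hexa-core top bin: {hexa_ghz:.2f} GHz")
print(f"4.5 GHz dual-core vs quad: +{headroom_pct(DUAL_GUESS_GHZ, quad_ghz):.0f}%")
print(f"4.5 GHz dual-core vs hexa: +{headroom_pct(DUAL_GUESS_GHZ, hexa_ghz):.0f}%")
```

If the 4.5 GHz guess is anywhere near right, the single-thread clock advantage being left on the table is on the order of 30%, not the 4-8% that actually shipped.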
>
>We do also have some Core2 3.2GHz blades with smaller
>memory configurations, and those seem to beat our
>Nehalem 2.66GHz equivalents on some benchmarks. Though
>they also seem really flaky to me, so I avoid them
>as much as possible.
>
Flaky? My English is not good enough to decipher the meaning.