Article: Nehalem Performance Preview
By: Vincent Diepeveen, April 11, 2009 3:50 pm
Michael S on 4/10/09 wrote:
>Vincent Diepeveen on 4/10/09 wrote:
>>Jack on 4/9/09 wrote:
>>>Vincent Diepeveen on 4/7/09 wrote:
>>>>How can you draw a conclusion about Shanghai when you haven't even compared it head on with Nehalem yourself?
>>>David did note, at the beginning of the article, that Shanghai would be
>>>fairly characterized as slightly lagging Harpertown, in that it falls behind in some
>>>cases, achieves parity in others, and has some strong points.
>>>Considering that is roughly a good assessment, it can be extrapolated that Nehalem has opened up a wide margin.
>>>Nonetheless, you can search the databases yourself, the 5570 DP Xeon can range
>>>anywhere from 1.5 to 2x faster than a Shanghai 2P (2.7 GHz). I have not found one
>>>where Shanghai even comes close. This does not make Shanghai a bad CPU, but it
>>>does make it tough for AMD to market Shanghai against Nehalem.
>>Sales baloney, based upon a few cracked spec tests.
>>Both are nearly identical processors in performance for the software we tried.
>>Of course with HT and turboboost turned off, and Shanghai clocked a tad higher than the
>>E versions of Xeon, that gives Shanghai a slight edge in clock rate: 2.53 GHz vs 2.7 GHz for Shanghai.
>>Of course with more power budget Intel clocks higher.
>BTW, AMD submitted SpecJbb scores for a 2.9GHz Shanghai. In the past that was an
>indication that the next lower clocked part, i.e., 2.8GHz, would soon be available in
>the normal thermal envelope. So there is hope for a 2.8GHz 75W Shanghai coming.
>>If you look to the intel documents in what i7 can execute it is SSE2+ instructions
>>a cycle max. That gives 8 flops as a max, with or without HT. Is that so much higher than AMD?
>>Multiplication is not faster than at AMD in throughput; in fact, if you try, the latencies
>>of AMD are better, so a good programmer CAN be faster at AMD.
>Let's follow your own logic. Floating-point addition is faster (= has shorter latency)
>on Intel. Should we conclude that "a good programmer CAN be faster at Intel"?

There is 1 unit that is doing multiplication,
there are a lot that can do addition.

Addition has a latency of 0.5 cycle at Intel, so to speak, and 0.33 cycle or so at AMD (I could be off by 0.17 or so, as I checked the i7 handbooks quickly a while ago for all kinds of stuff, not the AMD ones).

Multiplication is important for FFT and matrix calculations. Adding goes rather quick. Enough units to do it. Just 1 for multiplication.

So that matters, the rest doesn't except when it becomes a bottleneck.
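
To put rough numbers on that, here is a minimal sketch, not a measurement. It counts the multiplies and adds in a naive n x n matrix product and gives a lower bound on cycles when every unit retires one operation per cycle. The unit counts (1 multiply unit, 3 add units) and the function names are illustrative assumptions, not figures taken from the Intel or AMD manuals.

```python
from math import ceil


def matmul_op_counts(n):
    """Return (multiplies, adds) for a naive n x n times n x n matrix product."""
    muls = n * n * n          # one multiply per inner-loop step
    adds = n * n * (n - 1)    # n - 1 adds to sum each dot product of length n
    return muls, adds


def min_cycles(muls, adds, mul_units=1, add_units=3):
    """Lower bound on cycles if each unit retires one operation per cycle.

    mul_units and add_units are hypothetical values chosen for illustration.
    """
    return max(ceil(muls / mul_units), ceil(adds / add_units))


if __name__ == "__main__":
    muls, adds = matmul_op_counts(64)
    print("multiplies:", muls, "adds:", adds)
    print("cycle lower bound:", min_cycles(muls, adds))
```

Under these assumptions the bound equals the multiply count, so the single multiplier sets the pace and the adds hide behind it. Change add_units to 1 and the bound stays the same for this kernel, since there are slightly fewer adds than multiplies.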

>BTW, what sort of multiplication has shorter latency on AMD? I can think only of
>integer 64x64=>128b, which is very rare in hand-coded asm and never generated by compilers.
>>Yet these differences are so tiny that any claim there of 50%+ is total baloney.
>It is absolutely correct that on single-threaded dense computational kernels running
>out of L1/L2 cache and at the same clock frequency, Intel's Merom, Penryn, Nehalem
>as well as AMD's Greyhound (Barcelona) and Shanghai cores are all within a few
>percent of each other. If the kernels do not include packed SSE instructions then
>you could add Intel's Dothan, Yonah and AMD Hammer to the same list. Actually, in
>my own scalar kernels I found Nehalem rather consistently lagging behind Penryn,
>although that was not the case for SIMD kernels.
>However, that absolutely correct observation has nothing to do with the industry standard benchmarks on real CPUs, which are:
>A. Multithreaded
>B. Run at different clock frequencies
>C. Not dense
>D. Able to fit in on-chip cache on one CPU but not on another (that is what hurts Barcelona most)
>Points A+C and D give a big edge to Nehalem because it has both SMT and much faster
>external memory access than any of its competitors.
>>All these claims are just compiler and L3 based, for just a few software programs.
>>If you see clearly how moving from intel c++ 10.0 to 11.0 is a huge improvement
>>at core2 already, obviously the compiler team did do a great job.
>Are you sure that the Oracle and IBM JVMs are compiled with the Intel compiler?
>>I see i7 as a very logical step from core2, yet for performance a tiny step, if
>>you look at single socket. Of course this allows Intel now to scale to 2 and maybe soon 4 sockets.
>>>However, if you are truly interested in how Nehalem stacks up against Shanghai,
>>>you can search the following, scores for both processors are now in the database:
>>A $100 billion company, well it used to be, that just gets benchmarked for a few
>>programs, that's not a very good idea.
>>Turboboost, HT, more power budget, it sure helps to look better.
>>It's not a help in practice: the X series eat so much more power that they
>>can raise their clock frequency a lot more with turboboost than the E series,
>>whereas HPC centers will nearly always be buying the E series, and some, datacenters
>>too, have already announced turboboost will get turned off. If I look at ebay
>>now, the L series 54xx are still a lot more expensive than the E series; how come?
>>>Also, using a QX9770 in the comparison is not a bad idea, but it is also not a
>>>server branded CPU. It is irrelevant anyway, a few hundred MHz won't change the result.
