number of sockets is wrong metric (was: New article: 8 socket commodity servers)

Article: 8-Socket Commodity Servers: Flourish or Perish?
By: Vincent Diepeveen (diep.delete@this.xs4all.nl), March 14, 2010 12:31 pm
Room: Moderated Discussions
longtimelurker (rwt@nospam.maibaums.net) on 3/14/10 wrote:
---------------------------
>Vincent Diepeveen (diep@xs4all.nl) on 3/9/10 wrote:
>
>>
>>I'm not an AMD fanboy, but really this goes too far.
>>
>>You realize it's the start of 2010 and you write about a cheap solution that has been there for
>>many years now, totally unrivalled in price by Intel: "in theory".
>>
>
>I have to agree that the article seems very intel-biased. 8S-Opteron was a viable
>platform if you really needed the number of memory channels and most importantly
>the total memory size. They didn't scale even close to linearly compared with 4S,
>but were still significantly faster. If you just care about number of cores and
>can parallelize well, you would build a cluster of 1S or 2S machines anyway.

This is dead wrong in the case of my software.

Latency to RAM on a cluster is *far worse*, and you usually need to use MPI there as well. There isn't really good SSI (single system image) software out there: either the SSI software doesn't work at all with the interconnect, or it has some other problem, like using the wrong packet size for communication, not being able to migrate shared memory, or not having migration at all.

Actual practical latency on a cluster is closer to, say, 5-7 microseconds to get a cacheline from a random memory position (so we are not speaking of streaming byte latency here, which is just bandwidth in another form).

The cheapo 8 socket machines you speak of have a latency of only some hundreds of nanoseconds.

That is a huge difference, a factor of 10 or more in latency, and in fact quite crucial to a lot of software.
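
To put rough numbers on that gap, here is a small sketch. The 5-7 microseconds comes from the figures above; reading "some hundreds of nanoseconds" as 300-500 ns is my own assumption for illustration, not a measured value.

```python
# Latencies in nanoseconds for fetching one random cacheline.
CLUSTER_NS = (5_000, 7_000)   # "5-7 microseconds" from the post
SMP_NS = (300, 500)           # ASSUMPTION: "some hundreds of nanoseconds" read as 300-500 ns

def probes_per_second(latency_ns):
    """Upper bound on back-to-back blocking probes one thread can issue per second."""
    return 1e9 / latency_ns

low_ratio = min(CLUSTER_NS) / max(SMP_NS)    # best case for the cluster
high_ratio = max(CLUSTER_NS) / min(SMP_NS)   # worst case for the cluster

print(f"cluster is {low_ratio:.0f}x to {high_ratio:.0f}x slower per probe")
print(f"probes/s per thread: cluster {probes_per_second(max(CLUSTER_NS)):,.0f}"
      f" vs 8-socket {probes_per_second(max(SMP_NS)):,.0f}")
```

Under those assumed bounds the ratio lands somewhere between 10 and the low twenties, depending on which ends of the two ranges you compare.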

Furthermore, it is usually really complicated to rewrite a shared memory program to MPI. You also suddenly move from a lazy concept to an active polling model; as Quadrics has gone bankrupt, there aren't any interconnects now AFAIK that support a shared memory model in a sane manner.

Some newer forms of MPI are supposed to 'incorporate it', but in practice it works in an ugly way. Until then you will need a thread that receives the packets and that checks for and avoids overflows.

So you have spaghetti code everywhere in your software, which a shared memory program doesn't have. And there is the huge overhead of MPI.
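
A toy sketch of what that extra machinery looks like. This is NOT MPI: it uses Python's standard `queue` and `threading` modules as a stand-in, and the names `run_demo`, `inbox` and `capacity` are made up for illustration. It only shows the shape of the problem: a dedicated receiver thread plus overflow handling on the sending side, neither of which a plain shared memory read needs.

```python
import queue
import threading

def run_demo(n_messages=1000, capacity=64):
    """Stand-in for message passing: a bounded buffer plus a dedicated receiver thread."""
    inbox = queue.Queue(maxsize=capacity)   # finite receive buffer
    received = []
    dropped = 0

    def receiver():
        # Dedicated thread that does nothing but drain the buffer.
        while True:
            msg = inbox.get()
            if msg is None:                 # sentinel: the sender is done
                return
            received.append(msg)

    worker = threading.Thread(target=receiver)
    worker.start()
    for i in range(n_messages):
        try:
            inbox.put_nowait(i)             # non-blocking send
        except queue.Full:
            dropped += 1                    # overflow the sender has to detect and handle
    inbox.put(None)                         # blocking put, so the sentinel always arrives
    worker.join()
    return len(received), dropped

got, lost = run_demo()
print(f"received {got}, dropped on overflow {lost}")
```

How many messages get dropped depends on thread scheduling, which is exactly the kind of thing you end up writing retry and bookkeeping code for.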

All that with latencies that are a factor of 20 slower.

Factor 20 is a LOT.

And we are still talking about best-case performance.

All those numbers from MPI-type clusters are theoretical latencies. They measure 1 thread that is busy full-time receiving the MPI packets, which simply isn't what happens in reality. You can't afford to lose CPU power like that.

What actually happens in reality is that if you have 1 network card with, nowadays, say 32 cores hammering on that single network card, a lot of cores will suffer the latency of the NIC itself: its switch latency from thread to thread.

Not seldom that's 100 microseconds or more. The fact that a manufacturer like Quadrics specified it as 50 microseconds and other manufacturers did NOT want to specify it at all is self-explanatory.

This is a real pain to program for.

So for software like game tree search, where latency to a shared hashtable is crucial for the performance of the entire program, it's a huge difference.

And we haven't discussed the price yet.

What's the price of a good 8 socket setup versus the price of a cluster of similar performance?

Just the network alone is in fact already more expensive than the entire 8 socket machine.

Let's assume we buy a 48 core Magny-Cours box for the price AMD quotes on its weblog, which is roughly $8500.

That's 48 cores times 2.2 GHz or something = 105.6 GHz.

The speedup of my software on 48 cores would be, say, roughly 36 out of 48 (it is a guess).

So that's 105.6 GHz * 36 / 48 = 79.2 GHz.
Let's round it off to 80 GHz.
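
The same arithmetic as a few lines of Python, for anyone who wants to plug in their own guesses. The clock speed and the speedup of 36 are the rough figures from above, not measurements.

```python
CORES = 48
CLOCK_GHZ = 2.2          # "2.2Ghz or something" in the post
GUESSED_SPEEDUP = 36     # the post's guess: 36x out of 48 cores

raw_ghz = CORES * CLOCK_GHZ
effective_ghz = raw_ghz * GUESSED_SPEEDUP / CORES
efficiency = GUESSED_SPEEDUP / CORES

print(f"raw: {raw_ghz:.1f} GHz, effective: {effective_ghz:.1f} GHz, "
      f"parallel efficiency: {efficiency:.0%}")
```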

Now we want to make a cheapo cluster. First problem: there aren't any single socket 12 core processors.

So we take the budget processors we have available, ok?

Intel we can skip directly, as its CPUs are too expensive.

So we buy Phenom II nodes of 3.4 GHz. Say 500 euro per node, network excluded.

What's the equivalent cluster compared to that 48 core box?
Right, it's quite a lot.

Effective speedup: maybe I would even manage 20% efficiency out of a cluster. Who knows? Let's optimistically guess 20%.

80 GHz * 5 = 400 GHz needed.

A single quad-core delivers 3.4 * 4 = 13.6 GHz.

So we need 400 GHz / 13.6 GHz = 29.4, so at least a 30 node cluster.

So in practice we need a 32 node cluster to rival a single 48 core box for this type of game tree search.

Now the next issue is the price of the network of that 32 node cluster. Let's say the network costs 1000 euro per node, plus another 5000 euro for the switch.

So the total cost then is
1500 euro * 32 nodes + 5000 euro = 53000 euro.

We haven't discussed your power bill yet.

2 socket machines aren't going to make the calculation any easier, as the latency to the network then becomes an even bigger problem: each machine gets 2 times the number of nodes per second.
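
Here is the whole cluster estimate as a sketch, so the guesses are easy to vary. All inputs are the rough figures from this post; rounding 400 / 13.6 up gives a minimum of 30 nodes, and the 32 is my practical round-up from there.

```python
import math

TARGET_EFFECTIVE_GHZ = 80      # the 48 core box, rounded
CLUSTER_EFFICIENCY = 0.20      # optimistic guess for the cluster
NODE_GHZ = 3.4 * 4             # one quad-core 3.4 GHz node
NODE_PRICE_EUR = 500           # per node, network excluded
NETWORK_PER_NODE_EUR = 1000
SWITCH_EUR = 5000
NODES_BOUGHT = 32              # the post rounds up to 32 nodes

raw_needed_ghz = TARGET_EFFECTIVE_GHZ / CLUSTER_EFFICIENCY
min_nodes = math.ceil(raw_needed_ghz / NODE_GHZ)
network_eur = NODES_BOUGHT * NETWORK_PER_NODE_EUR + SWITCH_EUR
total_eur = NODES_BOUGHT * (NODE_PRICE_EUR + NETWORK_PER_NODE_EUR) + SWITCH_EUR

print(f"raw GHz needed: {raw_needed_ghz:.0f}, minimum nodes: {min_nodes}")
print(f"network alone: {network_eur} euro, total: {total_eur} euro")
```

Note that the 48 core box price is quoted in dollars and the cluster in euro, so the two totals are not directly comparable without picking an exchange rate.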

Please note this is an optimistic calculation. I'll skip the technical details of why, but there is another loss of 20% somewhere.

Designing a new algorithmic setup specifically for that cluster, in order to suffer less from all this, is really interesting, but also really full-time work.

On a shared memory 8 socket machine you can simply use your normal SMP algorithm with a few small tweaks and some testing (which also takes a few months to get right).

>Which brings me to the actual point I'm trying to make: Isn't the number of sockets
>an antiquated metric? What you are really interested in with those big boxes is
>the total number and throughput of the memory channels and that they're reasonably low latency from any core.
>IBM showed off Nehalem systems at Cebit that used external QPI links to attach
>extra enclosures just full of memory, and with Magny-Cours AMD just stuffed significantly
>more bandwidth and lower latency into 4S than they previously had at 8S. Not sure
>if there'll be an 8S Magny-Cours, but it'll likely suffer from similar problems
>as the previous Opteron 8000. Maybe someone will build a 6S, the Chips are called Opteron 6000 after all.
>-Felix

It is about the price we pay for good latency from each core to the shared memory. How many sockets you intend to do that with, I couldn't care less. Build me a 1024 core box that can deliver that latency for a cheap price, I'd say, if it doesn't matter from your viewpoint.

Realize that the couple of hundred nanoseconds of RAM latency that such an 8 socket setup gives is already really tough to program for.

Comparing that to a cluster makes zero sense.

Vincent