By: Aaron Spink (aaronspink.delete@this.notearthlink.net), January 24, 2017 8:20 am
Room: Moderated Discussions
RichardC (tich.delete@this.pobox.com) on January 24, 2017 5:50 am wrote:
> Aaron Spink (aaronspink.delete@this.notearthlink.net) on January 23, 2017 7:02 pm wrote:
> > That's great, what about the TB+ of main memory?
>
> At current prices of about $7/GB, 1024GB of DRAM is about $7K. In a supercomputer targeting
> flock-of-chickens parallelizable apps, you want a large fraction of total system cost to be going
> into cpus rather than DRAM. Let's suppose the cost ratio is 1:1. If the cpu is in the smartphone-SoC
> range, say $30, then it would be matched with about 4GB DRAM; if it's in the desktop-CPU range, say
> $250, then it would be matched with about 32GB DRAM; if it's around $1000, match it with 128GB.
>
> My guess is that what makes sense is to target problems with high cpu requirements, but relatively
> low DRAM and low communication: if over 50% of your cost is in the DRAM, then the question of whether
> you have ARM cores or x86 cores is down in the noise - you're basically buying DRAM. So I'm thinking
> the sweet spot for an ARM-based system is probably with cpu chips in the desktop-cpu range of
> $250 (with area and transistor count which give high yield), matched with relatively little DRAM,
> e.g. 8GB or 16GB. For some problems maybe 4GB is enough.
>
> So you probably have something that looks a lot like a bunch of desktop PCs on a network, but
> presumably packaged densely into a rack, with a decent interconnect, and with a multicore-CPU + GPGPU
> combination optimized for throughput-per-chip and throughput-per-watt rather than for single-thread
> performance.
>
> Taking a wild guess, let's say it's $200 for each cpu, $120 for 16GB DRAM, and $100 for interconnect,
> cpu, system overhead, giving $420 per node. Then put together 10K nodes for about $4M. As another
> wild guess, suppose we get throughput in the same ballpark as a $200-ish GPU - GTX 1060 is around
> 4TFLOPS single-precision.
>
> It won't be good for a wide range of applications. But it will be very cost-effective for a few
> applications. And it doesn't need the huge-DRAM support.
>
> For what I work on at the moment (data analytics) I totally want the small number of nodes
> w/ 1TB+ DRAM. But that's not the best solution for everything.
>
Well, the TB+ of DRAM, if you followed the thread, referred to total system memory; the point was that phone/tablet SoCs don't support ECC on memory, which would be a problem.
As far as memory per node goes, the general ratio you see is about 1 GB of memory per 20 GFLOPS of performance, modulo some baseline memory overhead (i.e. even if you are only doing 1 GFLOP, you probably still need 1 GB of memory...). So a 2 TFLOP (DP) node will likely require ~100 GB of memory or so.
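The rule of thumb above can be sketched in a few lines; the ratio and the 1 GB baseline are the rough figures from this post, not hard requirements:

```python
def node_memory_gb(dp_gflops, gb_per_20_gflops=1.0, baseline_gb=1.0):
    """Rule-of-thumb node memory: ~1 GB per 20 GFLOPS (DP), with a floor."""
    return max(baseline_gb, dp_gflops / 20.0 * gb_per_20_gflops)

print(node_memory_gb(2000))  # 2 TFLOP DP node -> 100.0 GB
print(node_memory_gb(1))     # a tiny node still needs the ~1 GB baseline
```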
Also, the networking at 10k nodes gets very expensive; you aren't going to do that for $100 per node. A 48-port 10G switch will set you back $5k+ easily, and you are going to need a lot of them, a whole lot of them depending on topology. At a minimum, more than 1 for every 48 nodes, and likely more than 2 per 48 nodes. Realistically you are looking at ~210 48-port 10G + 6-port 40G leaf switches, plus another ~100 48-port 40G switches at $10-20k each. The total would be ~$2-3M for the switches alone. Most supercomputers end up spending roughly as much on networking as they do on the nodes.
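A minimal sketch of the arithmetic behind those numbers, using the figures above; the spine-switch count of 100 is taken from the estimate in the text rather than derived from a specific topology:

```python
import math

NODES = 10_000
LEAF_PORTS = 48                    # 10G downlinks per leaf switch
LEAF_COST = 5_000                  # 48p 10G + 6p 40G leaf, $5k+ each
SPINE_COUNT = 100                  # 48p 40G switches (estimate from the post)
SPINE_COST_LOW, SPINE_COST_HIGH = 10_000, 20_000

leaves = math.ceil(NODES / LEAF_PORTS)   # 209, i.e. ~210 leaf switches
total_low = leaves * LEAF_COST + SPINE_COUNT * SPINE_COST_LOW
total_high = leaves * LEAF_COST + SPINE_COUNT * SPINE_COST_HIGH
print(leaves, total_low, total_high)     # roughly $2M to $3M in switches
```

Either way the switch budget lands around $200-300 per node, far above the $100 assumed upthread.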