By: dmcq (dmcq.delete@this.fano.co.uk), January 24, 2017 8:44 am
Room: Moderated Discussions
Aaron Spink (aaronspink.delete@this.notearthlink.net) on January 24, 2017 7:20 am wrote:
> RichardC (tich.delete@this.pobox.com) on January 24, 2017 5:50 am wrote:
> > Aaron Spink (aaronspink.delete@this.notearthlink.net) on January 23, 2017 7:02 pm wrote:
> > > That's great, what about the TB+ of main memory?
> >
> > At current prices of about $7/GB, 1024GB of DRAM is about $7K. In a supercomputer targeting
> > flock-of-chickens parallelizable apps, you want a large fraction of total system cost to be going
> > into cpus rather than DRAM. Let's suppose the cost ratio is 1:1. If the cpu is in the smartphone-SoC
> > range, say $30, then it would be matched with about 4GB DRAM; if it's in the desktop-CPU range, say
> > $250, then it would be matched with about 32GB DRAM; if it's around $1000, match it with 128GB.
> >
> > My guess is that what makes sense is to target problems with high cpu requirements, but relatively
> > low DRAM and low communication: if over 50% of your cost is in the DRAM, then the question of whether
> > you have ARM cores or x86 cores is down in the noise - you're basically buying DRAM. So I'm thinking
> > the sweet spot for an ARM-based system is probably with cpu chips in the desktop-cpu range of
> > $250 (with area and transistor count which give high yield), matched with relatively little DRAM,
> > e.g. 8GB or 16GB. For some problems maybe 4GB is enough.
> >
> > So you probably have something that looks a lot like a bunch of desktop PCs on a network, but
> > presumably packaged densely into a rack, with a decent interconnect, and with a multicore-CPU + GPGPU
> > combination optimized for throughput-per-chip and throughput-per-watt rather than for single-thread
> > performance.
> >
> > Taking a wild guess, let's say it's $200 for each cpu, $120 for 16GB DRAM, and $100 for interconnect,
> > cpu, system overhead, giving $420 per node. Then put together 10K nodes for about $4M. As another
> > wild guess, suppose we get throughput in the same ballpark as a $200-ish GPU - GTX 1060 is around
> > 4TFLOPS single-precision.
> >
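To make that per-node arithmetic concrete, the guessed figures above work out as follows. This is a sketch only; every number is the wild guess from the quote, not a real price:

```python
# Per-node cost model using the guessed figures quoted above.
cpu_cost = 200           # $ per CPU (wild guess from the quote)
dram_cost = 120          # $ for 16GB DRAM, i.e. roughly $7.50/GB
network_overhead = 100   # $ per-node share of interconnect + system overhead

node_cost = cpu_cost + dram_cost + network_overhead
system_cost = node_cost * 10_000   # 10K nodes

print(node_cost)     # 420
print(system_cost)   # 4200000, i.e. the "about $4M" in the quote
```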
> > It won't be good for a wide range of applications. But it will be very cost-effective for a few
> > applications. And it doesn't need the huge-DRAM support.
> >
> > For what I work on at the moment (data analytics) I totally want the small number of nodes
> > w/ 1TB+ DRAM. But that's not the best solution for everything.
> >
> Well the TB+ DRAM if you followed the thread was supposed to be total system memory and
> pointing out that phone/tablet SOCs don't have ecc on memory, which would be a problem.
>
> As far as memory per node goes, the general ratio you see is around 1 GB of memory per 20 GFLOPs of performance,
> modulo some baseline memory overhead requirements (aka if you are only doing 1 GFLOP, you probably still
> need 1GB of memory...). So a 2 TFLOP node (DP) will likely require ~100GB of memory or so.
>
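That rule of thumb is easy to sanity-check. A minimal sketch, taking the quoted 1 GB per 20 GFLOPS ratio and the ~1 GB baseline floor as given:

```python
def node_memory_gb(gflops_dp: float, floor_gb: float = 1.0) -> float:
    """Estimate node DRAM from the quoted rule of thumb:
    ~1 GB per 20 double-precision GFLOPS, with a baseline floor."""
    return max(floor_gb, gflops_dp / 20.0)

print(node_memory_gb(2000.0))  # 2 TFLOP (DP) node -> 100.0 GB, as in the quote
print(node_memory_gb(1.0))     # a tiny node still needs the ~1 GB baseline
```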
> Also, the networking at 10k nodes gets very expensive; you aren't going to do that for $100 per
> node. A 48 port 10G switch will set you back $5k+ easy, and you are going to need a lot of them,
> a whole lot of them depending on topology. At a minimum, >1 for every 48 nodes, and likely
> >2 per 48 nodes. Realistically you are looking at ~210 48p 10G + 6p 40G switches and then another
> 100 48p 40G switches at $10-20k per. Total would be ~$2-3M for the switches. Most supercomputers
> end up spending roughly the same on networking as they do on the nodes.
>
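Those switch counts are roughly what a two-tier topology gives you. A back-of-envelope sketch using the quoted port counts and price ranges (the prices are the quote's rough figures, not real hardware quotes):

```python
import math

nodes = 10_000
leaf_ports = 48   # 48p 10G leaf switches, per the quote

leaf_switches = math.ceil(nodes / leaf_ports)   # 209; the quote rounds to ~210
leaf_cost = leaf_switches * 5_000               # at the quoted "$5k+ easy"

spine_switches = 100                            # quoted count of 48p 40G spines
spine_cost_low = spine_switches * 10_000        # low end of the quoted $10-20k
spine_cost_high = spine_switches * 20_000       # high end

total_low = leaf_cost + spine_cost_low          # ~$2.0M
total_high = leaf_cost + spine_cost_high        # ~$3.0M, matching the ~$2-3M
print(leaf_switches, total_low, total_high)
```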
I'm not sure what all this discussion about using smartphone SoCs in a supercomputer is about. I can't see why anyone would do that except in some test or prototype environment, or to get themselves on YouTube for doing something wacky.