By: dmcq (dmcq.delete@this.fano.co.uk), January 24, 2017 11:10 am
Room: Moderated Discussions
RichardC (tich.delete@this.pobox.com) on January 24, 2017 9:34 am wrote:
> Aaron Spink (aaronspink.delete@this.notearthlink.net) on January 24, 2017 7:20 am wrote:
>
> > Well the TB+ DRAM if you followed the thread was supposed to be total system memory and
> > pointing out that phone/tablet SOCs don't have ecc on memory, which would be a problem.
>
> It's definitely nice to have ECC on DRAM, just as it is on servers. And phone SoCs don't
> have the interconnect you would want either - so it (almost certainly) isn't going to be
> an unmodified phone SoC. But it can be something quite closely related to parts of a phone
> SoC/server SoC.
>
> > As far as memory per node goes, general ratios you see are
> > in the 1 GB of memory per 20 GFLOPs of performance
> > modulo some baseline memory overhead requirements (aka if you are only doing 1 GFLOP, you probably still
> > need 1GB of memory...) So a 2 TFLOP node (DP) will likely require ~100GB of memory or so.
>
> That's a circular argument. For a machine which can deal with a wide variety of problems,
> it's reasonable. For a more specialized machine, it can be huge overkill. For example,
> a GTX 1080 has 9 TFLOPS single-precision and 8GB, a ratio of 1GB to 1125 GFLOPS, which is
> about 56x away from your figure. Or consider dedicated bitcoin-mining rigs, which have a
> whole bunch of parallel-hashing ASICs and no DRAM (and in that case a fairly
> trivial interconnect such as a single USB link up to a master machine).
>
> So you're assuming it's a machine which can handle the same kinds of problems as a big cluster
> of Xeon-based nodes, and then you're criticizing the ARM-based chips for not having
> the same capabilities as Xeons. But this is a more specialized architecture optimized
> to give better price-performance on a narrow subset of problems. There'd be no
> point if it was the same - being different is what makes it interesting.
>
> > Also, the networking at 10k nodes gets very expensive; you aren't going to do that for $100 per
> > node. A 48 port 10G switch will set you back 5k+ easy and you are going to need a lot of them,
> > a whole whole lot of them depending on topology.
>
> Again, you're making assumptions about what the network has to look like, and your assumptions
> are based on being able to run a wide variety of applications, some of which
> have a high ratio of communication/compute.
>
> This kind of system would probably look very different: first, it would have a large number
> of nodes on each board, e.g. 16 or 32 nodes, possibly with some cheap local interconnect
> (e.g. PCIe switch chips). PCIe can also go between boards in a rack, within reason.
> But maybe you only target applications with sufficiently low communication/compute that
> 2 x 10Gbit out of a 4U box, or between racks, is enough.
>
> > 100 48p 40g switches at 10-20k per. Total would be ~$2-3M for the switches. Most supercomputers
> > end up spending roughly the same on networking as they do on the nodes.
>
> Right, but the point of this is to build a system optimized for particular problems
> which don't need everything that a rackful-of-Xeon's give you.
>
> *If* you assume that you need an expensive interconnect, then the argument for using
> unorthodox high-throughput-per-dollar or high-throughput-per-watt cpu's goes away.
> So unorthodox cpu's are appropriate *only* for systems which also have an unorthodox
> (cheaper, lower-bandwidth) interconnect *and* cheaper (fewer GB per TFLOPS) DRAM.
> Which in turn means that it only works well for a subset of applications,
> but if CFD is what you need and it works for CFD, then it's all good.
I think Fujitsu is targeting a pretty general-purpose supercomputer design with its Post-K CPU: basically something huge, but flexible enough to deal with all sorts of different HPC problems.
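
For what it's worth, the back-of-envelope figures being argued over above are easy to replay. The sketch below (Python, purely illustrative) just re-runs the quoted rules of thumb: Aaron's 1 GB per 20 GFLOPS, the GTX 1080's 9 TFLOPS over 8 GB, and a naive one-network-port-per-node switch count at the quoted $10-20k per 48-port switch. It ignores the spine layer and any node-side port aggregation (which is exactly what the PCIe-switch idea above would change), so treat the last figure as a crude order-of-magnitude check, not a network design.

import math

# Rules of thumb quoted in the thread, not measurements of any real machine.
GFLOPS_PER_GB_RULE = 20.0            # 1 GB of DRAM per 20 GFLOPS

# A 2 TFLOP (DP) node under that rule:
node_gflops = 2000.0
node_mem_gb = node_gflops / GFLOPS_PER_GB_RULE
print(f"2 TFLOP node -> ~{node_mem_gb:.0f} GB DRAM")           # ~100 GB

# GTX 1080 (9 TFLOPS single precision, 8 GB) for comparison:
gtx_ratio = 9000.0 / 8.0                                        # 1125 GFLOPS per GB
print(f"GTX 1080: {gtx_ratio:.0f} GFLOPS/GB, "
      f"~{gtx_ratio / GFLOPS_PER_GB_RULE:.0f}x the rule of thumb")

# Crude edge-switch count for 10k nodes with one port per node, ignoring
# the spine layer and any oversubscription:
nodes, ports_per_switch = 10_000, 48
edge_switches = math.ceil(nodes / ports_per_switch)             # 209
print(f"{edge_switches} edge switches -> "
      f"${edge_switches * 10_000 / 1e6:.1f}M to ${edge_switches * 20_000 / 1e6:.1f}M")

Even this naive count lands in the low millions of dollars for the flat network, which is the point both sides seem to agree on; the disagreement is whether the target applications let you avoid buying it.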