By: RichardC (tich.delete@this.pobox.com), January 24, 2017 9:34 am
Room: Moderated Discussions
Aaron Spink (aaronspink.delete@this.notearthlink.net) on January 24, 2017 7:20 am wrote:
> Well the TB+ DRAM if you followed the thread was supposed to be total system memory and
> pointing out that phone/tablet SOCs don't have ecc on memory, which would be a problem.
It's definitely nice to have ECC on DRAM, just as it is on servers. And phone SoCs don't
have the interconnect you would want either - so it (almost certainly) isn't going to be
an unmodified phone SoC. But it could be something quite closely related to parts of a phone
SoC or server SoC.
> As far as memory per node goes, general ratios you see are in the 1 GB of memory per 20 GFLOPs of performance
> modulo some baseline memory overhead requirements (aka if you are only doing 1 GFLOP, you probably sill
> need 1GB of memory...) So a 2 TFLOP node (DP) will likely require ~100GB of memory or so.
That's a circular argument. For a machine which has to deal with a wide variety of problems, it's a reasonable ratio. For a more specialized machine, it can be huge overkill. For example,
a GTX 1080 has ~9 TFLOPS single-precision and 8GB, a ratio of 1GB per 1125 GFLOPS, which is
about 56x away from your figure. Or consider dedicated bitcoin-mining rigs, which are a
whole bunch of parallel-hashing ASICs with no DRAM at all (and, in that case, a fairly trivial interconnect such as a single USB link up to a master machine).
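To make the ratio comparison concrete, here's the arithmetic spelled out as a trivial Python sketch (the GPU figures are the rounded ones above, not measured numbers):

baseline_gflops_per_gb = 20.0        # your rule of thumb: 1 GB per 20 GFLOPS
gtx1080_gflops = 9000.0              # ~9 TFLOPS single-precision
gtx1080_gb = 8.0
gtx1080_gflops_per_gb = gtx1080_gflops / gtx1080_gb    # = 1125 GFLOPS per GB
print(gtx1080_gflops_per_gb / baseline_gflops_per_gb)  # ~56x leaner than the rule of thumb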
So you're assuming it's a machine which can handle the same kinds of problems as a big cluster of Xeon-based nodes, and then you're criticizing the ARM-based chips for not having
the same capabilities as Xeons. But this is a more specialized architecture optimized to give better price-performance on a narrow subset of problems. There'd be no point if it was the same - being different is what makes it interesting.
> Also, the networking at 10k nodes get very expensive, you aren't going to do that for $100 per
> node. A 48 port 10G switch will set you back 5k+ easy and you are going to need a lot of them,
> a whole whole lot of them depending on topology.
Again, you're making assumptions about what the network has to look like, and your assumptions are based on being able to run a wide variety of applications, some of which
have a high ratio of communication/compute.
This kind of system would probably look very different. For a start, it would have a large number
of nodes on each board - say 16 or 32 - possibly with some cheap local interconnect
(e.g. PCIe switch chips). PCIe can also go between boards in a rack, within reason.
And maybe you only target applications with a communication/compute ratio low enough that
2 x 10Gbit out of a 4U box, or between racks, is enough.
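As a rough sketch of what "low enough" means - with hypothetical node counts and per-node throughput, since none of that is pinned down here - the sustainable external traffic per FLOP works out like this (Python):

nodes_per_box = 128                   # hypothetical: 4 boards x 32 nodes in a 4U box
gflops_per_node = 100.0               # hypothetical per-node throughput
box_gflops = nodes_per_box * gflops_per_node    # 12800 GFLOPS per box
external_gbytes_per_s = 2 * 10 / 8.0            # 2 x 10Gbit/s is ~2.5 GB/s out of the box
print(external_gbytes_per_s / box_gflops)       # ~0.0002 bytes of external traffic per FLOP

If your application needs much more communication than that per unit of compute, this design doesn't fit it - which is exactly the point about specialization.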
> 100 48p 40g switches at 10-20k per. Total would be ~$2-3M for the switches. Most supercomputers
> end up spending roughly the same on networking as they do on the nodes.
Right, but the point of this is to build a system optimized for particular problems
which don't need everything that a rackful of Xeons gives you.
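Spreading your own switch figure over the nodes makes the point (the totals are taken from your quote; the per-node split is mine, in Python):

nodes = 10000
switch_total_low, switch_total_high = 2000000, 3000000      # "~$2-3M for the switches"
print(switch_total_low / nodes, switch_total_high / nodes)  # $200-300 per node, for switches alone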
*If* you assume that you need an expensive interconnect, then the argument for using
unorthodox high-throughput-per-dollar or high-throughput-per-watt CPUs goes away.
So unorthodox CPUs are appropriate *only* for systems which also have an unorthodox
(cheaper, lower-bandwidth) interconnect *and* cheaper DRAM (fewer GB per TFLOP).
That in turn means it only works well for a subset of applications - but if CFD is what you need and it works for CFD, then it's all good.