By: Aaron Spink (aaronspink.delete@this.notearthlink.net), January 24, 2017 12:01 pm
Room: Moderated Discussions
RichardC (tich.delete@this.pobox.com) on January 24, 2017 9:34 am wrote:
> Aaron Spink (aaronspink.delete@this.notearthlink.net) on January 24, 2017 7:20 am wrote:
>
> > Well the TB+ DRAM if you followed the thread was supposed to be total system memory and
> > pointing out that phone/tablet SOCs don't have ecc on memory, which would be a problem.
>
> It's definitely nice to have ECC on DRAM, just as it is on servers. And phone SoCs don't
> have the interconnect you would want either - so it (almost certainly) isn't going to be
> an unmodified phone SoC. But it can be something quite closely related to parts of a phone
> SoC/server SoC.
>
If it isn't a phone/tablet SoC then it has no shared costs with them and will cost as much as any Xeon, if not more. Even the high-end Xeons come off a die with 1M+ unit volume.
> > As far as memory per node goes, general ratios you see are
> > in the 1 GB of memory per 20 GFLOPs of performance
> > modulo some baseline memory overhead requirements (aka if you are only doing 1 GFLOP, you probably still
> > need 1GB of memory...) So a 2 TFLOP node (DP) will likely require ~100GB of memory or so.
>
> That's a circular argument. For a machine which can deal with a wide variety of problems,
> it's reasonable. For a more specialized machine, it can be huge overkill. For example,
> a GTX 1080 w/ 9TFLOPS single-precision and 8GB, for a ratio of 1GB to 1125GFLOPS, which is
> about 56x away from your figure. Or consider dedicated bitcoin-mining rigs, which have a
> whole bunch of parallel-hashing ASICs and no DRAM (and in that case a fairly
> trivial interconnect such as a single USB link up to a master machine).
>
And those computers using Tesla P100s (not 1080s, which lack ECC and have poor DP) are connected to CPUs with 100s of GB of DRAM. They are constantly streaming data in and out of the local memory.
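
For what it's worth, here is the arithmetic behind both memory ratios being argued about, as a quick Python sketch. The 20 GFLOPS-per-GB rule of thumb and the GTX 1080 figures are just the numbers quoted above; nothing here is a measurement.

# Rough check of the memory-per-FLOPS ratios quoted in this thread.
BASELINE_GFLOPS_PER_GB = 20           # ~1 GB per 20 GFLOPS rule of thumb

node_dp_tflops = 2.0                  # hypothetical 2 TFLOP (DP) node
node_mem_gb = node_dp_tflops * 1000 / BASELINE_GFLOPS_PER_GB
print(f"2 TFLOP node -> ~{node_mem_gb:.0f} GB under the rule of thumb")   # ~100 GB

gtx1080_sp_gflops = 9000              # ~9 TFLOPS single precision
gtx1080_mem_gb = 8
gtx1080_ratio = gtx1080_sp_gflops / gtx1080_mem_gb    # ~1125 GFLOPS per GB
print(f"GTX 1080: ~{gtx1080_ratio:.0f} GFLOPS/GB, "
      f"~{gtx1080_ratio / BASELINE_GFLOPS_PER_GB:.0f}x leaner than the rule of thumb")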
> So you're assuming it's a machine which can handle the same kinds of problems as a big cluster
> of Xeon-based nodes, and then you're criticizing the ARM-based chips for not having
> the same capabilities as Xeons. But this is a more specialized architecture optimized
> to give better price-performance on a narrow subset of problems. There'd be no
> point if it was the same - being different is what makes it interesting.
>
Being different is what makes it extremely niche, with low volume. That's not the market you want to try to make money in, not when you are competing against full-featured Xeons, GPUs, and Xeon Phi. I highly doubt the new Mont Blanc machine is going to skimp on memory.
> > Also, the networking at 10k nodes gets very expensive, you aren't going to do that for $100 per
> > node. A 48-port 10G switch will set you back 5k+ easy and you are going to need a lot of them,
> > a whole whole lot of them depending on topology.
>
> Again, you're making assumptions about what the network has to look like, and your assumptions
> are based on being able to run a wide variety of applications, some of which
> have a high ratio of communication/compute.
>
> This kind of system would probably look very different: first, it would have a large number
> of nodes on each board, e.g. 16 or 32 nodes, possibly with some cheap local interconnect
> (e.g. PCIe switch chips). PCIe can also go between boards in a rack, within reason.
> But maybe you only target applications with sufficiently low communication/compute that
> 2 x 10Gbit out of a 4U box, or between racks, is enough.
>
That's a vanishingly small subset of applications with communication requirements that low. Outside of crypto mining, you are unlikely to ever see it.
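
To put rough numbers on the switch-cost point above, here is a back-of-the-envelope sketch. It assumes a classic 3-tier fat tree with full bisection bandwidth (that topology choice is my assumption); the $5k-per-48-port-switch figure is the one from earlier in the thread.

# Rough switch cost for a 10k-node cluster on 48-port 10G switches.
k = 48                      # ports per switch
nodes = 10_000
switch_cost = 5_000         # USD per switch, figure from the thread

# k-port 3-tier fat tree: k^3/4 hosts max, 5k^2/4 switches, i.e. 5/k switches per host
switches = nodes * 5 / k
total = switches * switch_cost

print(f"~{switches:.0f} switches, ~${total/1e6:.1f}M total, "
      f"~${total/nodes:.0f}/node before NICs, cables, and optics")
# -> roughly $500+ per node in switches alone, well above $100/node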
> Right, but the point of this is to build a system optimized for particular problems
> which don't need everything that a rackful-of-Xeon's give you.
>
> *If* you assume that you need an expensive interconnect, then the argument for using
> unorthodox high-throughput-per-dollar or high-throughput-per-watt cpu's goes away.
> So unorthodox cpu's are appropriate *only* for systems which also have an unorthodox
> (cheaper, lower-bandwidth) interconnect *and* cheaper (fewer GB per TFLOPS) DRAM.
> Which in turn means that it only works well for a subset of applications,
> but if CFD is what you need and it works for CFD, then it's all good.
>
So basically, you want to build a pure Linpack machine. Even CFD requires more communication than that. For something like CFD, your comm requirements go up as the work per node shrinks, because you need to exchange more boundaries more often. The reality is that most supers are already network limited for a large part of the application space. Linpack tends to be pretty much an absolute best case because its communication needs are lower than those of almost any other application.

There is a reason something like Xeon Phi is available with 200Gb/s per node. The trend is that more and more problems are running into severe comm bottlenecks, and that is what's driving the next round of supers into the 400Gb/s per node range. Hell, 10Gb/s isn't even enough for cloud providers these days; they are all moving, or have moved, to 25/40/50/100Gb/s.
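
As a minimal sketch of the boundary-exchange point, assuming a simple 3D stencil-style decomposition with made-up per-cell costs (illustrative only, not measurements): compute scales with the subdomain volume while halo traffic scales with its surface, so bytes moved per flop rise as the per-node subdomain shrinks.

# Why comm/compute rises as work per node shrinks in stencil/CFD-style codes.
def comm_to_compute(cells_per_side, bytes_per_cell=8, flops_per_cell=100):
    """Ratio of halo bytes exchanged to flops done per step on an n^3 subdomain."""
    n = cells_per_side
    compute = flops_per_cell * n**3          # volume term
    halo_bytes = bytes_per_cell * 6 * n**2   # 6 faces, one cell deep
    return halo_bytes / compute

for n in (256, 128, 64, 32):
    print(f"{n}^3 cells/node: {comm_to_compute(n):.4f} bytes exchanged per flop")
# Halving the subdomain edge doubles the bytes-per-flop: smaller nodes need
# proportionally more network bandwidth, not less.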