By: forestlaughing (forestlaughing.delete@this.yahoo.com), October 15, 2012 8:57 am
Room: Moderated Discussions
> One of the things i
> really don't like in HPC since the whole "clusters take over vector processors"
> has taken place is that even the top machines on top500 are stuck with commodity
> DRAM+hardware managed caches. Everybody is complaining about the memory wall but
> something like RLDRAM-3 (AFAIK) has never been used outside networking
> equipment.
> Disregarding the physical implementation seems that a Cray T90 was a
> much better and balanced architecture for HPC.
The T90 was very balanced, given the state of technology at the time. It's not true that you could scale that architecture up to today and get that same level of balance, even for $50 million per system. The T90 was very fast at compute, and had enough memory bandwidth to sustain full triadic operation. The SRAM memory also had very low latency FOR THAT ERA, and a lot of the remaining latency was hidden by vector loads. Scaling that design to today's technology simply doesn't work. There isn't enough space around a device, nor enough pins on the package, to put the hundreds of memory channels on each processor. That technology was very good in an era when a processor filled up an entire board, and memory was on boards several inches away from the processor board.
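To put a rough number on what "sustaining full triadic operation" asks of the memory system, here is a minimal sketch. It assumes the usual triad kernel a[i] = b[i] + s*c[i] on 64-bit words, counts only the two loads and one store per element, and ignores caches and write-allocate traffic. The function names and the peak rates are made up for illustration; they are not T90 or X1 figures.

```python
def triad_bytes_per_flop(word_bytes: int = 8) -> float:
    """Memory traffic per flop for the triad a[i] = b[i] + s * c[i]."""
    words_moved = 3  # load b[i], load c[i], store a[i]
    flops = 2        # one multiply, one add
    return words_moved * word_bytes / flops


def required_bandwidth(peak_flops: float, word_bytes: int = 8) -> float:
    """Bytes per second of memory bandwidth needed to run triad at peak_flops."""
    return peak_flops * triad_bytes_per_flop(word_bytes)


if __name__ == "__main__":
    # Hypothetical peak rates, chosen only to show how the requirement scales.
    for peak in (1e9, 100e9):
        gb_per_s = required_bandwidth(peak) / 1e9
        print(f"{peak / 1e9:.0f} GFLOP/s peak needs {gb_per_s:.0f} GB/s for triad")
```

Under these assumptions the requirement is 12 bytes of memory traffic per flop, so the bandwidth needed grows in lockstep with peak compute. That is the part that runs into the pin and board-space limits described above.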
Cray's successor, the X1, shows the trend away from such a design. Instead of a CPU on a board, it was able to pack a CPU onto a multi-chip module, which included four high-speed cache chips. The rest of the memory was high-bandwidth, high-latency Rambus memory, and that was not enough to support triadic operation without the cache. The X1 was also a NUMA design, rather than a big flat SMP.
Even custom vector machines have converged into slight variations of what the server processors offer. The advantages of pulling functionality into a single chip are huge. The one thing that can't yet be pulled into the CPU chip is main memory, at least not in sizes large enough to be super useful. Perhaps chip stacking + TSVs will make this possible.



