By: David Kanter (dkanter.delete@this.realworldtech.com), August 4, 2012 11:22 am
Room: Moderated Discussions
Iain McClatchie (iain-rwt.delete@this.mcclatchie.com) on August 3, 2012 4:35 pm wrote:
> One of the big differences between CPUs and GPUs to me is their physical memory
> architecture.
>
> CPU physical memory architecture:
> CPUs come in an FBGA which
> you mount onto a motherboard with a nonsoldered really expensive socket. The
> DRAM for this system comes in FBGAs which are soldered to DIMMs, which then
> connect to the motherboard via the DIMM socket. It's usually possible to load
> two DIMMs per memory channel, and the CPU provides 1 clock pair per 8 DQs, and
> the CPU knows how to deal with registers between the CPU outputs and the DRAM
> chips. The pin data rate is something like 1 Gb/s/pin.
>
> This is good for
> configuring the system memory after the motherboard has been soldered together.
> This is bad for memory power dissipation (DQs are actively terminated and
> terminating 2 DRAM drops per CPU DQ pin consumes really large amounts of
> power).
Do you have any idea how much power DQ termination uses?
> GPU physical memory architecture:
> GPU comes in an FBGA which is
> soldered to the same board as the DRAM FBGAs. DQs are point-to-point with just
> two solder balls near the ends of the line. GPU provides 1 clock pair per 16
> DQs. The pin data rate is something like 4 Gb/s/pin.
>
> This is good for high
> bandwidth and low power, but it means you configure the memory when you solder
> everything down.
I've been told by folks who design both DDR3 and GDDR5 memory controllers that the latter is noticeably more power efficient when measured by pJ/bit. I suspect it is for many of the reasons you have outlined.
Of course, the catch is that GDDR5 latency is pretty awful. Part of that is architectural, but I suspect part is related to the things that make GDDR5 so energy efficient.
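To make the pJ/bit metric concrete, here is a rough sketch of how energy per bit turns into interface power. The pin rates are the approximate figures quoted above. The pJ/bit value and the 64-pin channel width are placeholders I made up for illustration, not measurements, and `io_power_watts` is just a hypothetical helper name.

```python
def io_power_watts(pj_per_bit: float, gbit_per_s_per_pin: float, data_pins: int) -> float:
    """Interface power in watts for a given energy cost per transferred bit."""
    bits_per_second = gbit_per_s_per_pin * 1e9 * data_pins
    return pj_per_bit * 1e-12 * bits_per_second


# Placeholder energy cost: NOT a measured number for DDR3 or GDDR5.
PLACEHOLDER_PJ_PER_BIT = 10.0
# Assumed channel width, chosen only to have something to multiply by.
ASSUMED_DATA_PINS = 64

# Pin rates are the rough "something like" figures from the quoted post.
for label, rate in (("DDR3-like", 1.0), ("GDDR5-like", 4.0)):
    watts = io_power_watts(PLACEHOLDER_PJ_PER_BIT, rate, ASSUMED_DATA_PINS)
    print(f"{label}: {watts:.2f} W at {rate} Gb/s/pin over {ASSUMED_DATA_PINS} pins")
```

At equal pJ/bit the faster interface burns proportionally more watts simply because it moves more bits, which is why the comparison only makes sense per bit rather than per pin or per channel.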
> My proposal:
> For many years now, it has seemed to me that
> CPUs should be sold as GPUs are sold, soldered onto little boards with their
> DRAM, with one-to-one data pins between CPU and DRAM. Any given CPU core/speed
> might be offered with 2 different memory loads. For example, you might be able
> to buy a 3 GHz, 4 core CPU with 8 GB or 16 GB of DRAM as a unit. This would
> double the number of SKUs shipped by CPU board manufacturers. The CPU board
> would plug into the motherboard as GPUs do now, and it's conceivable that you
> might be able to select between plugging in CPUs and GPUs.
>
> A 16 GB load of 2
> Gb chips is 72 DRAM chips, which can be implemented one-to-one with x8 DRAMs and
> 576 data pins. Obviously some of the lines will have to be somewhat long (7-8
> cm?), but I don't think that requires active termination. 32 GB/CPU package and
> larger configs would require x4 chips and buffering, and perhaps chip stacking
> for the really large memory loads.
>
> My guess is that GPUs (and their memories)
> burn much less IO power per unit of data bandwidth than CPUs. This proposal would
> bring CPUs up to par, and eliminate most of the expensive CPU and DIMM
> connections in the system, increasing system reliability and decreasing
> cost.
>
> From a business point-of-view, the combined product encapsulates quite
> a bit more of the high-cost portion of the system. It would lead to a big
> shakeup as DIMM and motherboard manufacturers duke it out to see who ends up
> being good at shipping a high-cost commodity with price-volatile components on
> it.
I think the other question is how you would handle servers. I think you can argue that the configuration options in modern systems are excessive and that sacrificing a bit of flexibility to improve cost/power is reasonable. But would that kind of memory arrangement scale to servers where you want ~1TB of memory in the near future?
Or perhaps the real issue is that we need 3 different types of 'memory', one optimized for latency, one for bandwidth and one for capacity.
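For a sense of scale, here is a back-of-the-envelope sketch of the chip and pin counts. It reproduces the 72-chip, 576-pin figure from the quoted proposal and extends the same arithmetic to ~1TB. I am assuming the 72 comes from 64 data chips plus 8 for ECC (72 bits stored per 64 data bits); that reading, and the helper names, are mine rather than anything stated in the proposal.

```python
def dram_chips(capacity_gbyte: int, chip_gbit: int = 2, ecc: bool = True) -> int:
    """Number of DRAM chips needed for a given capacity."""
    data_chips = capacity_gbyte * 8 // chip_gbit
    # Assumption: ECC stores 72 bits for every 64 data bits.
    return data_chips * 72 // 64 if ecc else data_chips


def data_pins(chips: int, chip_width: int = 8) -> int:
    """Point-to-point data pins if every DRAM DQ gets its own CPU pin."""
    return chips * chip_width


for capacity in (16, 1024):  # GB; 1024 GB stands in for "~1TB"
    chips = dram_chips(capacity)
    print(f"{capacity} GB: {chips} chips, "
          f"{data_pins(chips, 8)} pins with x8, {data_pins(chips, 4)} pins with x4")
```

With 2 Gb parts the 1TB case lands in the thousands of chips, so one-to-one wiring is clearly out and you are back to buffering, stacking, or denser devices.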
David