By: Iain McClatchie (iain-rwt.delete@this.mcclatchie.com), August 6, 2012 2:09 pm
Room: Moderated Discussions
Micron has a very nice spreadsheet online for figuring out DDR3L power dissipation. Restricting this to DDR3L may seem unreasonable, but DDR3 is no longer interesting for new designs: the voltage is just too high to get good performance.
You need to get specific power numbers off a data sheet, such as the DDR3L 4Gb one here:
http://media.digikey.com/PDF/Data%20Sheets/Micron%20Technology%20Inc%20PDFs/MT41K_1G4,512M8.pdf
Plug those numbers into the spreadsheet (why they don't ship it with accurate numbers in the first place, I don't know). Then tweak the DDR3 Config (4 Gb, x4, -125, Fast PD exit) and System Config numbers (1.35 V, 800 MHz, burst 8, etc.).
My first point: two-rank SDRAM systems are crazy. Take a look over on the right hand side of the System Config tab. It's got a nice diagram for termination power in the read and write cases.
In the case of reading from the DRAM, the responding DRAM burns 3.1 mW/pin, with 1.4 mW/pin in its termination resistor (I'm not sure whether those two add). The CPU termination burns 8.3 mW/pin. But the "passive" DRAM's termination burns 23.6 mW/pin.
In the case of writing to the DRAM, the CPU burns 5.9 mW/pin driving the data, the receiving DRAM burns 16.5 mW/pin in its termination, and the "passive" DRAM still burns 23.6 mW/pin.
Clearly, the "passive" DRAM in a two-rank system is a power hog. If we keep bandwidth and capacity the same and compare a single-rank and a two-rank system, the single-rank system is way better:
576 bits wide = 144 packages in 2 ranks * 8 bits, 4 Gb/package = 64 GB, 38.0 W
576 bits wide = 144 packages in 1 rank * 4 bits, 4 Gb/package = 64 GB, 31.3 W
The advantage of the first system is only configurability: you can install 32 GB at first, with the same bandwidth, and then add another 32 GB later.
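To make the comparison concrete, here is a rough back-of-envelope in Python using the per-pin numbers quoted above. It only adds up I/O and termination power, and it assumes 100% bus utilization, so it won't reproduce the 38.0 W / 31.3 W spreadsheet totals (those also fold in DRAM core, refresh, and actual utilization); it just shows where the I/O power goes.

# Back-of-envelope I/O power from the per-pin figures above (mW/pin),
# assuming the data bus is 100% utilized. Not the spreadsheet's method,
# just a sketch of the relative sizes.
PINS = 576  # data pins, including ECC

read_2rank  = 3.1 + 1.4 + 8.3 + 23.6  # active DRAM + its term (possibly double-counted,
                                      # per the caveat above) + CPU term + passive rank
read_1rank  = 3.1 + 1.4 + 8.3         # no passive rank burning power
write_2rank = 5.9 + 16.5 + 23.6       # CPU driver + receiving DRAM term + passive rank
write_1rank = 5.9 + 16.5

for name, two, one in [("read", read_2rank, read_1rank),
                       ("write", write_2rank, write_1rank)]:
    print(f"{name}: 2-rank {two*PINS/1000:.1f} W vs 1-rank {one*PINS/1000:.1f} W")

In both directions the ~23.6 mW/pin burned in the idle rank is the single largest term.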
Second question: how to deal with server configs.
I'll note that people don't put the same CPU SKUs into personal machines that they put into server machines. Intel and AMD would like to differentiate these SKUs so they can get server folks to pay more. So they can arrange for the server SKUs to come with more memory per CPU package, and be priced higher.
I'm not really sure what the market is like for low-core-count, high-memory systems. I'd argue that if I want 256 GB in a system, it seems quite reasonable to put four processor packages in there. So you'd still have motherboards with 1, 2, 4, and 8 processor slots; it's just that each slot would now populate CPU and memory together. No difference logically, and the electrical environment would be quite similar, just with longer traces between the CPU pins and the socket/slot impedance ripple.
Still, if folks want to put 1 TB into a four-socket system, then the CPU will have to be packaged onto its board with buffered memory, just as is done now. It will not be reasonable to put 576 DRAM packages onto a single board with the CPU, so the CPU assembly will have to be a sandwich of 2 or 4 boards, only one of which carries the CPU and plugs into the socket. This is going to drive up the capacitance on the CPU's pins, which drives power dissipation up and performance down, just as it does today. I think it would probably be more useful for AMD/Intel to put these CPUs into a 4000-pin FBGA and stick with unbuffered memory and 2000-pin buses. Note that the CPU vendor gets to make that decision without affecting the socket interface. And users will decide whether they'd rather have that configuration, or just go with more CPU sockets in the system.
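For reference, here is presumably where that 576-packages-per-socket figure comes from. This is my reconstruction, assuming 4 Gb packages and one ECC package per eight data packages:

# Reconstructing the 576 DRAM packages per socket (my assumption:
# 4 Gb packages, ECC overhead of one package per eight).
per_socket_GB = 1024 / 4        # 1 TB across four sockets = 256 GB each
package_GB    = 4 / 8           # a 4 Gb DRAM package holds 0.5 GB
data_packages = per_socket_GB / package_GB   # 512
ecc_packages  = data_packages / 8            # 64
print(int(data_packages + ecc_packages))     # -> 576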
Third question: does GDDR5 burn more power in termination in order to go faster?
Probably a bit more, but not vastly more. Transmission lines on PC boards tend to be around 50 ohms, and there's no huge advantage in getting away from that impedance (40 Gb/s electrical links run on 50 ohm lines, or rather between coupled pairs of such lines). You want to terminate the line with a 50 ohm termination, so that the incident wave just disappears into the termination and does not come back to haunt future bits. That 50 ohm termination can be as simple as 100 ohms to Vdd and 100 ohms to Vss, which then acts like a 200 ohm dead short across the supply and dissipates about 9 mW per pin at 1.35 V. Or you can do something more clever: an actively regulated Vdd/2 rail and a single 50 ohm resistor to it. On average the highs and lows on the bus should cancel, and you end up with a tradeoff between the capacitance on that rail and the regulation current required. In practice, these regulators tend to burn a disconcertingly large amount of current, and I'm really not sure why. I'm sure they'll get better over time, and maybe there are already good ones with which I'm not yet familiar.
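As a quick sanity check on that 9 mW figure, here is the arithmetic for the simple Thevenin termination, assuming the 1.35 V DDR3L supply used elsewhere in this post:

# 100 ohms to Vdd plus 100 ohms to Vss is 200 ohms straight across the
# supply, so it burns V^2/R regardless of the data on the line.
vdd = 1.35                     # volts (DDR3L supply assumed)
r_total = 100 + 100            # ohms, the two legs in series
p_per_pin = vdd**2 / r_total   # watts per terminated pin
print(f"{p_per_pin*1e3:.1f} mW/pin")   # -> ~9.1 mW/pin

Nothing in that expression depends on the data rate, which is the point of the bottom line below.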
The bottom line is that termination power should scale with the number of pins and not pin speed.
-Iain