Direct Rambus DRAM, Part 2 – Operation and Performance


What Does Higher Latency Do to Performance?
A Simple Model

The big question computer buyers want answered is how the choice of SDRAM vs. DRDRAM memory affects performance. To help answer it, I created a simple performance model built around a hypothetical 800 MHz CPU with an architectural average clocks per instruction (CPI) figure of 0.5; that is, a CPU that averages two instructions per clock cycle when running strictly out of its L1 caches. This CPU has 32 Kbyte L1 caches and either a 256 Kbyte or 128 Kbyte on-chip L2 cache with a 6 clock cycle latency. My spreadsheet model is summarized in the table below:

| Memory Type | Bus Freq (MHz) | On-chip L2 (KB) | Avg DRAM Access (CPU clks) | Avg Mem Access (CPU clks) | Average CPI | Average MIPS | Average DRAM BW (MB/s) | Normalized Performance (PC100 = 1.0) |
|---|---|---|---|---|---|---|---|---|
| PC100 SDRAM (CL 2) | 100 | 256 | 73.8 | 0.505 | 0.753 | 1063 | 109 | 1.00 |
| PC100 SDRAM (CL 2) | 100 | 128 | 73.8 | 0.627 | 0.814 | 983 | 138 | 1.00 |
| PC133 SDRAM (CL 3) | 133 | 256 | 67.2 | 0.474 | 0.737 | 1086 | 111 | 1.02 |
| PC133 SDRAM (CL 3) | 133 | 128 | 67.2 | 0.584 | 0.792 | 1010 | 142 | 1.03 |
| PC133 SDRAM (CL 2) | 133 | 256 | 61.2 | 0.445 | 0.722 | 1107 | 113 | 1.04 |
| PC133 SDRAM (CL 2) | 133 | 128 | 61.2 | 0.544 | 0.772 | 1036 | 146 | 1.05 |
| DRDRAM 800 (4 devices) | 133 | 256 | 85.2 | 0.560 | 0.780 | 1026 | 105 | 0.97 |
| DRDRAM 800 (4 devices) | 133 | 128 | 85.2 | 0.702 | 0.851 | 940 | 132 | 0.96 |
| DRDRAM 800 (32 devices) | 133 | 256 | 91.2 | 0.589 | 0.794 | 1007 | 103 | 0.95 |
| DRDRAM 800 (32 devices) | 133 | 128 | 91.2 | 0.742 | 0.871 | 918 | 129 | 0.93 |

In my model the L1 hit ratio is 97%, the L2 hit ratio is 84% and 78% for the 256 Kbyte and 128 Kbyte caches respectively, and the main memory page hit ratio is 55%. These hit ratios are taken from a 1998 presentation by Forrest Norrod, senior director at Cyrix Corp., entitled “The Future of CPU Bus Architectures – A Cyrix Perspective”. The column marked ‘Avg DRAM Access’ is the average critical-word-first latency in CPU clocks, plus 6 cycles for the L2 miss and an extra cycle for data forwarding; the average is weighted over 55% page hits, 22.5% row hits, and 22.5% page misses. The column labeled ‘Avg Mem Access’ is the average number of clocks per instruction spent beyond the L1: the L1 miss rate multiplied by the average cost of an L1 miss, which is the 6 cycle L2 latency on an L2 hit or the average DRAM access on an L2 miss. The average CPI is calculated by adding 50% of the average memory access time (since about half of x86 instructions perform a data memory access) to the base architectural figure of 0.5 CPI. The average MIPS is calculated by dividing 800 MHz by the average CPI figure. The average DRAM BW figure is derived as the product of MIPS × 50% data accesses × L1 miss rate × L2 miss rate × 32 bytes per cache line × 1.33 (66% reads, 33% writes with a write miss allocate policy selected).
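For readers who want to play with the numbers, here is a minimal sketch of the model in Python. The constants come straight from the description above; the function and variable names are my own, and the per-configuration average DRAM access figures are taken as inputs from the table, since their page-hit/row-hit/page-miss components are not broken out here.

```python
# Constants from the model description above.
CPU_MHZ = 800          # hypothetical CPU clock, MHz
BASE_CPI = 0.5         # architectural CPI running out of the L1 caches
L1_MISS = 0.03         # 97% L1 hit ratio
L2_LATENCY = 6         # on-chip L2 latency, CPU clocks
DATA_FRAC = 0.5        # ~half of x86 instructions access data memory
LINE_BYTES = 32        # cache line size in bytes
RW_FACTOR = 1.33       # 66% reads / 33% writes, write-miss-allocate

def model(avg_dram_clks, l2_hit):
    """Reproduce one table row from the average DRAM access latency
    (critical-word-first, including L2 miss and forwarding cycles)
    and the L2 hit ratio (0.84 for 256 KB, 0.78 for 128 KB)."""
    l2_miss = 1.0 - l2_hit
    # Average clocks per instruction spent beyond the L1 caches.
    avg_mem = L1_MISS * (l2_hit * L2_LATENCY + l2_miss * avg_dram_clks)
    cpi = BASE_CPI + DATA_FRAC * avg_mem
    mips = CPU_MHZ / cpi
    # Sustained DRAM bandwidth in MB/s.
    bw = mips * DATA_FRAC * L1_MISS * l2_miss * LINE_BYTES * RW_FACTOR
    return avg_mem, cpi, mips, bw

# PC100 SDRAM (CL 2) with a 256 KB L2 -- matches the first table row:
# avg_mem ~0.505, CPI ~0.753, ~1063 MIPS, ~109 MB/s
print(model(73.8, 0.84))
```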

Although this simplistic model ignores many second-order effects (out-of-order execution, I-cache misses, hit-under-miss caches, read-to-write and write-to-read turnaround effects, refresh, DMA interference, etc.), it is still useful for illustrating how all these design parameters interact for representative PC-type applications. The bottom line is that regardless of memory type, main memory latency is nearly two orders of magnitude larger than the processor clock period, and an effective cache hierarchy is needed for good performance. In my example, DRDRAM has an average read latency over 20% greater than PC100 SDRAM, yet CPU performance is only reduced by about 5% and 7% for a 256 Kbyte and 128 Kbyte L2 cache, respectively.
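To make that damping concrete, consider the 32-device DRDRAM configuration with a 256 Kbyte L2 using the figures above. The extra latency is only paid on accesses that miss both caches, so it adds roughly 0.5 × 0.03 × 0.16 × (91.2 − 73.8) ≈ 0.042 clocks per instruction, raising the CPI from 0.753 to 0.794 and cutting performance by about 5%, which matches the 0.95 normalized figure in the table.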

Of course, individual programs can and will vary greatly in how memory characteristics affect their performance. A program that chases long chains of linked-list records through a large memory footprint will thrash the caches, and the low latency of SDRAM will really shine. On the other hand, large sequential memory transfers with little computation can easily saturate SDRAM bandwidth, and Direct Rambus will have an advantage (largely due to the faster system bus). For code that plays nicely within the caches, the memory type will have virtually no impact at all. The sketch below illustrates these two extreme access patterns.
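As an illustration only, here is a toy Python contrast between the two patterns; the function names are mine, and Python's interpreter overhead swamps actual DRAM timing, so this shows the shape of the access patterns rather than measuring them (a real microbenchmark would be written in C with careful timing).

```python
import random

def pointer_chase(n, steps):
    """Follow a random permutation through a large array. Every load
    depends on the previous one, so caches and prefetchers cannot
    help: performance is bound by memory latency."""
    perm = list(range(n))
    random.shuffle(perm)
    i = 0
    for _ in range(steps):
        i = perm[i]          # serialized, cache-hostile dependent loads
    return i

def stream_sum(data):
    """Walk the same amount of data sequentially. Accesses are
    independent and predictable, so performance is bound by memory
    bandwidth instead."""
    total = 0
    for x in data:           # prefetch-friendly sequential traversal
        total += x
    return total
```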

