Direct Rambus DRAM, Part 2 – Operation and Performance


Row, Row, Row Your Propaganda Gently Down the Stream

I took my second example from the Intel web site. It is an article entitled “The Intel 820 Chipset and the Next-Generation Performance Platform”, attributed to Frank Hady, a senior staff systems architect at the Intel Architecture Labs. It makes extensive use of nifty two-dimensional graphs that show system bandwidth capability in terms of the bandwidth available to the CPU (y-axis) as a function of the bandwidth consumed by the AGP port (x-axis). The peak memory bandwidth in an 820 system is up to twice as large as the CPU bus bandwidth, while in a 440BX system the CPU bus bandwidth and memory bandwidth are largely in balance. It is therefore not surprising that Intel’s pretty graphs show that as you stress the 440BX system with higher AGP throughput, the bandwidth available to the CPU falls off, while the 820 system shows higher initial CPU bandwidth (remember that the 820 system runs its CPU bus at 133 MHz versus 100 MHz for the 440BX), which remains level until extremely high AGP bandwidth levels are reached.

So what the heck does that tell the user about how fast his programs will run, or how Rambus’s higher latency affects performance? Virtually nothing. As my simple model shows, the average bandwidth needed by most programs is rather modest. On small time scales a program will occasionally consume much higher levels of bandwidth, but the PC100 SDRAM based system still has plenty of headroom (and if not, there is PC133 and DDR). Does Rambus’s ability to better satisfy the occasional microburst of CPU memory activity outweigh its 20+% greater latency? Very unlikely for most programs.

The Hady article does make a disingenuous effort to address the latency issue. It includes a carefully crafted example comparing the latency distribution of memory accesses for 440BX and 820 based systems using 400 MHz Pentium II processors to run the Triad portion of John McCalpin’s Stream memory bandwidth benchmark. First of all, the 440BX system is outfitted with CAS latency three (CL3) PC100 memory. It is not easy to find PC100 memory that slow these days, so Intel should be given an E for effort and an F for ethics in that regard. Not only that, but apparently the 440BX is configured to add an additional clock cycle of latency (possibly by increasing the chip enable assertion leadoff time) to arrive at a minimum page read latency of 90 ns. Intel’s own data sheet for the 440BX chipset states that it can run with 7-cycle, 70 ns page hit operations using CL2 PC100 SDRAM (page 4-23).
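The 20 ns gap between those two figures is easy to reproduce. The sketch below converts bus cycles to nanoseconds at the 100 MHz PC100 clock. The function name `cycles_to_ns` is my own invention, and the breakdown of the 90 ns figure into the 7-cycle CL2 page hit plus one cycle for CL3 plus one more chipset cycle is my reading of the numbers above, not something taken from Intel's data sheet.

```c
/* Hypothetical helper: convert a number of memory bus clock cycles
 * into nanoseconds for a bus running at bus_mhz megahertz. */
double cycles_to_ns(int cycles, double bus_mhz)
{
    /* One cycle lasts 1000 / bus_mhz nanoseconds (10 ns at 100 MHz). */
    return cycles * 1000.0 / bus_mhz;
}

/* Hypothetical helper: latency penalty, in ns, of a configuration that
 * takes extra_cycles more than the baseline page hit. */
double handicap_ns(int baseline_cycles, int extra_cycles, double bus_mhz)
{
    return cycles_to_ns(baseline_cycles + extra_cycles, bus_mhz)
         - cycles_to_ns(baseline_cycles, bus_mhz);
}
```

Calling `cycles_to_ns(7, 100.0)` gives the data sheet's 70 ns page hit, `cycles_to_ns(9, 100.0)` gives the 90 ns minimum in Intel's test, and `handicap_ns(7, 2, 100.0)` gives the 20 ns difference.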

So, having established that Intel had handicapped the 440BX by 20 ns in the latency test, let’s have a look at the benchmark. The Stream benchmark is a program written by John D. McCalpin, a high-performance computing researcher who is a jokingly self-professed “Bandwidth Bigot”. The heart of the Triad portion of the C language Stream benchmark is the following inner loop:

 for (j=0; j < N; j++)
     a[j] = b[j]+scalar*c[j];

where a[], b[], and c[] are each arrays of one million double precision floating point values. This program is strongly atypical of most computer applications. What it does is repeatedly and sequentially stomp over a 24-million byte region in main memory. Every four iterations through the loop, four 32-byte burst transfers are performed: two to load a cache line's worth of b[] and c[] elements, one to write back a victim dirty cache line of modified a[] elements, and one to fetch a new cache line of a[] elements to be overwritten (assuming a write allocation caching policy). Assuming the inner loop ran at four clocks per iteration, the main memory would have to transfer 128 bytes every 40 ns (a sustainable bandwidth of 3.2 Gbyte/s!) to keep up with the 400 MHz processor. This is a funny way to compare memory latencies!
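For anyone who wants to check that arithmetic, here is a minimal sketch of the calculation. The function name `triad_required_gbytes_per_s` and its parameters are mine, invented for illustration, and the four clocks per iteration is the same assumption made above rather than a measured figure.

```c
/* Hypothetical helper: memory bandwidth (Gbyte/s, 10^9 bytes per second)
 * the Triad loop would demand if the CPU never stalled.
 *   cpu_mhz         - processor clock in MHz (400 for the Pentium II here)
 *   clocks_per_iter - assumed CPU clocks per loop iteration
 *   line_bytes      - cache line / burst size in bytes (32)
 *   elem_bytes      - size of one array element in bytes (8 for a double)
 *   bursts          - burst transfers per cache line's worth of iterations
 *                     (4: load b, load c, write back a, allocate a) */
double triad_required_gbytes_per_s(double cpu_mhz, double clocks_per_iter,
                                   int line_bytes, int elem_bytes, int bursts)
{
    /* Iterations needed to consume one cache line of each array: 32/8 = 4. */
    double iters_per_line = (double)line_bytes / elem_bytes;

    /* Bytes moved in that span: 4 bursts * 32 bytes = 128 bytes. */
    double bytes = (double)bursts * line_bytes;

    /* Time for that span: 4 iterations * 4 clocks * 2.5 ns = 40 ns. */
    double ns = iters_per_line * clocks_per_iter * 1000.0 / cpu_mhz;

    /* Bytes per nanosecond is numerically the same as Gbyte/s. */
    return bytes / ns;
}
```

Calling `triad_required_gbytes_per_s(400.0, 4.0, 32, 8, 4)` reproduces the 3.2 Gbyte/s figure. Changing `clocks_per_iter` shows how sensitive the result is to that one assumption.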

The latency distribution graph Hady provides shows that the DRDRAM latency running Stream Triad ranges from 82.5 ns (single RIMM system?) to 165 ns, with the bulk of accesses below 130 ns. The handicapped PC100 system shows latency from 90 ns to 240 ns, with the bulk of accesses in the 170 ns range. The distribution of latencies in the 440BX system also suggests that the most efficacious open page management policy was not used. This benchmark is unfair and misleading for three main reasons: the 440BX system was sandbagged, the choice of the Stream benchmark is biased in favour of the 820’s faster system bus, and Stream shows performance characteristics completely different from the vast majority of applications run on personal computers. Buyer beware!

