The Direct Path Isn’t the Shortest
Figure 2 shows the timing of two page read cycles to an 800 Mbps Direct Rambus memory system using the lowest latency (“dash forty”) DRDRAM devices available (tCAC = 8). I included the timing for both a lightly loaded Rambus channel (a single 4 device RIMM) and a heavily loaded one (two 16 device RIMMs) to show how memory loading can affect performance. According to version 0.9 of Rambus’s RIMM module specification document, the delay to the last device in a 4 device RIMM can be just under 1.25 ns (minus a tiny bit for the trailing unloaded traces), while for two 16 device RIMMs the signal time of flight to the last device can be as much as 4.12 ns (minus a smaller bit for the shorter trailing unloaded traces). Thus the difference in round trip read latency between a lightly loaded and a heavily loaded Rambus system can easily be as much as 2 x (4.12 – 1.25) = 5.74 ns, even without taking into account the effect of four extra connector crossings and two extra transits of motherboard connecting traces.
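The round trip figure is simple enough to check by hand, but a short Python sketch makes the assumption explicit: the extra time of flight is paid once on the way out (command) and once on the way back (data). The function name is illustrative only, and the connector and motherboard trace effects mentioned above are deliberately left out.

```python
def round_trip_delta_ns(light_flight_ns, heavy_flight_ns):
    """Extra round trip latency of the heavily loaded channel.

    The one-way flight time difference counts twice: once for the
    command travelling out and once for the data travelling back.
    """
    return 2 * (heavy_flight_ns - light_flight_ns)


# Worst-case one-way flight times quoted from the RIMM specification
delta_ns = round_trip_delta_ns(light_flight_ns=1.25, heavy_flight_ns=4.12)
print(f"Round trip difference: {delta_ns:.2f} ns")
```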
In this example the CPU communicates with the chipset using a 133 MHz system bus. The address and control information is latched into the chipset, and another 7.5 ns cycle is then spent on address decoding and arbitration for the memory controller. The translated address and control information must then be broadside loaded into multiple 8 bit shift registers and clocked out to the Rambus channel at 800 Mbps, starting on the next available falling edge of the 400 MHz clock. If the channel runs either asynchronously or with no predetermined phase relationship to the system bus clock, an extra 2.5 or 5.0 ns clock cycle may be needed to double sample the address and control data across the interface between the two timing domains. Since a DRDRAM read operation is 16 bytes long, the memory controller issues the two back-to-back read operations needed to supply the 32 bytes requested by the CPU.
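As a minimal sketch of that last step, the split of one 32 byte CPU request into 16 byte DRDRAM reads might look like the following. The function name, the alignment check, and the example address are assumptions made for illustration; a real memory controller does this in hardware as part of address translation.

```python
def split_into_reads(line_addr, line_bytes=32, read_bytes=16):
    """Return the start address of each DRDRAM read covering one CPU line.

    Assumes the CPU request is aligned to its own size (hypothetical check).
    """
    if line_addr % line_bytes != 0:
        raise ValueError("CPU request must be aligned to the line size")
    return [line_addr + offset for offset in range(0, line_bytes, read_bytes)]


# One 32 byte request becomes two back-to-back 16 byte reads
reads = split_into_reads(0x1000)
print([hex(addr) for addr in reads])
```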
The two 10 ns long (eight words x 1.25 ns) read command packets travel along the Rambus channel until they reach the target memory device. After a minimum tCAC latency of 20 ns, the read data is returned on the Rambus data channel in the form of two 10 ns long data packets of 16 bytes each (eight 16 bit words each). When the data packets reach the memory controller, the data is clocked serially into 16 eight-bit shift registers. The data is parallelized and then transferred to the system bus for return to the CPU. Again, if the Rambus channel runs either asynchronously or with no predetermined phase relationship to the system bus clock, an extra 7.5 ns clock cycle will be needed to double sample the data across the interface between the two timing domains.
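The packet timings quoted here all follow from the 800 Mbps transfer rate and the 400 MHz clock. The sketch below reproduces them; the variable names are illustrative, and it assumes, as the figures in the text imply, that tCAC is counted in 400 MHz clock cycles with data moving on both clock edges.

```python
DATA_RATE_MBPS = 800      # per-pin transfer rate of the channel
WORDS_PER_PACKET = 8      # each packet is eight transfers long
DATA_BUS_BITS = 16        # width of the Rambus data channel
TCAC_CYCLES = 8           # "dash forty" devices, tCAC = 8

bit_time_ns = 1000 / DATA_RATE_MBPS          # 1.25 ns per transfer
clock_ns = 2 * bit_time_ns                   # 2.5 ns, two transfers per clock
packet_ns = WORDS_PER_PACKET * bit_time_ns   # 10 ns per packet
tcac_ns = TCAC_CYCLES * clock_ns             # 20 ns minimum CAS latency
packet_bytes = WORDS_PER_PACKET * DATA_BUS_BITS // 8  # 16 bytes per data packet

print(f"packet: {packet_ns} ns, tCAC: {tcac_ns} ns, payload: {packet_bytes} bytes")
```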
A further delay in generating the CPU read data packet is introduced by Direct Rambus’s lack of support for critical word first bursting. The memory controller can arrange to transmit the address of the 16 byte sub-block containing the critical word in the first read command packet, but it still has to wait for the first data packet to be captured in its entirety before it is guaranteed to have the critical word. In my example in Figure 2, the read latency is 11-1-1-1 cycles (82.5 ns) for the lightly loaded Rambus channel and 12-1-1-1 cycles (90 ns) for the heavily loaded Rambus channel. A bank read operation, unlike the page reads shown here, requires the activate command packets to precede the read command packets by 17.5 ns (tRCD = 7). This will add 2 or 3 more system bus cycles of latency to the access, depending on the exact details of system timing.
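The conversions between system bus cycles and nanoseconds can be sketched as follows. The names are illustrative, and reading the "2 or 3" extra cycles as the rounded down and rounded up values of tRCD in bus cycles is my interpretation; the real figure depends on how the activate packet lines up with the system bus clock.

```python
import math

SYSTEM_BUS_CYCLE_NS = 7.5   # 133 MHz system bus period used in the text
RAMBUS_CLOCK_NS = 2.5       # 400 MHz Rambus clock period
TRCD_CYCLES = 7             # tRCD = 7 Rambus clock cycles


def first_access_ns(lead_cycles):
    """Latency of the first transfer in an N-1-1-1 burst, in nanoseconds."""
    return lead_cycles * SYSTEM_BUS_CYCLE_NS


light_ns = first_access_ns(11)   # lightly loaded channel
heavy_ns = first_access_ns(12)   # heavily loaded channel

# A bank read adds tRCD, which does not divide evenly into bus cycles
trcd_ns = TRCD_CYCLES * RAMBUS_CLOCK_NS
extra_cycles = (
    math.floor(trcd_ns / SYSTEM_BUS_CYCLE_NS),
    math.ceil(trcd_ns / SYSTEM_BUS_CYCLE_NS),
)

print(f"light: {light_ns} ns, heavy: {heavy_ns} ns")
print(f"tRCD: {trcd_ns} ns, extra bus cycles: {extra_cycles[0]} to {extra_cycles[1]}")
```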