By: --- (---.delete@this.redheron.com), December 8, 2022 10:44 am
Room: Moderated Discussions
Simon Farnsworth (simon.delete@this.farnz.org.uk) on December 8, 2022 7:22 am wrote:
> Etienne (etienne_lorrain.delete@this.yahoo.fr) on December 8, 2022 6:20 am wrote:
> > Looks like my AMD Ryzen 9 7950x has an L3 cache bandwidth of 63.9 GB/s, my current DRAM DDR5
> > has either 49.6 GB/s (JEDEC) or 52.5 GB/s (AMD EXPO) measured by memtest86 UEFI.
> > It seems some companies are increasing DRAM bandwidth: 8Gbps DDR5.
> >
> > I assume latency to L3 cache is still probably better than latency to
> > DRAM, but in simple terms, do we still need L3 cache in processors?
>
> A critical difference between peak DRAM throughput, and L3 throughput, is that L3 throughput is independent
> of access pattern (as long as you never leave L3, of course) - you get the same throughput from L3 whether
> you read cachelines sequentially, or whether you read cachelines in a random order, and you get the same
> throughput when writing whether you write sequentially, or whether you write in a random order. There's
> also no penalty for mixing writes and reads - the timings are the same whether you read then read another line,
> read then write, write then read, or write then write another line.
>
> DDR5 doesn't offer that - your throughput is lower if you read or write 64 byte chunks at
> random throughout the chip than if you arrange to stay in the same bank group as much as
> possible. There's also a small penalty for mixing reads and writes, so you benefit from L3
> if it lets you do more writes in sequence before switching back to reads or vice-versa.
Not really. This may be the case with lousy designs, but:
- L3 will be banked, so (in theory) a worst-case pattern will hammer a single bank...
BUT in reality this is not an issue, because addresses will be hashed before being distributed over banks.
- then, a GOOD memory controller will do the same thing, hashing addresses so that, as much as possible, it's hard to construct a realistic access pattern that hammers a single bank of DRAM rather than spreading maximally over all the available banks and ranks (a rough sketch of the idea follows below).
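To make the hashing point concrete, here is a minimal sketch in C of the generic XOR-fold idea (purely illustrative; the field widths and which bits get folded are my assumptions, not any particular vendor's mapping): the bank index is the XOR of several address bit fields, so power-of-two strides that would naively alias to one bank get spread over all of them.

#include <stdint.h>

/* Toy XOR-fold bank hash - illustrative only, not any shipping design.
   Assumes 16 banks (4 index bits) and 64-byte cachelines.  Instead of
   taking the bank index straight from the low line-address bits, XOR in
   two higher bit fields so that power-of-two strides spread out.       */
static inline unsigned bank_index(uint64_t paddr)
{
    uint64_t line = paddr >> 6;                      /* cacheline number */
    unsigned lo   = (unsigned)( line       & 0xF);   /* naive bank bits  */
    unsigned mid  = (unsigned)((line >> 4) & 0xF);   /* next 4 bits      */
    unsigned hi   = (unsigned)((line >> 8) & 0xF);   /* 4 bits above     */
    return lo ^ mid ^ hi;                            /* folded index     */
}

With a plain modulo mapping, a stride of 16 lines would land in the same bank every time; with the fold above, consecutive accesses at that stride walk through all 16 banks instead.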
Mixing reads and writes, likewise, is design dependent. The traditional algorithms like open-page-first and FR-FCFS are (like their cache counterparts) naive tools from the days when transistors were expensive; if you're willing to burn transistors you can do far, far better. My M1 PDFs (I think it's volume 3) describe the evolution of the Apple memory controller, which uses a very clever three-level scheme (plus the willingness to have fairly large queues in the controller) to sort requests so as to optimize both open-page hits and read/write turnaround WHILE STILL maintaining QoS for the various clients.
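As a crude illustration of what "burning transistors" on the scheduler buys you (this is a toy priority function I made up for the example, NOT Apple's three-level scheme), a controller with a deep request queue can score each pending request so that row-buffer hits and same-direction requests issue first, with an age cap so that no client gets starved:

#include <stdbool.h>
#include <stdint.h>

/* Toy DRAM request scoring - illustrative only.                        */
struct req {
    unsigned bank;          /* target bank                              */
    unsigned row;           /* target row                               */
    bool     is_write;
    uint64_t age;           /* cycles spent waiting in the queue        */
};

struct bank_state {
    bool     row_open;
    unsigned open_row;
};

#define AGE_LIMIT 512       /* assumed QoS knob, value picked arbitrarily */

/* Higher score = issue sooner.  A real controller would also fold in
   per-client priorities, refresh deadlines, power state, etc.          */
static int score(const struct req *r,
                 const struct bank_state *banks,
                 bool bus_in_write_mode)
{
    int s = 0;

    if (banks[r->bank].row_open && banks[r->bank].open_row == r->row)
        s += 4;             /* row hit: no precharge + activate needed  */

    if (r->is_write == bus_in_write_mode)
        s += 2;             /* same direction: no read/write turnaround */

    if (r->age > AGE_LIMIT)
        s += 100;           /* starvation guard / QoS                   */

    return s;
}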
In my testing I did not see notable M1 DRAM bandwidth falloff for any pattern I tried, from full read through various mixed read+write to full write. (Of course having the SLC as a huge memory-side data buffer also helps...)
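For anyone who wants to run that sort of test themselves, here is the flavor of it - a minimal C sketch (buffer size, ratios, and the timing harness are my assumptions here, not the exact harness I used) that streams a buffer much larger than the caches at a chosen write fraction and reports GB/s:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES (1ULL << 30)              /* 1 GiB, assumed >> any cache */

/* Stream the buffer once with the given write fraction; return GB/s.   */
static double run(uint64_t *buf, double write_fraction)
{
    size_t n = BUF_BYTES / sizeof(uint64_t);
    uint64_t sum = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++) {
        if ((double)(i % 100) < write_fraction * 100.0)
            buf[i] = i;                     /* write stream              */
        else
            sum += buf[i];                  /* read stream               */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile uint64_t sink = sum;           /* keep the reads live       */
    (void)sink;

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (double)BUF_BYTES / 1e9 / secs;
}

int main(void)
{
    uint64_t *buf = malloc(BUF_BYTES);
    if (!buf) return 1;
    memset(buf, 1, BUF_BYTES);              /* touch every page first    */
    for (double w = 0.0; w <= 1.0; w += 0.25)
        printf("write fraction %.2f: %.1f GB/s\n", w, run(buf, w));
    free(buf);
    return 0;
}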