For GPUs, most of the memory hierarchy lies outside the cores (e.g. Cayman SIMDs) and is instead tightly coupled to each of the memory channels. One difference between a CPU and GPU is that the outer parts of the memory hierarchy in a GPU include actual execution resources, rather than using the ALUs and FPUs in each core. For Cayman, the memory hierarchy includes the L2 texture cache, the majority of the store path and the read/write caches. Note that Cypress, unlike a CPU or Fermi, has three different specialized paths for reads, simple writes and more complex writes. Cayman largely retains the previous organization, but with some subtle enhancements.
The last level of read cache in AMD GPUs is a unified 512KB L2 cache that is shared by all the SIMDs. The L2 is partitioned into a 64KB slice for each of the 8 GDDR5 memory channels. Each L2 slice can read out a single 64B line per cycle, for a total bandwidth of 450 GB/s to the L1 caches. As in most GPUs, the L2 cache is intended to exploit the spatial (rather than both spatial and temporal) locality of data.
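The quoted 450 GB/s figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes the L2 slices run at Cayman's ~880MHz base clock (the HD 6970 frequency); that clock is an assumption, not stated in the text:

```python
# Hypothetical sanity check of the 450 GB/s L2 read bandwidth figure,
# assuming the slices run at Cayman's ~880 MHz base clock (HD 6970).
SLICES = 8          # one 64KB L2 slice per GDDR5 memory channel
LINE_BYTES = 64     # each slice reads out one 64B line per cycle
CLOCK_HZ = 880e6    # assumed base frequency

bandwidth_gb_s = SLICES * LINE_BYTES * CLOCK_HZ / 1e9
print(f"{bandwidth_gb_s:.1f} GB/s")  # 450.6 GB/s, matching the quoted ~450 GB/s
```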
AMD GPUs separate out the read and write data paths, and there are a set of caches that are specialized for stores to memory. The write caches work closely with the ROPs and are grouped together with each of the 8 memory channels. There is a write combining cache (WCC) for each memory channel, which coalesces multiple writes to a single cache line into one transaction. The WCCs also buffer up many writes so that they can be performed in a single batch and achieve maximum bandwidth. In Cayman and Cypress the WCCs are 4KB each, for a total of 32KB.
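The coalescing behavior can be illustrated with a toy model. This is a minimal sketch, assuming 64B cache lines; the class and method names are illustrative, not AMD's:

```python
# Toy model of write combining: writes to the same 64B line merge into one
# pending line transaction, and flush() drains all lines as a single batch.
from collections import defaultdict

LINE = 64  # assumed bytes per cache line

class WriteCombiner:
    def __init__(self):
        self.pending = defaultdict(dict)  # line base address -> {offset: byte}

    def write(self, addr: int, data: bytes):
        for i, b in enumerate(data):
            a = addr + i
            self.pending[(a // LINE) * LINE][a % LINE] = b

    def flush(self):
        """Drain all buffered lines as one batch of line transactions."""
        batch = sorted(self.pending)
        self.pending.clear()
        return batch

wcc = WriteCombiner()
wcc.write(0x100, b"\x01" * 4)  # these two 4B writes land in the same line,
wcc.write(0x104, b"\x02" * 4)  # so they coalesce into one transaction
wcc.write(0x180, b"\x03" * 4)  # a write to a second line
print(wcc.flush())             # [256, 384]: only two line transactions
```

Batching the flush is what lets the real hardware schedule writes together for maximum bandwidth, as described above.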
Figure 5 – Cayman Memory Hierarchy and Comparisons
Figure 5 shows one partition of the outer memory hierarchy for Cayman, Cypress and Fermi. Cayman and Cypress both have 8 partitions that run at the base frequency. Fermi has a wider memory interface and has 12 partitions, although these blocks run at 600MHz in compute oriented products.
For graphics, the ROPs are responsible for writing a render target out to memory. Each memory channel includes a 16KB color cache, 16 execution units for Z/stencil operations and 4 for color operations (the latter is also used for multi-sample anti-aliasing blends). All of this hardware is re-used for compute workloads in various ways.
Simple stores that access data in multiples of 32-bits can proceed directly from the WCC to the memory controllers, along what AMD terms the FastPath. This most closely corresponds to the behavior of the depth buffer in graphics workloads and maps onto similar underlying hardware.
More complicated memory accesses that require global atomics or data type conversion (e.g. data smaller than 32-bits) go through additional hardware (the so-called CompletePath) and incur extra latency and bandwidth. The CompletePath most closely maps to the color buffer path in hardware. The color cache is a fully read/write capable cache with the appropriate ordering semantics; in compute mode it is used for global atomic operations.
Atomic operations are split transactions – following the classic pattern of a read-modify-write. First, data is loaded from memory into the color cache and waits for acknowledgement – which can take many cycles. Then the data is modified in the cache, using the color execution units, and written back to memory. Atomic operations that do not return a value can be pipelined and use 2X the bandwidth of a normal store, because they perform both a read and a write.
For atomic operations that return the old unmodified value (e.g. all OpenCL atomics), the bandwidth consumed is even higher. An additional read and write operation to memory are needed to return the original value, increasing the bandwidth requirements by a factor of 4X over the case of a simple non-atomic store.
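The 1X/2X/4X traffic ratios above can be captured in a rough cost model. This is a sketch under the assumption that a simple store costs one unit of interface traffic; the names are illustrative, not AMD terms:

```python
# Rough memory-traffic model for the operation types described above,
# measured in simple-store equivalents (assumed baseline of 1).
STORE = 1               # a simple store: one write transaction
ATOMIC_NO_RETURN = 2    # read-modify-write: one read plus one write
ATOMIC_WITH_RETURN = 4  # an additional read and write to return the old value

def traffic(n_ops: int, cost: int) -> int:
    """Total traffic, in simple-store equivalents, for n_ops operations."""
    return n_ops * cost

print(traffic(1000, STORE), traffic(1000, ATOMIC_WITH_RETURN))  # 1000 4000
```

Since every OpenCL atomic returns the old value, OpenCL atomics on this hardware always pay the 4X cost.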
The last stop in the memory hierarchy is the actual graphics memory itself. The memory controllers are 64-bits wide and drive two 32-bit GDDR5 channels. The address space is partitioned into 256B blocks and distributed across the 8 memory channels, so a consecutive 2KB access will stripe across all the channels. The goal of the memory controller is to maximize bandwidth, which relies on both coalescing and also write buffering. A challenge with DDR and GDDR memory interfaces is that they are bi-directional, and insert dead cycles to switch between reading and writing data – wasting bandwidth. One of the purposes of the WCCs is to buffer writes over a period of time and then schedule them into a single batch write, reducing the dead cycles and increasing the available bandwidth.
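The 256B interleaving scheme can be sketched as a simple address-to-channel mapping. This is an illustrative model of the striping described above (256B granularity across 8 channels), not AMD's actual hash:

```python
# Illustrative address interleaving: 256B blocks striped across 8 channels,
# so a consecutive 2KB access (8 x 256B) touches every channel exactly once.
BLOCK = 256
CHANNELS = 8

def channel(addr: int) -> int:
    """Memory channel servicing the 256B block containing addr."""
    return (addr // BLOCK) % CHANNELS

print([channel(a) for a in range(0, 2048, 256)])  # [0, 1, 2, 3, 4, 5, 6, 7]
print(channel(2048))                              # 0 (the stripe wraps around)
```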
Fermi’s more programmable memory hierarchy has an L2 cache with 32B lines that is used for both reads and writes. It is weakly ordered, with synchronization enforced between kernels and via explicit instructions. The L2 is also used by the ROPs for graphics and by the atomic execution units for compute workloads. In graphics products, this block runs at the graphics frequency used by fixed-function hardware, while in compute products it can run at a separate frequency (600MHz for the Tesla C2070).