SIMD Memory Pipeline
While bragging rights for a processor seem to be tied to the number of TFLOP/s, the key to getting any work done is an excellent memory pipeline and hierarchy to back it. The memory hierarchy is also where the trade-offs between performance and programmability are most visible. For instance, developers find cache coherency with a sequential consistency model the easiest to program. Both x86 and IBM’s zSeries have very strong ordering models with tight restrictions on how loads and stores in different processors can interact. Historically, GPUs were non-coherent and had extremely weak (if any) ordering models. However, as GPUs target general purpose workloads, they are beginning to adopt more traditional memory hierarchies.
The memory hierarchy for Cayman is similar to previous generations, with incremental improvements. The primary emphasis is performance for graphics and not general purpose programmability. AMD makes extensive use of two explicitly addressed memory structures and separate read and write paths, each with specialized caches. Separating out the read and write data paths improves performance and throughput, but makes any semblance of coherency extremely expensive. For graphics, that is no problem; for general purpose workloads, coherency is highly desirable, hence Nvidia’s choice to use a more CPU-like memory hierarchy.
Figure 4 – Cayman SIMD Memory Pipeline and Comparisons
In Cayman and Cypress, the first explicitly addressed (and architecturally visible) memory structure is the 32KB Local Data Share, or LDS, which resides in each SIMD. The LDS is exposed through the OpenCL and DirectCompute specifications, which require 16KB and 32KB arrays, respectively. The LDS is accessed by ALU wavefronts and is used for read/write communication and synchronization within a work-group. Each VLIW bundle can load two 32-bit values from the LDS, using one of the ALU operations. For high-end devices, the LDS is a 32-way banked structure; each bank is a 1R+1W array that is 32 bits (or 4B) wide, for a theoretical bandwidth of 128B/cycle. The banks in the LDS also include simple execution units for local atomic operations on data in the LDS (e.g. arithmetic, logical, min/max, compare and swap). The LDS includes bank conflict detection and resolution hardware, which serializes accesses to each bank. If there are N accesses to the same bank, the load will take N cycles and reduce bandwidth by a factor of N. The LDS can also broadcast a single value to multiple execution units. Low-end parts may have only 16 banks and, accordingly, half the bandwidth.
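The serialization rule above can be sketched with a small model. It assumes the conventional word-interleaved bank mapping (address divided by 4, modulo 32) and treats accesses to an identical address as a single broadcast; both details are assumptions, not stated in the text.

```python
from collections import defaultdict

def lds_access_cycles(addrs, num_banks=32, word_bytes=4):
    """Cycles for one batch of LDS accesses: N accesses to distinct
    addresses in the same bank serialize into N cycles, while accesses
    to the same address are satisfied by a single broadcast.
    Bank mapping (word-interleaved) is an assumed convention."""
    per_bank = defaultdict(set)              # bank -> distinct addresses
    for a in addrs:
        per_bank[(a // word_bytes) % num_banks].add(a)
    return max((len(s) for s in per_bank.values()), default=0)

# 32 lanes reading consecutive 4B words hit all 32 banks: 1 cycle
assert lds_access_cycles([i * 4 for i in range(32)]) == 1
# A stride of two words puts two distinct addresses in each even bank: 2 cycles
assert lds_access_cycles([i * 8 for i in range(32)]) == 2
# Every lane reading the same word is a broadcast: still 1 cycle
assert lds_access_cycles([0] * 32) == 1
```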
One of the improvements in Cayman is more efficiently moving data into the LDS from memory. In Cypress, moving data into the LDS takes a memory instruction and an ALU instruction. Data must first be loaded from memory into the register files and then subsequently moved from the register files into the LDS. Cayman can directly fetch from memory into the LDS, eliminating the ALU instruction altogether.
The second architecturally visible memory structure is the Global Data Share (GDS), which is 64KB and shared by the entire GPU. The GDS plays a similar role to the LDS, but for sharing and communication across an entire kernel, rather than just a work-group. It is also 32-banked, with a 25 cycle access latency, and includes atomic execution units and counters for append instructions and reductions. While not technically a part of the SIMD (since it is a globally shared structure), the GDS is explicitly available to each SIMD. The GDS is a structure that does not correspond to anything in the OpenCL or DirectCompute specification (unlike the LDS), and must be accessed and exposed through a vendor-specific extension. However, it is used by the drivers for certain DirectX features such as append/consume buffers and UAV counters.
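To make the role of the append counters concrete, here is a toy model of an append buffer built on a single atomic counter, which is what the GDS counters provide in hardware for DirectX append/consume buffers. The class and its names are invented for illustration, and the lock merely stands in for a hardware atomic increment; this is not AMD's implementation.

```python
import threading

class AppendBuffer:
    """Toy append buffer: a shared atomic counter hands out slots,
    so many producers can append without coordinating with each other.
    The lock stands in for a hardware fetch-and-increment."""
    def __init__(self, capacity):
        self.data = [None] * capacity
        self.count = 0
        self._lock = threading.Lock()

    def append(self, value):
        with self._lock:                 # atomic fetch-and-increment
            idx = self.count
            self.count += 1
        self.data[idx] = value
        return idx

buf = AppendBuffer(64)
threads = [threading.Thread(target=lambda v=v: buf.append(v)) for v in range(16)]
for t in threads: t.start()
for t in threads: t.join()
assert buf.count == 16
assert sorted(buf.data[:16]) == list(range(16))
```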
AMD’s GPUs also include two levels of non-coherent texture caches which can be used for read-only data access, in addition to the explicitly accessed LDS and GDS. As previously noted, the AMD architecture uses dedicated clauses/wavefronts for memory accesses. These memory (or texture) wavefronts do not execute simultaneously with ALU wavefronts, although they can be interleaved as discussed previously. This separation is critical for AMD’s microarchitecture. Rather than include a dedicated address generation unit, the normal ALU pipelines are used to calculate virtual addresses for the wavefront (like many RISC architectures). Once the address has been calculated, it is passed to the Texture Mapping Units for address translation and the actual cache access. The TMUs will translate the request address into the appropriate co-ordinates (e.g. for pixel or vertex loads).
The four TMUs will probe the fully associative, 8KB L1 texture cache for 128-bit (16B) accesses each cycle. The cache is organized into 64B lines, so that aligned accesses in each quarter wavefront will target a single cache line. If the requests are properly structured, Cayman can achieve an aggregate 1.3 TB/s from the L1 caches, versus just over 1 TB/s for Cypress. Cayman also improves the realizable bandwidth from the L1 cache with better memory coalescing for general purpose loads.
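The aggregate figures can be reproduced from the per-cycle numbers, assuming the shipping configurations: 24 SIMDs at 880MHz for Cayman and 20 SIMDs at 850MHz for Cypress. Those SIMD counts and clocks are not stated in the text above and are assumptions about the top-end parts.

```python
def l1_bandwidth_tbps(num_simds, clock_ghz, tmus_per_simd=4, bytes_per_access=16):
    """Aggregate L1 texture cache bandwidth: each TMU can probe for a
    16B access every cycle, so each SIMD reads up to 64B/cycle."""
    return num_simds * tmus_per_simd * bytes_per_access * clock_ghz / 1000.0

# Assumed shipping configurations (not stated in the text):
# Cayman at 24 SIMDs / 0.88 GHz, Cypress at 20 SIMDs / 0.85 GHz
cayman  = l1_bandwidth_tbps(24, 0.88)   # ~1.35 TB/s
cypress = l1_bandwidth_tbps(20, 0.85)   # ~1.09 TB/s
assert round(cayman, 2) == 1.35
assert round(cypress, 2) == 1.09
```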
GPUs are focused on maximizing bandwidth and one of the key optimizations is memory coalescing – grouping together aligned reads or writes that have good locality. Each coalesced memory access can use a single address and request, while moving many pieces of data. In AMD GPUs, there are several different granularities for exploiting locality. The first level is maximizing the utilization of each 128-bit register, which can hold four 32-bit data values. For Cypress, this requires explicitly using 4-wide memory accesses (or 2-wide for 64-bit data). Otherwise, a memory wavefront will send one address for each data item and take 16 cycles to execute, instead of the normal 4. Cayman can coalesce a wavefront of single 32-bit memory accesses so that it executes over the normal 4 cycles (or 8 cycles for 64-bit data). This hides the underlying 4-wide vectors from programmers by implicitly coalescing together scalar memory accesses and maximizes the bandwidth from the memory hierarchy for both loads and stores.
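These register-packing rules can be captured in a simple cycle model. The behavior for uncoalesced accesses narrower than 32 bits per register is inferred from the 16-cycle figure above (one address per data element), so treat the general formula as an extrapolation rather than a documented rule.

```python
def memory_wavefront_cycles(elem_bits=32, vector_width=1,
                            coalesces_scalars=False,
                            wavefront=64, lanes=16):
    """Cycles to execute one memory wavefront under the rules above.
    The uncoalesced formula extrapolates from the stated 16-cycle case."""
    base = wavefront // lanes                  # fully packed: 4 cycles
    bits_per_address = vector_width * elem_bits
    if bits_per_address == 128:                # explicit vec4 (or vec2 x 64b)
        return base
    if coalesces_scalars:                      # Cayman's implicit coalescing
        return base * (elem_bits // 32)        # 4 cycles for 32b, 8 for 64b
    return base * (128 // bits_per_address)    # Cypress: one address per element

# Cypress, scalar 32-bit loads: one address per value -> 16 cycles
assert memory_wavefront_cycles(32, 1, coalesces_scalars=False) == 16
# Cypress, explicit 4-wide loads: the normal 4-cycle wavefront
assert memory_wavefront_cycles(32, 4, coalesces_scalars=False) == 4
# Cayman coalesces scalar 32-bit loads back down to 4 cycles...
assert memory_wavefront_cycles(32, 1, coalesces_scalars=True) == 4
# ...and scalar 64-bit loads to 8 cycles
assert memory_wavefront_cycles(64, 1, coalesces_scalars=True) == 8
```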
The second level of locality is coalescing together 4 different 16B accesses into a single cache line access. Both Cayman and Cypress are quite efficient here; read coalescing is done in the texture caches, while write coalescing is done further out in the pipeline. There is a third level of locality, coalescing together memory requests, but that occurs even further out in the pipeline.
The TMUs also include fixed function texture filtering hardware, which is primarily useful for graphics but can be used for some very specialized compute workloads. The texture units sample and interpolate textures. The sampling and interpolation can be ordinary bi-linear or tri-linear filtering, or anisotropic filtering to correct for surfaces viewed at oblique angles. The texture hardware can massively boost performance for the rare compute application that needs filtering (e.g. medical imaging), by off-loading the filtering from the general purpose execution units to dedicated hardware.
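For reference, bi-linear filtering, the simplest mode mentioned above, blends the four nearest texels by the fractional sample position. A minimal sketch, assuming clamp-to-edge addressing (the addressing mode is an assumption, and real TMUs do this in fixed function hardware):

```python
import math

def bilinear_sample(tex, u, v):
    """Bilinear texture filtering: weight the four nearest texels by
    the fractional sample position. tex is a 2D list indexed [row][col].
    Clamp-to-edge addressing is assumed for out-of-range coordinates."""
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    fx, fy = u - x0, v - y0
    h, w = len(tex), len(tex[0])
    def t(x, y):                     # clamp-to-edge texel fetch
        return tex[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]
    top = t(x0, y0) * (1 - fx) + t(x0 + 1, y0) * fx
    bot = t(x0, y0 + 1) * (1 - fx) + t(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bot * fy

tex = [[0.0, 1.0],
       [2.0, 3.0]]
assert bilinear_sample(tex, 0.5, 0.5) == 1.5   # midpoint of all four texels
assert bilinear_sample(tex, 0.0, 0.0) == 0.0   # exactly on a texel
```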
Fermi’s memory pipeline is slightly different. First, it has a unified address space and virtual memory. Second, the shared memory and the coherent L1D use the same underlying physical structure (as opposed to separate structures for Cayman’s LDS and L1 texture cache). This 64KB array is 32-banked and effectively has 128B lines for the L1D. Since it is dual-use, the shared memory and L1D cache can only receive 128B every other cycle, for an effective bandwidth of 64B/cycle.
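The per-SIMD (or per-SM) local-memory bandwidths implied by the figures above make the contrast concrete; both numbers come directly from the banking descriptions in the text.

```python
# Cayman LDS: 32 banks, each 4B wide, accessible every cycle
cayman_lds_bw = 32 * 4       # 128 B/cycle per SIMD
# Fermi shared memory/L1D: one 128B line every other cycle
fermi_shared_bw = 128 // 2   # 64 B/cycle per SM
assert cayman_lds_bw == 2 * fermi_shared_bw
```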