Textures, Rendering and Memory Pipelines
A modern GPU uses the texture and render output pipelines for loads and stores respectively. Unlike a CPU, where the load/store units are an integral part of the core, texturing and render output are decoupled from the computational core of a GPU. In the GT200, each Thread Processing Cluster (TPC) contains a group of three SMs, a shared texture pipeline and a port to communicate with the render output (ROP) units. Since this article focuses on the GPU as a compute device, we will omit a treatment of the graphical aspects of these two components and instead focus on their use for computational workloads. Figure 7 below shows the load and store pipelines for the GT200.
Figure 7 – Texturing Unit, ROPs and Memory Pipeline
Load and store instructions are generated in the SMs; however, they are issued to and executed in hardware that resides in an entirely different clock domain. First, the load and store instructions flow into the SM Controller, which arbitrates access to the memory pipelines and acts as the clock boundary between the texture units and the SMs. The load pipeline shares hardware with the texture pipeline, so the two are mutually exclusive and cannot be used simultaneously. The first step is address calculation – memory instructions use register plus offset addressing, so the effective address must be computed. Next, the 40-bit virtual address (compared to 32 bits of virtual addressing for the G80) is translated to a physical address by the MMU. Loads are then issued across a whole warp and sent over the intra-chip crossbar bus to the GDDR3 memory controllers. Store instructions are handled in a similar manner: first the addresses are calculated, and then the stores are sent across the intra-chip crossbar to the ROP units and on to the GDDR3 memory controllers. Atomic instructions (read-modify-writes) are also sent through the store pipeline and ROP units. While memory accesses are issued a warp at a time, they are executed by the memory controller in half-warp groups (i.e. 16 accesses at a time).
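The addressing and issue behavior described above can be sketched in a few lines of Python. This is purely illustrative – the function names and structure are our own, not NVIDIA's hardware logic – but it captures the two relevant facts: each thread's effective address is a base register plus an immediate offset, and a warp's 32 accesses are executed in two half-warp groups of 16.

```python
# Illustrative sketch (not NVIDIA's actual hardware logic): register plus
# offset addressing across a warp, executed in half-warp groups of 16.

WARP_SIZE = 32
HALF_WARP = 16

def warp_addresses(base_regs, offset):
    """Effective address for each thread: base register value + offset."""
    return [base + offset for base in base_regs]

def half_warp_groups(addresses):
    """Split a warp's 32 accesses into two half-warp groups of 16."""
    return [addresses[i:i + HALF_WARP] for i in range(0, WARP_SIZE, HALF_WARP)]

# Example: each thread loads a 4B word from a contiguous array at offset 256.
bases = [tid * 4 for tid in range(WARP_SIZE)]
groups = half_warp_groups(warp_addresses(bases, 256))
```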
The memory pipeline depicted in Figure 7 features two specialized texture caches. These caches are distinct from a traditional CPU cache in several ways. First of all, CPU caches have locality in a single dimension, because memory addressing for most architectures is linear. When a data word (which might be 4-8B) is requested, an entire 64B cache line is fetched – so on top of the requested data, another 56-60B is brought into the cache, because most of the time that neighboring data will be used in close temporal proximity to the originally requested data. Textures are fundamentally two-dimensional objects and are stored in memory so that locality is preserved in both the X and Y dimensions; ordinary data has only one dimension and is only contiguous with respect to that dimension. Consequently, texture caches must have two-dimensional locality to cache textures effectively. Typically, the memory controller is responsible for mapping the 2D texture memory space into one dimension using space-filling curves before the data reaches the cache, and the texture caches may include further modifications that improve locality and performance.
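A common space-filling curve for this purpose is the Z-order (Morton) curve, which interleaves the bits of the X and Y coordinates – NVIDIA does not disclose the exact layout the GT200 uses, so treat this as a representative example rather than the actual hardware mapping. The key property is that texels close together in 2D stay close together in the resulting 1D address space:

```python
def morton_index(x, y, bits=16):
    """Map a 2D texel coordinate to a 1D Z-order (Morton) index by
    interleaving the bits of x and y. Nearby (x, y) texels land at
    nearby 1D addresses, giving the cache 2D spatial locality."""
    idx = 0
    for i in range(bits):
        idx |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        idx |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return idx

# The 2x2 block at the origin occupies indices 0..3 – one contiguous run,
# unlike a row-major layout where (0,1) would be a full row stride away.
block = [morton_index(x, y) for y in (0, 1) for x in (0, 1)]
```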
Second, texture caches are read-only and have no coherency. When a texture is written, the entire texture cache hierarchy must be invalidated, rather than tracking the validity of individual data within the address space. Third, texture caches are used to save bandwidth and power only – in contrast to CPU caches, which are also essential for lowering latency. Texture caches cannot service requests out-of-order and do not meaningfully impact latency (in comparison, a CPU cache might lower latency from 100ns to 7ns). The L1 texture caches in the GT200 reside in the TPCs: each TPC has 24KB of L1 texture cache, partitioned into three 8KB caches. The L2 texture caches are located with the memory controllers; each is 32KB, for an aggregate 256KB across the entire device.
Memory performance is one of the most critical aspects of a GPU, since global memory accesses are not cached. To access memory efficiently, loads and stores must be aligned to 4B boundaries – an improperly aligned access will be compiled into multiple loads or stores. More importantly, load instructions across the threads of a half-warp can be coalesced by the memory controller into fewer memory transactions.
One of the most important improvements in the GT200, which few have discussed, is that the memory coalescing rules are much more flexible. For compute 1.0 and 1.1 devices (everything based on the G80), 16x32b accesses can be coalesced if they are 64B aligned and fall within a single 64B region; 16x64b accesses must be 128B aligned and fall within a single 128B region; and 16x128b accesses must be 128B aligned and span exactly two 128B regions. Additionally, the threads must access the words of the coalesced memory transaction in sequence – i.e. thread K must access the Kth data word. If the accesses cannot be coalesced, then they issue as 16 distinct memory transactions.
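The G80-era rules above are strict enough to express as a short predicate. The sketch below is our own formulation of those rules, not NVIDIA's implementation; it checks a half-warp of 16 byte addresses for sequential access and the required alignment (64B for 32-bit words, 128B for 64-bit and 128-bit words, the latter spanning exactly two 128B regions by construction).

```python
def coalesces_g80(addresses, word_size):
    """True if a half-warp's accesses coalesce under compute 1.0/1.1 rules.
    addresses: 16 byte addresses, one per thread (thread k -> addresses[k]).
    word_size: bytes per access - 4 (32b), 8 (64b) or 16 (128b)."""
    assert len(addresses) == 16 and word_size in (4, 8, 16)
    base = addresses[0]
    # Thread k must access the k-th consecutive word of the transaction.
    if any(addr != base + k * word_size for k, addr in enumerate(addresses)):
        return False
    # 32b words need 64B alignment; 64b and 128b words need 128B alignment.
    alignment = 64 if word_size == 4 else 128
    return base % alignment == 0

# A perfectly sequential, 64B-aligned half-warp of 4B loads coalesces;
# shifting every address by 4B breaks the alignment and it does not.
```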
Newer compute 1.2 devices (i.e. the GT200 and later) substantially relax these rules and also enable coalescing for 8b and 16b data words (using 32B and 64B memory transactions respectively). First, there is no longer any requirement that threads sequentially access the data words within a memory transaction – any ordering is fine, even ones where multiple threads read the same data word. Second, the algorithm handles misaligned accesses much more efficiently. Specifically, if 128B of data is accessed but is misaligned (i.e. spans multiple 128B regions), earlier compute devices would have spawned 16 separate accesses. The GT200 and its variants instead issue one memory transaction for each distinct region being accessed, rather than always falling back to 16 transactions. Additionally, the memory controller can truncate transactions to 32B or 64B to conserve bandwidth (e.g. if 128B of data is split with 96B in one region and 32B in another, it resolves as one 128B and one 32B transaction, not two 128B transactions).
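The relaxed algorithm can be sketched as a two-step process: group the half-warp's addresses by the 128B segment they fall in, then shrink each transaction to the smallest aligned 32B/64B/128B window that covers the bytes actually touched. This is a simplified model of the behavior described above (it assumes 32-bit or wider words and ignores the separate 8b/16b transaction sizing), not a description of the actual memory controller logic:

```python
def transactions_gt200(addresses, word_size):
    """Sketch of compute 1.2 coalescing: one transaction per 128B segment
    touched by the half-warp, truncated to 64B or 32B where possible.
    Returns the list of transaction sizes in bytes."""
    txns = []
    for seg in sorted({addr // 128 for addr in addresses}):
        in_seg = [a for a in addresses if a // 128 == seg]
        lo = min(in_seg)                      # first byte touched
        hi = max(a + word_size for a in in_seg)  # one past last byte touched
        size = 128
        # Truncate to the smallest aligned window covering [lo, hi).
        for smaller in (64, 32):
            if lo // smaller == (hi - 1) // smaller:
                size = smaller
        txns.append(size)
    return txns

# The article's example: 16x8B loads starting 32B into a 128B segment put
# 96B in one region and 32B in the next - one 128B plus one 32B transaction.
```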