The L2 cache deserves its own section for a number of reasons – not only is it the key to coherency and atomic operations, but it is also in its own clock domain.
The 768KB unified L2 cache is the sole memory agent, handling loads, stores and texture fetches – thus acting as an early global synchronization point. Like the L1 cache, it probably has 64B lines and many banks; the write policies will be discussed below. While Nvidia did not discuss any implementation details, the GT200 has a 256KB L2 texture cache, implemented as 8 slices of 32KB, one slice per memory controller. If Fermi follows that pattern, the L2 cache might be implemented as 6 slices of 128KB, one for each memory controller.
Unlike a CPU, the caches in Fermi are only semi-coherent, due to the relatively weak consistency model of GPUs. The easiest way to think about the consistency model is that memory operations are ordered between kernels, and whenever the programmer uses explicit synchronization primitives (e.g. atomics or barriers), but there are no ordering guarantees otherwise.
Each core can access any data in its shared L1D cache, but cannot generally see the contents of remote L1D caches. At the end of a kernel, cores must write through the dirty L1D data to the L2 to make it visible to both other cores in the GPU and the host. This might be described as a lazy write-through policy for the L1D, but it seems more accurately described as write-back with periodic synchronization.
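From the programmer's perspective, this visibility model is what makes the kernel boundary the only safe global synchronization point by default. A minimal sketch (the kernel names and data layout here are hypothetical, chosen only to illustrate the point):

```cuda
// Sketch: inter-block visibility across a kernel boundary.
// Stores from "producer" may sit in a core's L1D until the kernel
// ends, at which point dirty lines are written through to the
// shared L2 and become visible chip-wide.

__global__ void producer(int *data)
{
    // Each block writes one result; another block running
    // concurrently on a different core is NOT guaranteed to see it.
    data[blockIdx.x] = blockIdx.x * 2;
}

__global__ void consumer(const int *data, int *out)
{
    // Launched after producer completes, so every block observes
    // the written-back values via the L2.
    out[blockIdx.x] = data[blockIdx.x] + 1;
}
```

Communicating between blocks *within* a kernel requires explicit primitives such as atomics or memory fences, consistent with the weak ordering described above.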
The write policy for the L2 is likely some sort of write-back with support for streaming accesses. Some data is unlikely to be re-used as it is being streamed in and out of memory – this data should be written through to the global memory without any L1 or L2 allocation. Other data is likely to be re-used or shared, and should be written back and allocated in the L2. Typically CPUs have write-back with allocation for their caches, but certain memory regions are uncacheable and used for streaming – this seems like a reasonable guess for Fermi’s L1 and L2 design. The L2 cache replacement policy is currently unknown, but is probably some pseudo-LRU variant.
Since the L2 is used to make results globally visible, it can also be used to accelerate execution of global atomic operations, which enables more efficient synchronization between thread blocks across the chip.
Previously, atomic operations had to write back to global memory to make their results globally visible, so the minimum latency was around 350-400ns for GT200. If multiple operations in a warp contended for the same address, each atomic would execute serially, with an additional trip to memory for each – up to a 32X penalty in the worst case (~13,000ns). Shared memory atomics were vastly faster, but shared memory is obviously capacity limited and also cannot be used to communicate between blocks or cores.
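The worst-case pattern above is straightforward to express; the two hypothetical kernels below show the contended case that triggers full warp serialization, and the uncontended case that avoids it:

```cuda
// Sketch: atomic contention patterns on GT200-class hardware.

__global__ void contended(unsigned int *counter)
{
    // Every lane of the warp targets the SAME address, so the 32
    // atomics serialize -- on GT200, roughly one round trip to
    // memory per lane in the worst case.
    atomicAdd(counter, 1u);
}

__global__ void uncontended(unsigned int *counters)
{
    // Each lane targets a different address, so there is no
    // serialization from address contention.
    atomicAdd(&counters[threadIdx.x], 1u);
}
```

On Fermi, the contended kernel is exactly the case the L2-based atomic path is claimed to accelerate, by serializing the updates in the cache rather than via repeated memory round trips.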
In Fermi, the L2 cache and extra atomic execution units are used to accelerate atomic operations. Fermi’s unified L2 cache serializes atomic operations in a different manner, to reduce the number of memory write-backs. In the case of full contention (32×32-bit accesses to the same address), the new atomic execution path will probably do all serialization prior to writing back to memory – so only one or two memory accesses are required. There is additional overhead from the actual operations, so Nvidia’s upper-end claim of a 20X speed-up likely refers only to the extreme case of reducing the memory accesses from 32 to 1, plus execution overhead. This in turn implies that executing 32 atomic operations (excluding the write-back) takes about the same latency as half a trip to memory, perhaps around 100-200ns.
In more realistic usage scenarios, the benefits from the new atomic operations will be substantially lower, and ultimately depend on the implementation of atomic operations, which was not disclosed. Nvidia claimed speedups of 5-20X; unfortunately it is hard to tease out the scenario corresponding to a 5X speed up – and thus to determine if it really applies to common atomic usages. It may have to do with the extra atomic units, which can be used in parallel for uncontended operations, or it may be that many cases of atomic operations proceed with the same latency as before – certainly shared memory atomics have stayed the same. Until the hardware is released and analyzed, it is just a guessing game.
For non-atomic operations, the L2 also acts to coalesce many different memory requests (e.g. from a whole warp) into fewer transactions, improving bandwidth utilization.
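Whether a warp's requests can actually be merged depends on the access pattern the programmer generates. A brief sketch of the two extremes (kernel names hypothetical):

```cuda
// Sketch: access patterns and coalescing.

__global__ void coalesced(float *out, const float *in)
{
    // Consecutive lanes touch consecutive 4B words, so the warp's
    // 32 requests fall on a handful of cache lines and merge into
    // a few full-line transactions.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

__global__ void strided(float *out, const float *in, int stride)
{
    // A large stride scatters the warp across many distinct lines;
    // the requests cannot be merged and effective bandwidth drops.
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```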