The changes in the execution core are important and will improve performance substantially. But even more important and exciting is the fact that Nvidia’s engineers and architects decided to use a semi-coherent L1D cache in each core’s memory pipeline, which enables implicit communication between threads in a block and caches a portion of the unified address space to reduce off-chip bandwidth requirements.
Fermi’s memory pipeline has a throughput of up to 16×32-bit or 8×64-bit accesses each cycle, to either the shared memory or the load/store unit (LSU), which accesses the L1D cache – but not both. The reasons will become evident shortly, but this creates a structural hazard that the scheduler must account for; for example, an ALU/memory instruction pairing where the ALU operation sources data from the shared memory is not allowed.
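The throughput figures imply a simple relationship between access width and the cycles a full 32-thread warp occupies the port. The sketch below models that arithmetic; the function name and interface are our own, not anything Nvidia documents.

```c
#include <assert.h>

/* Hypothetical model of the memory pipeline's throughput: the core can
 * service 16 32-bit or 8 64-bit accesses per cycle (figures from the text),
 * so a full 32-thread warp occupies the port for 2 or 4 cycles. */
unsigned warp_access_cycles(unsigned bytes_per_access)
{
    const unsigned warp_size = 32;
    unsigned accesses_per_cycle = (bytes_per_access == 8) ? 8 : 16;
    return warp_size / accesses_per_cycle; /* 2 for 32-bit, 4 for 64-bit */
}
```

The doubling from two to four cycles for 64-bit accesses is one reason double-precision data is more expensive to move, independent of the execution units.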
The shared memory in Fermi has undergone a radical transformation from the GT200. In Fermi, the shared memory and L1 data cache are used for explicit and implicit communication between threads, respectively. The two are implemented as a single physical structure: they split a single 16-banked, 64KB array. Simultaneous access would be possible, but only by replicating the input and output lines and increasing the banking even further. The array can be configured with 16KB or 48KB of shared memory, with the remainder used for the L1D cache.
For older workloads that were written and compiled for a 16KB shared memory (e.g. on GT200), it may be preferable to use a 48KB L1 cache, since a larger shared memory may go unused. The same holds for OpenCL, which only supports 16KB of shared memory at the moment. A larger cache is also beneficial for irregular and unpredictable workloads, especially those with greater degrees of indirection (e.g. complex data structures such as linked lists). DirectCompute, though, requires a larger shared memory space (up to 32KB) and hence needs the 48KB allocation. Fortunately the configuration is semi-dynamic. The cache and shared memory can be reconfigured if the entire core is quiesced, typically between two different applications, but in theory the configuration could be changed between kernels. It is rather unclear why there is an option for 48KB of shared memory; it seems like a waste. Neither DirectCompute nor OpenCL can use the entire 48KB – a 32KB/32KB split would seem more sensible, but presumably Nvidia’s architects had their reasons.
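The two configurations can be summarized as a simple split of the 64KB array. The sketch below models that choice; the enum, struct, and function are illustrative inventions (in CUDA the preference is expressed through the runtime API rather than computed like this).

```c
#include <assert.h>

/* Sketch of the two documented splits of the 64KB shared memory/L1D array.
 * The names here are our own; only the 16/48 and 48/16 splits come from
 * the text. */
enum cache_pref { PREFER_SHARED, PREFER_L1 };

struct smem_config { unsigned shared_kb; unsigned l1_kb; };

struct smem_config configure(enum cache_pref pref)
{
    struct smem_config c;
    c.shared_kb = (pref == PREFER_SHARED) ? 48 : 16;
    c.l1_kb     = 64 - c.shared_kb;   /* remainder of the 64KB array */
    return c;
}
```

Either way, the two structures always sum to the full 64KB array, which is why they cannot both be at their maximum size simultaneously.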
The shared memory has also been impacted by the additional pipeline and execution unit changes. In GT200, the entire 16KB, 16-banked shared memory was used by a single pipeline and could provide 16 operands to a warp. Now that the shared memory may be split between two pipelines, the capacity available to each pipeline may decrease (in a 16KB configuration), impacting occupancy, and the shared access will reduce the bandwidth available to each warp.
The memory pipeline, which starts with the semi-coherent L1D, has changed tremendously. First of all, it is no longer shared between several cores. The whole notion of a TPC is therefore gone, as it was largely an artifact of how the memory hierarchy differed from the control hierarchy in GT200 and G80. Figure 5 below shows the memory pipelines for Fermi and GT200.
Figure 5 – Memory Pipeline for Fermi and GT200
The memory pipeline starts with dedicated address generation units (AGUs). The AGUs have been modified to support the new (x,y) addressing mode, at the behest of OpenCL. This third addressing mode complements the existing register indirect and texture addressing. In Fermi, there is a single load and store instruction for almost all memory types; although texturing may require a different instruction since it is so different from other accesses.
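The new (x,y) mode amounts to a two-dimensional address calculation of the kind OpenCL images use. The sketch below shows the arithmetic an AGU would perform; the function name, the base/pitch parameterization, and the row-major layout are assumptions, since the text only says Fermi adds an (x,y) addressing mode.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative 2D address generation: the effective address of element
 * (x, y) in a pitched buffer is base + y * pitch + x * element size.
 * This is a sketch of the calculation, not Nvidia's AGU design. */
size_t agu_2d(size_t base, size_t pitch_bytes, size_t elem_bytes,
              size_t x, size_t y)
{
    return base + y * pitch_bytes + x * elem_bytes;
}
```

Register indirect addressing is the degenerate case where the pitch term drops out and only a base plus offset remains.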
Once the 40-bit virtual address has been calculated, it is translated to a 40-bit physical address by the TLBs. Fermi’s TLBs obviously support 4KB page sizes for interoperability reasons, but Nvidia would not disclose the largest supported page size. GT200 appears to support 512KB pages, and it is likely Fermi has the same capabilities.
The size and associativity of the TLBs were also not disclosed for Fermi, but each TLB only needs to cover the data accessed by a single core. The L1 TLB in GT200 is believed to be a 16-32 entry, fully associative design with 4KB and 512KB page support, and it is likely that Fermi’s TLBs are the same size or slightly larger. The TLBs also check privileges – pages can be marked as read-only (e.g. constants).
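A fully associative TLB of that size is conceptually just a small table searched on every translation. The sketch below models the lookup for 4KB pages under the entry count suggested above; the structure layout and names are our own, and real hardware searches all entries in parallel rather than in a loop.

```c
#include <assert.h>
#include <stdint.h>

#define TLB_ENTRIES 16   /* the text suggests a 16-32 entry design */
#define PAGE_BITS   12   /* 4KB pages */

struct tlb_entry { uint64_t vpn, pfn; int valid; };

/* Sketch of a fully associative lookup: compare the virtual page number
 * against every valid entry, and on a hit splice the physical frame
 * number onto the page offset. Returns -1 on a TLB miss. */
int64_t tlb_translate(const struct tlb_entry *tlb, uint64_t vaddr)
{
    uint64_t vpn = vaddr >> PAGE_BITS;
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (int64_t)((tlb[i].pfn << PAGE_BITS) |
                             (vaddr & ((1u << PAGE_BITS) - 1)));
    return -1; /* miss: the page tables must be walked */
}
```

Supporting 512KB pages as well would simply add a second page-offset width (19 bits) per entry, which is one reason mixed page sizes favor a fully associative design.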
Once addresses are translated, the cache is probed, with 16×32-bit or 8×64-bit accesses each cycle. The L1D cache is probably a write-back, write-allocate design with 64B lines and support for streaming some data directly back to main memory. The replacement policy is undisclosed, but is likely a pseudo-Least Recently Used (LRU) variant. As previously mentioned, the L1D is either 16KB or 48KB, with 16 banks for high throughput. The associativity of the L1D is unknown, but the organization implies that each way is 16KB or smaller. Given the highly associative caches in prior generations, it is likely that the L1D is effectively 16-way associative or better.
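With 64B lines and 16 banks of 32-bit words, the bank an access hits and its position within a line fall out of simple bit fields of the address. The helpers below show that decomposition; the line size and bank count come from the text, but the exact bit assignment is our assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative address decomposition for a 16-bank, 64B-line cache:
 * bits [1:0] select the byte in a 32-bit word, bits [5:2] the bank,
 * and bits [5:0] the byte within a line. The split is an assumption;
 * only the 64B line size and 16 banks are from the text. */
unsigned l1d_bank(uint64_t paddr)        { return (paddr >> 2) & 0xF; }
unsigned l1d_line_offset(uint64_t paddr) { return (unsigned)(paddr & 0x3F); }
```

Under this layout, sixteen consecutive 32-bit accesses from a warp land in sixteen different banks, which is exactly what sustains the 16×32-bit per-cycle throughput without conflicts.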
In the case of a miss, the access goes to the L2 cache, which is discussed (alongside the semi-coherency) in the next section.