Inside Fermi: Nvidia’s HPC Push


The Front End

The changes in the core (or SM) start at the front-end and cascade from there. Each Fermi core has an instruction cache with the standard read-only semantics, although every other detail of the L1I cache remains undisclosed for both the GT200 and Fermi. In Fermi, the instruction cache is shared between the two pipelines and is almost certainly a set-associative design. Figure 3 below shows the cores from Fermi and the GT200.


Figure 3 – Front-end for Fermi and GT200 Cores

Unfortunately, Nvidia is still not disclosing the instruction cache size, which would also hint at its purpose and functionality. If the goal is to cache most of a kernel (or shader) and exploit temporal locality, the cache size should be a reasonable fraction of a single kernel. However, if the point is merely to amplify bandwidth, that could be done with a very small cache, or perhaps by cleverly broadcasting an instruction fetch to multiple warp instruction queues. What is clear, though, is that the instruction caches changed substantially in Fermi, since each core can be executing from a different instruction stream.

Fermi introduces full predication for all instructions to improve instruction fetch by removing bubbles caused by taken branches. In the GT200, a divergence would result in a warp executing each control flow path in turn and branching between them. At each branch, the warp would stall until the branch could be resolved and the next address fetched. With predication, the warp can sequentially execute through all the divergent control flow paths, without branches, and simply mask off the unused vector lanes.
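As a rough illustration of the kind of code this helps (a hypothetical kernel, not an example from Nvidia), consider a short data-dependent branch inside a warp. With full predication, the compiler can emit both sides as predicated instructions and mask off the inactive lanes, rather than emitting taken branches that stall the warp at each divergence point:

// Hypothetical kernel: the inner if/else diverges within a warp whenever
// neighboring threads fall on opposite sides of the threshold.
__global__ void clamp_or_scale(float *data, float threshold, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // With full predication, both short paths can be emitted as
        // predicated instructions, so no lane waits on a branch target fetch.
        if (data[i] > threshold)
            data[i] = threshold;        // path A: clamp
        else
            data[i] = data[i] * 2.0f;   // path B: scale
    }
}

Whether a given branch is actually predicated is up to the compiler's heuristics; long divergent paths are still better served by real branches, since predication makes every lane pay for both sides.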

The next change from the GT200 is just after the cache. In Fermi, fetched instructions are delivered from the cache and deposited into two logical warp instruction buffers, one for each of the two pipelines. These buffers are likely implemented as a set of 48 queues, one for each warp. Each queue probably holds at least two entries, requiring 128 bits, as most raw instructions appear to be 64 bits [1]. If 96 entries were implemented, that would consume a total of 768B of SRAM. However, if instruction cache lines are 64B wide, it is more likely that each fetch brings in 4-8 instructions per warp, implying a larger buffer (roughly 1.5-3KB). Once instructions are deposited into the warp queues, they must be decoded so that they can be scheduled.
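The buffer estimates above fall out of simple arithmetic; the short calculation below spells it out, using the same assumptions (48 warp queues, 8-byte raw instructions, and an entries-per-warp count that is purely speculative):

// Back-of-the-envelope SRAM sizing for the warp instruction buffers,
// under the assumptions stated in the text. Compiles as plain C/C++.
#include <stdio.h>

int main(void)
{
    const int warps = 48;        // warp queues per core (assumed)
    const int instr_bytes = 8;   // 64-bit raw instructions [1]

    for (int entries = 2; entries <= 8; entries *= 2) {
        int bytes = warps * entries * instr_bytes;
        printf("%d entries per warp -> %d bytes of SRAM\n", entries, bytes);
    }
    return 0;   // prints 768, 1536 (1.5KB) and 3072 (3KB)
}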

Every cycle, the two schedulers can issue (or dispatch in Nvidia parlance) two warps from the head of these queues – one for each of the two pipelines. Again, each pipeline is still scalar, but there are now two for added throughput. Despite the notion that GPU cores are simpler than CPU cores, the schedulers have to tackle considerable complexity. One complication for the scheduler is the variety of computational instructions with different execution latency. Scheduling instructions with a uniform or near uniform latency is far simpler than trying to manage a pool of instructions where execution latency varies by a factor of 4.

The warps are scoreboarded to track the availability of operands and also to detect any structural hazards. The structural hazards in particular are vastly more complicated for Fermi than for the GT200. The shared execution resources present part of the challenge here – the memory pipeline and special function units (SFUs) are the obvious points of contention between the two pipelines, but 64-bit FP operations are also a subtle issue. Specifically, a double precision warp uses both pipelines simultaneously to execute. So the permissible warp combinations are memory/ALU, ALU/ALU, ALU/SFU and memory/SFU; double precision cannot be issued with anything else.
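A minimal sketch of that co-issue restriction, with invented names and an encoding that is purely illustrative (Nvidia has not described how the check is actually implemented), might look like this:

// Illustrative model of Fermi's dual-issue restriction: a double-precision
// warp occupies both pipelines, and the memory pipeline and SFUs are shared.
typedef enum { ALU, MEM, SFU, FP64 } WarpClass;

static int can_coissue(WarpClass a, WarpClass b)
{
    // A double-precision warp uses both pipelines, so it must issue alone.
    if (a == FP64 || b == FP64)
        return 0;
    // Two warps that both need the single shared memory pipeline or the
    // shared SFUs cannot go together either.
    if ((a == MEM && b == MEM) || (a == SFU && b == SFU))
        return 0;
    // Remaining pairings: memory/ALU, ALU/ALU, ALU/SFU and memory/SFU.
    return 1;
}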

The schedulers will issue the highest priority warps that are ready to execute, and then mark certain queues as ‘not ready’ based on the expected latency of the issued instruction; the schedulers skip over those queues until they become ready again. As with the current generation, there is no penalty for switching between different queues, and it is likely that the same queue could continue to issue as well. Priority can be determined by many factors, such as shader type (vertex, geometry, pixel, compute), register usage, etc.
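To make the select-and-skip behavior concrete, here is a minimal sketch assuming a per-warp countdown set from the issued instruction's expected latency; the data structure and field names are invented for illustration:

// Hypothetical per-warp scheduler state: a warp may issue only when its
// countdown has expired; the countdown is reloaded from the expected
// latency of whatever instruction it just issued.
#define NUM_WARPS 48

struct WarpQueue {
    int valid;       // queue holds a decoded instruction
    int not_ready;   // cycles until this warp may issue again
    int priority;    // derived from shader type, register usage, etc.
};

static int pick_warp(struct WarpQueue q[NUM_WARPS])
{
    int best = -1;
    for (int w = 0; w < NUM_WARPS; w++) {
        if (q[w].not_ready > 0) {
            q[w].not_ready--;       // count down toward readiness
            continue;               // skip queues marked 'not ready'
        }
        if (q[w].valid && (best < 0 || q[w].priority > q[best].priority))
            best = w;               // highest-priority ready warp wins
    }
    return best;                    // -1 if nothing can issue this cycle
}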

Another big change for programmers and the scheduler is the relative execution latency. Since each pipeline has 16 execution units, a simple warp now takes only 2 fast cycles to finish (or one scheduler cycle), compared to 4 fast cycles across the GT200's 8 execution units. Because each warp now occupies the pipeline for half as many cycles, hiding a fixed amount of memory latency will take twice as many warps as it did before.
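A quick worked example makes the scaling visible; the 400-cycle memory latency below is a hypothetical round number, not a measured figure, and the model assumes only one outstanding instruction per warp:

// Hypothetical latency-hiding arithmetic: fewer busy cycles per warp means
// more warps are needed to cover the same memory stall.
#include <stdio.h>

int main(void)
{
    const int mem_latency = 400;   // fast cycles to cover (assumed)
    const int warp_size   = 32;
    const int gt200_lanes = 8;     // GT200: 8 execution units per core
    const int fermi_lanes = 16;    // Fermi: 16 execution units per pipeline

    int gt200_warps = mem_latency / (warp_size / gt200_lanes);  // 400/4 = 100
    int fermi_warps = mem_latency / (warp_size / fermi_lanes);  // 400/2 = 200

    printf("warps needed: GT200 ~%d, Fermi ~%d\n", gt200_warps, fermi_warps);
    return 0;
}

In practice a core holds far fewer warps than that, so independent instructions within each warp also contribute to covering the latency; the point is simply that the per-warp contribution has halved.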

It is unclear whether the warp buffers and schedulers are actually implemented as a unified or a split entity. A split scheduler would probably be more efficient, since each scheduler would only have to evaluate 24 of the 48 potential warps; the tricky part would be communicating when one scheduler has reserved a shared resource such as the memory pipeline or SFUs.


