Inside Fermi: Nvidia’s HPC Push


As Figure 3 on the prior page showed, the already massive storage resources for each core have been doubled to support the second pipeline. Fermi’s register file increased to 32K entries, or 128KB total, although the number of registers per execution unit dropped by half, to 1K entries, versus the current generation. In the GT200, a single thread can be allocated 4-128 registers – although it’s unclear whether Fermi offers the same range. As with previous implementations, 64-bit integer or floating point values will consume two registers. The register file is designed to sustain full throughput on 32-bit operands every cycle, which in the case of a multiply-add requires 96 input values and 32 output values: three inputs and one output for each of the 32 execution units.
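The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not a description of the hardware: it assumes each register entry is 32 bits wide, and it takes the count of 32 execution units per core from the ratio of 32K entries to 1K entries per unit. All names are made up for illustration.

```python
# Back-of-the-envelope check of the register file figures quoted in the text.
ENTRIES = 32 * 1024      # 32K register entries per core (from the text)
BYTES_PER_ENTRY = 4      # assumption: one entry holds one 32-bit value
EXEC_UNITS = 32          # assumption: implied by 32K entries / 1K per unit
FMA_INPUTS = 3           # a multiply-add reads three operands (a * b + c)
FMA_OUTPUTS = 1          # and writes one result

total_kb = ENTRIES * BYTES_PER_ENTRY // 1024
regs_per_unit = ENTRIES // EXEC_UNITS
reads_per_cycle = EXEC_UNITS * FMA_INPUTS
writes_per_cycle = EXEC_UNITS * FMA_OUTPUTS


def registers_for(values, width_bits=32):
    """Registers consumed by `values` operands; 64-bit values take two each."""
    return values * (width_bits // 32)


print(total_kb, regs_per_unit, reads_per_cycle, writes_per_cycle)
```

Under these assumptions the numbers line up with the article: 128KB of storage, 1K registers per execution unit, and 96 reads plus 32 writes per cycle for a full-rate multiply-add.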

Operand Collectors

One of the more puzzling aspects of both past and current Nvidia hardware is the read and writeback for long latency functional units, such as the SFU and memory pipeline.

In the GT200, an SFU warp can be issued, followed by an FMAD warp; since the SFU instruction takes a while to execute, the two will be executing simultaneously. Perhaps more importantly, the two warps appear to need simultaneous input – which raises the question: how can both be fed at once?

The answer, a piece of hardware called the operand collector, came to light in our discussions regarding Fermi. The operand collectors are hardware units which read in many register values and buffer them for later consumption, a sort of temporal register cache for the functional units (interestingly enough, the EV8 had something similar). They probably provide other benefits as well – perhaps broadcasting values across the functional units. So the operand collector could read out all 64 operands needed for an SFU operation, and then feed them to the SFU over the next 16 cycles. This radically simplifies scheduling by enabling inputs to be gathered at once, even for high latency instructions.
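To make the gather-then-drain idea concrete, here is a toy software model of an operand collector. It is a sketch of the behavior described above, not of Nvidia's implementation: the class name, the capacity of 64, and the drain rate of 4 operands per cycle (64 operands spread over 16 cycles) are assumptions chosen to match the SFU example.

```python
from collections import deque


class OperandCollector:
    """Toy model: gather every operand for one warp instruction in a single
    burst, then hand them to a functional unit a few at a time."""

    def __init__(self, capacity=64):   # hypothetical size, from the SFU example
        self.capacity = capacity
        self.buffer = deque()

    def collect(self, register_file, indices):
        """Read all requested registers at once and buffer the values."""
        indices = list(indices)
        if len(indices) > self.capacity:
            raise ValueError("instruction needs more operands than the collector holds")
        if self.buffer:
            raise RuntimeError("collector is still draining a previous instruction")
        self.buffer.extend(register_file[i] for i in indices)

    def drain(self, per_cycle):
        """Yield one batch of operands per cycle until the buffer is empty."""
        while self.buffer:
            count = min(per_cycle, len(self.buffer))
            yield [self.buffer.popleft() for _ in range(count)]


register_file = list(range(1000, 1128))   # 128 dummy register values
collector = OperandCollector(capacity=64)
collector.collect(register_file, range(64))

# From here on the register file is no longer involved, so in this model it
# would be free to serve another warp (e.g. an FMAD) while the SFU is fed.
cycles = list(collector.drain(per_cycle=4))
print(len(cycles), cycles[0])
```

The point of the model is the decoupling: the register file is touched once, up front, and the slow functional unit consumes from the buffer at its own pace.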

Additionally, the operand collectors can be wired with a crossbar to enable many low port count (e.g. 1R+1W or 1R/W) register files to appear as if they were a single, highly ported register file. While the operand collector most obviously works with the register file, it may also be able to collect operands from other sources such as shared memory and constant or texture caches. The Fermi, GT200 and G80 generations probably all include operand collectors, and the minimum sizes are likely 64 or 96 operands.
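A small simulation shows why a collector in front of several single-ported banks can look like one highly ported register file, and where that illusion breaks down. This is an illustrative sketch only: the modulo mapping of registers to banks and the function name are assumptions, not details of Nvidia's design.

```python
from collections import defaultdict


def schedule_reads(register_ids, num_banks):
    """Return a list of cycles; each cycle lists the registers read in that
    cycle, with at most one read per single-ported bank."""
    queues = defaultdict(list)
    for reg in register_ids:
        # Assumption: registers are interleaved across banks by simple modulo.
        queues[reg % num_banks].append(reg)

    cycles = []
    while any(queues.values()):
        this_cycle = []
        for bank in sorted(queues):
            if queues[bank]:
                this_cycle.append(queues[bank].pop(0))   # one read per bank
        cycles.append(this_cycle)
    return cycles


# Four operands in four different banks: all delivered in a single cycle.
spread = schedule_reads([0, 1, 2, 3], num_banks=4)

# Three operands in bank 0: the collector needs three cycles to gather them.
conflict = schedule_reads([0, 4, 8, 1], num_banks=4)

print(len(spread), len(conflict))
```

When operands are spread across banks, the collector fills in one cycle; when they collide in one bank, it simply takes longer to fill. Either way the functional unit sees a complete set of operands, which is what lets the banks stay simple.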

Results Queue

The same potential conflicts that occur for register reads, outlined above, also apply to writes into the register file. The dual to the operand collector is the results queue, which buffers the output from functional units before it is written back into the register file. To be effective, the results queue should be at least the size of a warp – 32 operands.

Although not necessarily present in Nvidia hardware, the results queue creates an opportunity for additional optimization. Specifically, not all operations need to write back into the register file – some operations only exist to produce an intermediate result which is consumed by another operation. In that situation, if the two can be dynamically scheduled close enough, it’s conceivable that the output value could be forwarded to the input of the next operation – just like a CPU’s forwarding network, except within a vector lane.
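The forwarding idea can be illustrated with a toy model of a single lane. As the text says, this optimization is speculative, so the sketch below is purely hypothetical: the class, its methods, and the `last_use` hint are all invented for illustration and do not describe any known Nvidia mechanism.

```python
from collections import OrderedDict


class ResultsQueue:
    """Toy single-lane model: buffer results before writeback, and let a
    consumer pick a value straight out of the queue (forwarding)."""

    def __init__(self, capacity=32):          # at least one warp, per the text
        self.capacity = capacity
        self.pending = OrderedDict()          # dest register -> value, oldest first
        self.writebacks = 0                   # register file writes performed

    def push(self, dest, value, register_file):
        """Buffer a result; retire the oldest entry if the queue is full."""
        if dest not in self.pending and len(self.pending) >= self.capacity:
            self.retire_oldest(register_file)
        self.pending[dest] = value
        self.pending.move_to_end(dest)

    def retire_oldest(self, register_file):
        """Write the oldest buffered result back into the register file."""
        dest, value = self.pending.popitem(last=False)
        register_file[dest] = value
        self.writebacks += 1

    def read(self, src, register_file, last_use=False):
        """Forward from the queue if possible, else read the register file.
        With last_use=True (hypothetical hint) the value is dropped after
        forwarding and never written back."""
        if src in self.pending:
            value = self.pending[src]
            if last_use:
                del self.pending[src]
            return value
        return register_file[src]


regs = {0: 2.0, 1: 3.0, 2: 0.0, 3: 0.0}
queue = ResultsQueue(capacity=32)

queue.push(2, regs[0] * regs[1], regs)                    # r2 = r0 * r1 (intermediate)
total = queue.read(2, regs, last_use=True) + regs[0]      # r2 forwarded, then dropped
queue.push(3, total, regs)                                # r3 = r2 + r0

while queue.pending:                                      # flush remaining results
    queue.retire_oldest(regs)

print(regs, queue.writebacks)
```

In this run the intermediate value in r2 never reaches the register file; only the final result in r3 is written back, so two operations cost a single register file write.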
