By: Philip Taylor (philip.delete@this.zaynar.co.uk), August 2, 2016 4:46 pm
Room: Moderated Discussions
Philip Taylor (philip.delete@this.zaynar.co.uk) on August 1, 2016 7:33 pm wrote:
> [...]
Hmm, my numbers were quite bogus. I tested a bit more and these might be slightly less wrong:
The framebuffer is split into four partitions, one containing 4/13 of all the tiles (where each tile is 16x16 pixels, and contains sub-tiles of 4x8 pixels), the others containing 3/13 each, in a sort of interleaved diagonal stripe pattern. (This presumably comes from the GTX 970 having 13 SMMs, which are grouped into 4 GPCs.)
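To make the 4/13 vs 3/13 split concrete, here is a rough sketch of one mapping that would produce those shares. The diagonal formula `(tx + ty) % 13` and the `SLOT_TO_PARTITION` table are purely my assumptions for illustration; the real hardware pattern is only "sort of" diagonal and is probably not this.

```python
from collections import Counter

TILE = 16      # tile edge in pixels (from the post)
NUM_SLOTS = 13 # one slot per SMM on a GTX 970 (from the post)

# Hypothetical assignment of the 13 slots to 4 partitions: 4 + 3 + 3 + 3.
SLOT_TO_PARTITION = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0]

def partition_of_pixel(x, y):
    """Return the partition owning the tile that contains pixel (x, y)."""
    tx, ty = x // TILE, y // TILE
    slot = (tx + ty) % NUM_SLOTS  # assumed diagonal striping, not measured
    return SLOT_TO_PARTITION[slot]

def partition_shares(width, height):
    """Fraction of whole tiles owned by each partition in a width x height framebuffer."""
    counts = Counter()
    for ty in range(height // TILE):
        for tx in range(width // TILE):
            counts[partition_of_pixel(tx * TILE, ty * TILE)] += 1
    total = sum(counts.values())
    return {p: counts[p] / total for p in sorted(counts)}
```

On a framebuffer whose width in tiles is a multiple of 13, `partition_shares` gives partition 0 exactly 4/13 of the tiles and the other three 3/13 each; for other sizes the shares are only approximately that.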
In each partition, it rasterises the sub-tiles within one ~256x512 pixel region of the framebuffer (or a smaller region when pixels have more bytes). It draws the first primitive across all the sub-tiles within that region in that partition, then the second primitive, etc. Then it moves on to the next large framebuffer region and loops through all the primitives again. (And it can cache the tiles in partition-local memory for the duration of a framebuffer region, which only needs ~0.5MB of tile cache in total.)
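A quick sanity check of the ~0.5MB figure and of the loop order described above, as a sketch. The 4 bytes per pixel is my assumption (e.g. an RGBA8 target); `region_bytes` and `draw_order` are made-up helper names, not anything from the driver or hardware.

```python
REGION_W, REGION_H = 256, 512  # approximate region size from the post

def region_bytes(bytes_per_pixel, w=REGION_W, h=REGION_H):
    """Bytes needed to hold every pixel of one framebuffer region."""
    return w * h * bytes_per_pixel

def draw_order(regions, primitives):
    """Yield (region, primitive) pairs region-major:
    every primitive is replayed for a region before moving to the next region."""
    for region in regions:
        for prim in primitives:
            yield region, prim

# Assuming 4 bytes per pixel, one region is 256 * 512 * 4 = 524288 bytes = 0.5 MiB.
CACHE_BYTES_AT_4BPP = region_bytes(4)
```

If the cache budget is fixed, then at 8 bytes per pixel the same 0.5 MiB only covers half as many pixels, which would fit the observation that the region shrinks when pixels have more bytes.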
Each partition progresses independently and concurrently. Within each partition, it rasterises roughly 256 sub-tiles concurrently (~20% of the region), then waits for all those sub-tiles to complete before moving on to the next group of sub-tiles in that region. (Or something roughly like that, but not exactly.)
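The ~20% figure can be roughly reproduced from the other numbers, assuming a partition owns the same share of sub-tiles inside a region as it does of the whole framebuffer (my assumption, not something measured). `batch_fraction` is a made-up helper name.

```python
from fractions import Fraction

REGION_PIXELS = 256 * 512   # approximate region size from the post
SUBTILE_PIXELS = 4 * 8      # sub-tile size from the post
BATCH = 256                 # sub-tiles rasterised concurrently (rough figure)

def batch_fraction(share):
    """Fraction of a partition's sub-tiles in one region covered by one batch.

    share: the partition's share of all tiles, e.g. Fraction(4, 13).
    """
    subtiles_in_region = REGION_PIXELS // SUBTILE_PIXELS  # 4096
    per_partition = subtiles_in_region * share
    return float(BATCH / per_partition)

big = batch_fraction(Fraction(4, 13))    # the 4/13 partition
small = batch_fraction(Fraction(3, 13))  # one of the 3/13 partitions
```

Under that assumption a batch of 256 is about 20% of the region's sub-tiles for the 4/13 partition and about 27% for the 3/13 ones, so "~20%" is in the right ballpark.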
It still seems to flush after something on the order of 64KB of vertex-shaded-primitive buffer, and after each draw call.
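As a sketch of what that flushing rule would mean for batching: primitives accumulate until either the buffer holds about 64KB or the draw call ends. The per-primitive size is a hypothetical parameter (it depends on how many vertex attributes are passed along), and `split_into_batches` is a made-up name.

```python
FLUSH_BYTES = 64 * 1024  # order-of-magnitude figure from the post

def split_into_batches(draw_calls, bytes_per_primitive):
    """Split draw calls into batches that are flushed separately.

    draw_calls: list of primitive counts, one entry per draw call.
    bytes_per_primitive: assumed buffer cost of one vertex-shaded primitive.
    Returns the number of primitives in each batch, in submission order.
    """
    per_batch = max(1, FLUSH_BYTES // bytes_per_primitive)
    batches = []
    for prim_count in draw_calls:
        remaining = prim_count
        while remaining > 0:
            n = min(remaining, per_batch)
            batches.append(n)
            remaining -= n  # a new draw call always starts a new batch
    return batches
```

For example, with an assumed 64 bytes per primitive, `split_into_batches([3000, 10], 64)` gives batches of 1024, 1024, 952 and 10 primitives, each of which would be replayed across every framebuffer region before the next batch starts.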