By: steve m (steve.marton.delete@this.gmail.com), May 10, 2017 4:16 pm
Room: Moderated Discussions
David, have you re-assessed your statement that Nvidia uses a tile cache? Because that assessment is most likely false.

From your video it's clear that a single tile is not finalized before subsequent tiles are written to.
If you had a cache the size of a "tile", you would lose most of the gains if you flushed it before it was finished just to start rendering the next tile, and then the next, flushed those halfway too, and then came back to finish the first tile. You're only utilizing your cache properly if you finish all of the work in it before you flush it. There's no indication of such a strategy here. Do you disagree?
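To put rough numbers on that argument, here's a toy model of the write-back traffic. The 64 KiB tile size, the tile count, and the pass count are numbers I made up for illustration, not anything measured on real hardware:

```cpp
// Toy model of render-target traffic for a tile-sized on-chip buffer.
// Assumption (mine, not from any GPU spec): a tile is 64 KiB and each
// flush writes the whole tile back to DRAM.
#include <cstdio>

int main() {
    const long long tile_bytes = 64 * 1024;  // hypothetical tile buffer size
    const long long tiles      = 16;         // tiles touched by the workload
    const long long passes     = 4;          // how many times each tile is revisited

    // Strategy A: finish all work in a tile before flushing it once.
    long long finish_then_flush = tiles * tile_bytes;

    // Strategy B: flush each tile partway through, move on to other tiles,
    // and come back later. Every visit writes the tile back, and every
    // revisit has to read it in again first.
    long long flush_halfway = tiles * (passes * tile_bytes            // writes
                                       + (passes - 1) * tile_bytes);  // re-reads

    printf("finish-then-flush : %lld bytes\n", finish_then_flush);
    printf("flush-halfway     : %lld bytes\n", flush_halfway);
}
```

Even with made-up numbers, revisiting half-flushed tiles multiplies the render-target traffic, which is exactly the gain a tile cache is supposed to eliminate.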
What you are actually seeing is simply smart pixel shader dispatch order.
It seems clear to me that the coarse rasterizer can see more than one triangle at a time, and that it has a fairly large output buffer. So it can buffer up several triangles per coarse tile and dispatch those out of order to the fine rasterizer, which then dispatches the pixel shader wavefronts to the shader cores. But that's all it is. And you can find screenshots of AMD GCN GPUs with similar patterns showing definite out-of-order dispatch.
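For what it's worth, here's a minimal sketch of that dispatch-ordering idea. The struct names, the look-ahead window, and the tile granularity are all my own invention for illustration, not a description of Nvidia's actual pipeline:

```cpp
// Sketch: a "coarse rasterizer" that looks ahead over a window of triangles,
// groups them by coarse tile, and hands each group on as a batch. The names
// and sizes here are hypothetical.
#include <cstdio>
#include <map>
#include <vector>

struct Triangle { int id; int coarse_tile; };  // which coarse tile it overlaps

// Group a window of submitted triangles by the coarse tile they touch.
std::map<int, std::vector<Triangle>> bucket_by_tile(const std::vector<Triangle>& window) {
    std::map<int, std::vector<Triangle>> buckets;
    for (const Triangle& t : window) buckets[t.coarse_tile].push_back(t);
    return buckets;
}

int main() {
    // API submission order: triangles bounce back and forth between tiles 0 and 1.
    std::vector<Triangle> window = {{0, 0}, {1, 1}, {2, 0}, {3, 1}, {4, 0}};

    // Dispatch to the "fine rasterizer" one bucket at a time: out of order with
    // respect to submission, but coherent per coarse tile.
    for (const auto& [tile, tris] : bucket_by_tile(window)) {
        printf("coarse tile %d:", tile);
        for (const Triangle& t : tris) printf(" tri %d", t.id);
        printf("\n");
    }
}
```

The point is that grouping a window of triangles by coarse tile already produces the "tiled-looking" out-of-order pattern in the video, without any tile-sized framebuffer being finalized and flushed.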
The GPU can use different strategies for shader dispatch order based on the numerous resources it has to juggle (attributes, interpolants, VGPR pressure limiting waves in flight, etc.), and based on the tiled memory layout of the render target (we all know render targets are not stored linearly in memory, again to optimize spatial coherence with respect to triangles on screen, especially small triangles).
The GPU presumably dispatches in an order that maximizes cache coherence, coherence with the render target tiling, and maybe coherence in its various internal buffers (not "tile" buffers). Some sort of spatial coherence is implied by all of that.
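On the render target layout point: one common way render targets end up non-linear in memory is Morton (Z-order) swizzling, which keeps nearby pixels close together in address space. This is a generic illustration of that idea, not any specific GPU's actual swizzle pattern:

```cpp
// Morton (Z-order) swizzling: interleave the x and y bits of a pixel
// coordinate so that spatially nearby pixels get nearby addresses.
#include <cstdint>
#include <cstdio>

// Interleave the low 16 bits of x and y into a Morton index.
uint32_t morton2d(uint32_t x, uint32_t y) {
    auto spread = [](uint32_t v) {
        v &= 0xFFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    };
    return spread(x) | (spread(y) << 1);
}

int main() {
    // A 2x2 pixel quad lands in 4 consecutive offsets, unlike a linear
    // (row-major) layout where the two rows would be a full pitch apart.
    for (uint32_t y = 0; y < 2; ++y)
        for (uint32_t x = 0; x < 2; ++x)
            printf("pixel (%u,%u) -> offset %u\n", x, y, morton2d(x, y));
}
```

With a layout like that, shading nearby triangles back to back hits the same cache lines even without any "tile" buffer being involved.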
This test would be far more interesting with real-world data, i.e. finely tessellated meshes. In that case I think there would be little difference between the behavior of different GPUs, showing that there's no huge gain either way.
Not to mention multiple draw calls, which are, for all intents and purposes, serialized (though we know this is not strictly true if we look at a shader thread trace).
However, your lofty claim has infected the internet, to the point that Wikipedia now quotes you on the Maxwell page and links to the page on tile-based deferred renderers, which Maxwell obviously is not. You did not make clear in your article that this has nothing to do with TBDR, and you didn't even mention in your video that if the triangles came from separate draw calls, there would be no such out-of-order dispatch. When people hear "tiled" they think TBDR, i.e. PowerVR. People read a lot into these kinds of claims, so you could be more careful with your statements.
