By: Peter McGuinness (peter.mcguinness.delete@this.gobrach.com), August 2, 2016 9:41 pm
Room: Moderated Discussions
....
>The demo doesn't show that
> they are in on-chip buffers, it just shows the sequence that pixel shaders are invoked in - but
> it shows a sequence that is primarily grouped by location rather than by triangle index, which only
> makes sense if it was designed to benefit from some kind of large tile cache.
Not only that: it also makes sense to group shader tasks into spatially local regions within a triangle before scheduling them into a wavefront. This dramatically improves SIMD occupancy even for a scalar machine and hence improves rendering efficiency - all GPUs do this. It only looks like regular tiling because the triangles are all huge and lie directly on top of each other; scatter randomly sized triangles around the screen at random locations and you will soon see the apparent order break down. In fact, you only need to shift each triangle by one pixel to see this start to happen. The point is that the 'tiling' is relative to the triangles, not to the screen, and the term 'tile based' is universally accepted to refer to a tiled screen.
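To make the one-pixel-shift point concrete, here is a toy Python model. It assumes pixel blocks are anchored at the corner of the triangle's bounding box; that anchoring, the 8x8 block size, the 32x32 triangle and every function name are my own assumptions for illustration, not a description of any real hardware. It counts how many screen-aligned tiles a single triangle-relative block touches.

```python
def covered_pixels(x0, y0, w, h):
    """Pixels of a right triangle whose right angle sits at (x0, y0)."""
    return [(x0 + dx, y0 + dy)
            for dy in range(h)
            for dx in range(w)
            if dx * h + dy * w < w * h]  # strictly inside the hypotenuse


def triangle_relative_blocks(pixels, block=8):
    """Group pixels into blocks anchored at the triangle's own origin (assumed)."""
    ox = min(x for x, _ in pixels)
    oy = min(y for _, y in pixels)
    groups = {}
    for x, y in pixels:
        key = ((x - ox) // block, (y - oy) // block)
        groups.setdefault(key, []).append((x, y))
    return groups


def screen_tiles_touched(group, tile=8):
    """Screen-aligned tiles that a group of pixels falls into."""
    return {(x // tile, y // tile) for x, y in group}


def max_tiles_per_block(x0, y0):
    """Worst case: most screen tiles touched by any one triangle-relative block."""
    groups = triangle_relative_blocks(covered_pixels(x0, y0, 32, 32))
    return max(len(screen_tiles_touched(g)) for g in groups.values())


aligned = max_tiles_per_block(0, 0)   # triangle origin on a tile corner
shifted = max_tiles_per_block(1, 1)   # same triangle moved by one pixel
```

With the triangle origin on a tile corner, every block coincides with one screen tile, so the grouping is indistinguishable from screen tiling. Move the triangle by one pixel in x and y and a single block straddles four screen tiles.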
> In all cases, the deferring is done to support reordering of triangles to be more cache-friendly, which implies
> grouping them into tiles.
In the context of GPU hardware, deferring never refers to reordering triangles; it always refers to deferring pixel shading until visibility determination is complete (software deferred rendering is a whole different thing). In any case, OpenGL and DX rules forbid the reordering of triangles, and it is up to the application to group geometry to be cache friendly. Tilers do collect all draw calls and store intermediate screen-space display lists for later rasterisation, which is a sort of reordering but is more properly referred to as a screen-space sort. However, the NVIDIA machine doesn't do any of that. All it needs to do is to locally store the projected triangle, fetch its state into cache and locally generate pixel shading tasks directly out of the back end of its texturing engine. Especially for a draw load as massively imbalanced as this one, a single triangle can generate a huge number of shading tasks from a tiny amount of state. There is essentially no stress on the cache in this test.
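A rough Python sketch of the distinction, under heavy simplification: each triangle is reduced to an inclusive screen-space bounding box, and "shading" is just recording a (triangle index, tile) event. The function names, the 16-pixel tile size and the bounding-box stand-in are all hypothetical and chosen only to show the ordering, not how any real GPU is built.

```python
def bbox_tiles(tri, tile=16):
    """Screen tiles overlapped by a triangle given as (x0, y0, x1, y1), inclusive."""
    x0, y0, x1, y1 = tri
    return [(tx, ty)
            for ty in range(y0 // tile, y1 // tile + 1)
            for tx in range(x0 // tile, x1 // tile + 1)]


def immediate_mode(tris):
    """Work is generated triangle by triangle, strictly in submission order."""
    return [(i, t) for i, tri in enumerate(tris) for t in bbox_tiles(tri)]


def tile_based(tris):
    """Bin every triangle into per-tile lists first, then walk the screen tile by tile."""
    bins = {}
    for i, tri in enumerate(tris):
        for t in bbox_tiles(tri):
            bins.setdefault(t, []).append(i)  # submission order kept within a tile
    return [(i, t) for t in sorted(bins) for i in bins[t]]


tris = [(0, 0, 31, 15), (0, 0, 15, 15)]  # two overlapping "triangles"
imr_order = immediate_mode(tris)
tbr_order = tile_based(tris)
```

In the immediate-mode order, triangle 0 finishes everywhere before triangle 1 starts. In the tile-based order, tile (0, 0) sees both triangles before tile (1, 0) is touched at all, yet within each tile the triangles still appear in submission order, which is why it is better described as a screen-space sort than as a reordering of triangles.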
>
> Maybe call it a "rasterizer triangle reorderer and tile cache"
> plus some adjectives like "large" or "new" or "improved".
As I said, reordering is not allowed and there is no tile cache. We should just call it what it is: an immediate mode renderer.