By: Philip Taylor (philip.delete@this.zaynar.co.uk), August 1, 2016 6:33 pm
Room: Moderated Discussions
Rob Clark (robdclark.delete@this.gmail.com) on August 1, 2016 5:56 pm wrote:
> [...]
>
> But there are gains to be had for IMR's with clever thread scheduling like this.
> So it is still neat. And a lot less driver trickery needed (at least for ogl..
> vulkan looks friendlier for tilers in that regard, at least if used properly).
Yep, as far as I can tell (based on some research and a lot of wild speculation) it's essentially changing from the traditional IMR rasterisation process of:
  for each draw:
    for each primitive:
      for each tile:
        fetch framebuffer tile
        for each pixel:
          execute pixel shader
          update framebuffer tile
        store framebuffer tile
to something more like:
  for each draw:
    for each batch of primitives:
      for each tile:
        fetch framebuffer tile
        for each primitive:
          for each pixel:
            execute pixel shader
            update framebuffer tile
        store framebuffer tile
which benefits by improving the spatial locality of framebuffer updates (so they can be cached closer to the shader cores), and looks like it wouldn't have a major impact outside the rasteriser.
It should help when you've got a single dense mesh (many tiny primitives per tile), but probably won't really help at all with overdraw or with low-detail meshes (fewer than 1 primitive per tile per draw call). (So, it's not very useful for mobile GPUs that spend nearly all their life rendering UIs made of way too many alpha-blended layers.)
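To make the difference between the two loop orders concrete, here is a toy model that just counts framebuffer tile fetches. It is only a sketch of the speculation above, not of any real hardware: the function names, the representation of a primitive as a tuple of tile ids, and the batch size of 128 are all made up for illustration, and it assumes a batch never spans more than one draw call.

```python
def imr_tile_fetches(draws):
    """Per-primitive loop order: every primitive fetches each tile it touches."""
    fetches = 0
    for draw in draws:
        for prim_tiles in draw:
            fetches += len(set(prim_tiles))
    return fetches


def batched_tile_fetches(draws, batch_size):
    """Batched loop order: each tile is fetched once per batch that touches it."""
    fetches = 0
    for draw in draws:
        # Assumption: batches are cut within a draw and never span draws.
        for start in range(0, len(draw), batch_size):
            touched = set()
            for prim_tiles in draw[start:start + batch_size]:
                touched.update(prim_tiles)
            fetches += len(touched)
    return fetches


# Dense mesh: one draw, 4 tiles, 64 tiny primitives per tile,
# submitted tile by tile (good spatial locality).
dense = [[(tile,) for tile in range(4) for _ in range(64)]]

# Low-detail layers: 64 draws, each a single primitive covering all 4 tiles.
layers = [[(0, 1, 2, 3)] for _ in range(64)]

print(imr_tile_fetches(dense), batched_tile_fetches(dense, 128))    # 256 4
print(imr_tile_fetches(layers), batched_tile_fetches(layers, 128))  # 256 256
```

In this model the dense mesh drops from 256 tile fetches to 4, while the stack of full-screen layers stays at 256 either way, because each draw only contributes one primitive per tile and there is nothing to batch.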
> Interesting about the amount of geometry it can accumulate for executing frag shader stage
> OoO.. given that a lot of games have a lot of geom. I guess that must include varying
> data, which would cut down on # of primitives somewhat? Otherwise they would need to split
> out separate binning shader (which from what I know about nouveau, they do not)
I think they do store the varyings. Dense meshes should already be constructed with good spatial locality between primitives, so the primitive batching buffer only needs to be large enough to collect enough spatially-local primitives to fill up a few tiles; it doesn't need to contain an entire draw call. On my device it looks a bit like it's processing 8 tiles concurrently in each of the 13 SMs. If each primitive is ~1 quad (2x2 px) in size then it needs to buffer ~512 primitives per SM to cover all its tiles, so a 64KB buffer should be enough for ~8 vec4 varyings per vertex (ignoring the other per-primitive state), which doesn't sound off by too many orders of magnitude.
(I'm sure my numbers (and possibly my theories) are completely bogus but hopefully they're sort of pointing in a plausible direction.)
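For anyone who wants to check the back-of-the-envelope sums, they can be reproduced roughly like this. The 16x16 px tile size is an assumption picked here because it gives the ~512 figure; the post doesn't state a tile size. The sketch also treats each buffered primitive as carrying about one vertex's worth of varyings and assumes 32-bit float components, neither of which is confirmed.

```python
TILES_PER_SM = 8            # from the post: 8 tiles in flight per SM
TILE_SIDE_PX = 16           # ASSUMPTION: 16x16 px tiles (not stated in the post)
QUAD_PX = 2 * 2             # from the post: each primitive covers ~1 quad (2x2 px)
BUFFER_BYTES = 64 * 1024    # from the post: 64KB buffer
VEC4_BYTES = 4 * 4          # ASSUMPTION: four 32-bit floats per vec4

# Quad-sized primitives needed to cover one tile, then all tiles on one SM.
prims_per_tile = (TILE_SIDE_PX * TILE_SIDE_PX) // QUAD_PX
prims_per_sm = TILES_PER_SM * prims_per_tile

# Buffer space available per primitive, expressed in bytes and in vec4s.
bytes_per_prim = BUFFER_BYTES // prims_per_sm
vec4_per_prim = bytes_per_prim // VEC4_BYTES

print(prims_per_sm, bytes_per_prim, vec4_per_prim)  # 512 128 8
```

With a different tile size the primitive count changes quadratically (32x32 px tiles would need 2048 quad-sized primitives per SM and leave room for only 2 vec4s each), so the result is very sensitive to that one guess.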