By: Peter McGuinness (peter.mcguinness.delete@this.gobrach.com), August 4, 2016 12:34 pm
Room: Moderated Discussions
Good work, that is starting to look really interesting (although it is not what I was getting at; more on that later).
Your rendering order comments are mostly correct: the spirit of the OGL restriction is the same, in that the order only affects overlapping triangles, so my comment was sloppy. In my defence, I will say that we were dealing with a set of completely overlapping triangles, so the 'same sample point' condition always applied. Also, while nothing in principle prevents processing non-overlapping triangles in arbitrary order, in practice GPUs will preserve API order, since the information on whether the triangles overlap is generally not available and preserving the order is always safe.
Getting back to your experiment, you seem to be assuming that the limitation is a 0.5MB cache at the output which is filled and flushed as pixels are rendered (256x512x4B = 0.5MB) but I don't agree with that interpretation.
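As a quick check on that figure, here is a minimal sketch of the arithmetic, taking the 256x512 region and 4 bytes per pixel straight from the numbers above. The function name `region_bytes` is just an illustrative helper, not anything from the test program.

```python
def region_bytes(width_px, height_px, bytes_per_pixel):
    """Storage needed to hold one region of pixels, in bytes."""
    return width_px * height_px * bytes_per_pixel


size = region_bytes(256, 512, 4)
print(size)          # 524288
print(size / 2**20)  # 0.5 (MiB)
```

So the 0.5MB number is consistent with a 256x512 region at 4 bytes per pixel; whether that region actually corresponds to an output buffer is the part in dispute.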
When you changed 'num floats per vertex' you overloaded the input of the rasteriser without changing anything about the framebuffer, so the change in behaviour is showing that there is an input vertex buffer of fixed size and the scheduler is waiting for that to be filled by the vertex shader before kicking off any rasterisation shading tasks.
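To make that reasoning concrete, here is a rough sketch of how a fixed-size input buffer would behave. Everything in it is an assumption for illustration: the buffer size is an arbitrary placeholder (the real size, if such a buffer exists, is unknown), floats are taken to be 4 bytes, and triangles are assumed not to share vertices.

```python
BYTES_PER_FLOAT = 4


def triangles_per_batch(buffer_bytes, floats_per_vertex):
    """Triangles that fit in a fixed-size vertex buffer (hypothetical model).

    Assumes 3 unshared vertices per triangle and 4-byte floats.
    """
    bytes_per_vertex = floats_per_vertex * BYTES_PER_FLOAT
    vertices = buffer_bytes // bytes_per_vertex
    return vertices // 3


# Placeholder size only, NOT a measured value.
HYPOTHETICAL_BUFFER_BYTES = 16 * 1024

for floats in (4, 8, 16, 32):
    print(floats, triangles_per_batch(HYPOTHETICAL_BUFFER_BYTES, floats))
```

Under this model, the number of triangles available before rasterisation starts falls as the per-vertex payload grows, while the framebuffer side is untouched. That is the pattern I would expect from an input limit rather than an output one.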
It certainly looks like the scheduler is trying to bin submitted triangles into regions where possible, and it will be interesting to find out more about that. It might still be true that there is an output buffer (not a cache; I can't imagine it's running any kind of caching algorithm) but I can't think of a good reason for that.
My original suggestion was to add some randomness to the submission order of the triangles, because the way you are doing it imposes a structure that might have nothing to do with the GPU. Again, I'll suggest that you submit triangles scattered around the whole screen, so that you don't have a single consistent submission pattern that covers the entire screen. I think that will yield an interesting result.
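Something along these lines is what I have in mind, as a sketch only. It assumes a screen tiled by non-overlapping triangles (so the final image does not depend on order) and uses made-up helper names, `grid_triangles` and `scattered_order`; it only builds the submission list and leaves the actual draw calls to your test program.

```python
import random


def grid_triangles(cols, rows, width, height):
    """Tile a width x height screen with two triangles per grid cell,
    returned in row-major (left-to-right, top-to-bottom) order."""
    cell_w, cell_h = width / cols, height / rows
    tris = []
    for r in range(rows):
        for c in range(cols):
            x0, y0 = c * cell_w, r * cell_h
            x1, y1 = x0 + cell_w, y0 + cell_h
            tris.append(((x0, y0), (x1, y0), (x0, y1)))
            tris.append(((x1, y0), (x1, y1), (x0, y1)))
    return tris


def scattered_order(tris, seed=0):
    """Return the same triangles in a pseudo-random submission order.
    A fixed seed keeps runs repeatable for comparison."""
    shuffled = list(tris)
    random.Random(seed).shuffle(shuffled)
    return shuffled


ordered = grid_triangles(cols=8, rows=8, width=1920, height=1080)
scattered = scattered_order(ordered, seed=1)
print(len(ordered), len(scattered))
```

Submitting `scattered` instead of `ordered` removes the regular sweep across the screen, so any region structure that still shows up in the output is more likely to come from the hardware than from the submission pattern.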
By the way, what did you have to do to get the rasteriser to work on 3 rows at a time instead of 1?