By: Jouni Osmala (a.delete@this.b.c), August 5, 2016 11:49 pm
Room: Moderated Discussions
> > Regarding the vertex buffer: what you described is the same as how I was interpreting
> > it already (but probably not expressing clearly), so I agree :-)
> >
> > > I'll suggest that you submit triangles scattered around the whole
> > > screen so that you don't have a single consistent submission pattern
> > > that covers the entire screen.
> >
> > I've tried scattering small triangles with "x += 1.0 + sin(VertexID / 3); y -= 1.0
> > + sin(1.7 * (VertexID / 3));" and the behaviour is essentially the same as before.
> >
> > If I set it to 21 floats per vertex, it first draws the first approximately 128 triangles: It
> > starts by drawing all those triangles in order, clipped to the top-left 256x512 px region, then
> > it moves onto the next region and draws them all again, etc, until it's filled the screen. Then
> > it starts again with the next ~128 triangles in the top-left region and repeats.
> >
> > If I set it to 17-20 floats per vertex, it's similar but draws ~256 triangles in each iteration.
> >
> > If I set it to 16 floats per vertex, it's similar but draws ~384 triangles in each iteration.
> >
> > The numbers don't match up exactly, but I think that indicates there's an approximately 64KB buffer
> > for vertex-shaded primitives. Once that buffer is nearly full (or at the end of a draw call), the
> > rasteriser starts processing all the triangles in that buffer (multiple times, once per 256x512 region),
> > and when it's finished it waits for another 64KB of data before starting the next pass.
> >
> > (Those numbers are from code that puts unique values in every vertex output. If there
> > are duplicate values then it draws more triangles in each pass, so I believe the
> > buffer contains compressed data, which makes it more confusing to analyse.)
> >
> > The hypothesised 0.5MB tile cache/buffer/etc comes from
> > those 256x512 regions (at 32bpp, no MSAA, no depth):
> > it's reading and writing the framebuffer in those regions many times as it iterates over the few hundred
> > triangles, but it's careful not to access two regions at
> > once, which makes sense if they have 0.5MB of dedicated
> > memory for it (though I suppose it could still make sense if it's just sharing L2 or something).
>
> Welcome to the world of black box reverse engineering. This is just the kind
> of stuff we had to do to figure out in order to write the Nouveau driver.
>
> While you're at it take a look at the machine code being sent to the GPU, figure out all the
> MMIO locations, caches, registers et cetera, and put sensible human readable assembler tags
> on the machine code. Meanwhile you'll have thousands of people whining at you that you're taking
> too long, or that it will never be as fast as the closed binary driver. The Slashdot crowd might
> even take up a funding campaign, as if money can help you figure out things faster.
>
> Laugh, I'm trying to be funny behind the bitterness.
It's parallelisable work with a high minimum skill level. A small amount of money will not help, but an amount large enough to let more highly skilled individuals spend more time on it should help, as long as it is enough to overcome the effect that paying some people has on the volunteers who don't get paid.
Or maybe even split the work into small competitions, where the first one to get verifiable results on X gets the prize money if they want it.
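
As a side note on the quoted measurements, the buffer and tile estimates can be checked with a little arithmetic. This is only a rough sketch under assumptions that are not in the original post: 4-byte floats, three unshared vertices per triangle (the triangles were scattered, so no vertex reuse), and no compression. The function names are made up for illustration.

```python
BYTES_PER_FLOAT = 4          # assumption: 32-bit floats
VERTICES_PER_TRIANGLE = 3    # assumption: no vertices shared between triangles


def batch_bytes(triangles, floats_per_vertex):
    """Uncompressed size in bytes of the vertex outputs for one batch."""
    return triangles * VERTICES_PER_TRIANGLE * floats_per_vertex * BYTES_PER_FLOAT


def tile_bytes(width, height, bytes_per_pixel=4):
    """Framebuffer footprint in bytes of one screen region (32bpp by default)."""
    return width * height * bytes_per_pixel


# (floats per vertex, approximate triangles per pass) as reported in the quote
observations = [(21, 128), (20, 256), (17, 256), (16, 384)]

for floats_per_vertex, triangles in observations:
    size_kb = batch_bytes(triangles, floats_per_vertex) / 1024
    print(f"{floats_per_vertex} floats/vertex, ~{triangles} triangles: {size_kb:.1f} KB")

print(f"256x512 region at 32bpp: {tile_bytes(256, 512) / (1024 * 1024)} MB")
```

Under those assumptions the batches come out somewhere between about 32 KB and 72 KB, which fits "the numbers don't match up exactly" but is in the neighbourhood of 64KB, and the 256x512 region is exactly 0.5 MB.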