By: Megol (golem960.delete@this.gmail.com), August 6, 2016 2:30 am
Room: Moderated Discussions
Jouni Osmala (a.delete@this.b.c) on August 5, 2016 11:49 pm wrote:
> > > Regarding the vertex buffer: what you described is the same as how I was interpreting
> > > it already (but probably not expressing clearly), so I agree :-)
> > >
> > > > I'll suggest that you submit triangles scattered around the whole
> > > > screen so that you don't have a single consistent submission pattern
> > > > that covers the entire screen.
> > >
> > > I've tried scattering small triangles with "x += 1.0 + sin(VertexID / 3); y -= 1.0
> > > + sin(1.7 * (VertexID / 3));" and the behaviour is essentially the same as before.
> > >
> > > If I set it to 21 floats per vertex, it first draws the first approximately 128 triangles: It
> > > starts by drawing all those triangles in order, clipped to the top-left 256x512 px region, then
> > > it moves onto the next region and draws them all again, etc, until it's filled the screen. Then
> > > it starts again with the next ~128 triangles in the top-left region and repeats.
> > >
> > > If I set it to 17-20 floats per vertex, it's similar but draws ~256 triangles in each iteration.
> > >
> > > If I set it to 16 floats per vertex, it's similar but draws ~384 triangles in each iteration.
> > >
> > > The numbers don't match up exactly, but I think that indicates there's an approximately 64KB buffer
> > > for vertex-shaded primitives. Once that buffer is nearly full (or at the end of a draw call), the
> > > rasteriser starts processing all the triangles in that buffer (multiple times, once per 256x512 region),
> > > and when it's finished it waits for another 64KB of data before starting the next pass.
> > >
> > > (Those numbers are from code that puts unique values in every vertex output. If there
> > > are duplicate values then it draws more triangles in each pass, so I believe the
> > > buffer contains compressed data, which makes it more confusing to analyse.)
> > >
> > > The hypothesised 0.5MB tile cache/buffer/etc comes from
> > > those 256x512 regions (at 32bpp, no MSAA, no depth):
> > > it's reading and writing the framebuffer in those regions many times as it iterates over the few hundred
> > > triangles, but it's careful not to access two regions at
> > > once, which makes sense if they have 0.5MB of dedicated
> > > memory for it (though I suppose it could still make sense if it's just sharing L2 or something).
> >
> > Welcome to the world of black box reverse engineering. This is just the kind
> > of stuff we had to figure out in order to write the Nouveau driver.
> >
> > While you're at it take a look at the machine code being sent to the GPU, figure out all the
> > MMIO locations, caches, registers et cetera, and put sensible human readable assembler tags
> > on the machine code. Meanwhile you'll have thousands of people whining at you that you're taking
> > too long, or that it will never be as fast as the closed binary driver. The Slashdot crowd might
> > even take up a funding campaign, as if money can help you figure out things faster.
> >
> > Laugh, I'm trying to be funny behind the bitterness.
>
> It's parallelisable work with a high minimum skill level. While a small amount of money will not
> help, an amount that could get more highly skilled individuals to spend more time on it should help,
> as long as it's enough to overcome the effect it has on volunteers who don't get paid.
> Or maybe even split the work into small competitions where the first one
> to get verifiable results on X gets the prize money if they want.
What you are suggesting sounds like having a competition between pregnant women over who will give birth to a healthy baby first. I wouldn't call that parallelizable.
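
As an aside on the measurements quoted above: the buffer and tile sizes can be sanity-checked with a few lines of arithmetic. This is only a rough sketch. It assumes 4-byte floats, three unique vertices per triangle, and no compression, and the function names are made up for illustration. The quoted post itself suspects the buffer is compressed, so treat the results as upper-bound estimates rather than a description of the hardware.

```python
# Back-of-the-envelope check of the figures quoted in the thread.
# Assumptions (not confirmed by the thread): 32-bit floats, three unique
# vertices per triangle, no compression of vertex outputs.

BYTES_PER_FLOAT = 4


def tile_bytes(width_px, height_px, bytes_per_pixel=4):
    """Framebuffer bytes covered by one region (32bpp, no MSAA, no depth)."""
    return width_px * height_px * bytes_per_pixel


def batch_bytes(triangles, floats_per_vertex, vertices_per_triangle=3):
    """Uncompressed vertex-output bytes for one batch of triangles."""
    return triangles * vertices_per_triangle * floats_per_vertex * BYTES_PER_FLOAT


if __name__ == "__main__":
    # 256x512 region at 32bpp
    print("tile:", tile_bytes(256, 512) / 2**20, "MiB")

    # (floats per vertex, approximate triangles per pass) as reported
    observations = [(21, 128), (17, 256), (20, 256), (16, 384)]
    for floats, tris in observations:
        kib = batch_bytes(tris, floats) / 1024
        print(f"{floats} floats/vertex, ~{tris} triangles: {kib:.1f} KiB")
```

Under those assumptions the 256x512 region comes to exactly 0.5 MiB, and the batches come to 31.5 KiB (21 floats), 51 to 60 KiB (17 to 20 floats) and 72 KiB (16 floats). That is consistent with "the numbers don't match up exactly", but in the same ballpark as a buffer of roughly 64KB.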