By: Montaray Jack (none.delete@this.none.org), August 6, 2016 10:42 am
Room: Moderated Discussions
Jouni Osmala (a.delete@this.b.c) on August 6, 2016 9:08 am wrote:
> Megol (golem960.delete@this.gmail.com) on August 6, 2016 3:30 am wrote:
> > Jouni Osmala (a.delete@this.b.c) on August 5, 2016 11:49 pm wrote:
> > > > > Regarding the vertex buffer: what you described is the same as how I was interpreting
> > > > > it already (but probably not expressing clearly), so I agree :-)
> > > > >
> > > > > > I'll suggest that you submit triangles scattered around the whole
> > > > > > screen so that you don't have a single consistent submission pattern
> > > > > > that covers the entire screen.
> > > > >
> > > > > I've tried scattering small triangles with "x += 1.0 + sin(VertexID / 3); y -= 1.0
> > > > > + sin(1.7 * (VertexID / 3));" and the behaviour is essentially the same as before.
> > > > >
> > > > > If I set it to 21 floats per vertex, it first draws the first approximately 128 triangles: It
> > > > > starts by drawing all those triangles in order, clipped to the top-left 256x512 px region, then
> > > > > it moves onto the next region and draws them all again, etc, until it's filled the screen. Then
> > > > > it starts again with the next ~128 triangles in the top-left region and repeats.
> > > > >
> > > > > If I set it to 17-20 floats per vertex, it's similar but draws ~256 triangles in each iteration.
> > > > >
> > > > > If I set it to 16 floats per vertex, it's similar but draws ~384 triangles in each iteration.
> > > > >
> > > > > The numbers don't match up exactly, but I think that indicates there's an approximately 64KB buffer
> > > > > for vertex-shaded primitives. Once that buffer is nearly full (or at the end of a draw call), the
> > > > > rasteriser starts processing all the triangles in that buffer (multiple times, once per 256x512 region),
> > > > > and when it's finished it waits for another 64KB of data before starting the next pass.
> > > > >
> > > > > (Those numbers are from code that puts unique values in every vertex output. If there
> > > > > are duplicate values then it draws more triangles in each pass, so I believe the
> > > > > buffer contains compressed data, which makes it more confusing to analyse.)
> > > > >
> > > > > The hypothesised 0.5MB tile cache/buffer/etc comes from
> > > > > those 256x512 regions (at 32bpp, no MSAA, no depth):
> > > > > it's reading and writing the framebuffer in those regions many times as it iterates over the few hundred
> > > > > triangles, but it's careful not to access two regions at
> > > > > once, which makes sense if they have 0.5MB of dedicated
> > > > > memory for it (though I suppose it could still make sense if it's just sharing L2 or something).
> > > >
> > > > Welcome to the world of black box reverse engineering. This is just the kind
> > > > of stuff we had to figure out in order to write the Nouveau driver.
> > > >
> > > > While you're at it take a look at the machine code being sent to the GPU, figure out all the
> > > > MMIO locations, caches, registers et cetera, and put sensible human readable assembler tags
> > > > on the machine code. Meanwhile you'll have thousands of people whining at you that you're taking
> > > > too long, or that it will never be as fast as the closed binary driver. The Slashdot crowd might
> > > > even take up a funding campaign, as if money can help you figure out things faster.
> > > >
> > > > Laugh, I'm trying to be funny behind the bitterness.
> > >
> > > It's parallelisable work with a high minimum skill level. While a small amount of money will not
> > > help, an amount that could get more highly skilled individuals to spend more time on it should help,
> > > as long as it's enough to overcome the effect it has on volunteers who don't get paid.
> > > Or maybe even splitting the work into small competitions where the first one
> > > to get verifiable results on X gets the prize money if they want.
> >
> > What you are suggesting sounds like having a competition between pregnant women about
> > who will give birth to a healthy baby first. I wouldn't call that parallelizable.
>
> Totally opposite to your analogy. If there are multiple competitions on MMIO locations,
> caches, registers etc., then it's more like many people competing on many different smaller
> issues. There is a need to actually split the work smartly into separate competitions.
It also depends on what the real goal is. I know that I was motivated first by wanting to understand how a GPU works; helping out the Linux and open source community was the second motivation, so a working driver is only a side effect. Hiring other people won't help me understand how it works, but it may get the driver completed faster.
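
For anyone who wants to check the arithmetic behind the measurements quoted above, here is a rough sketch. It assumes 4-byte floats, three unshared vertices per triangle, and no compression (the quoted post suspects the real buffer is compressed, so treat these as upper bounds on the raw data, not as the hardware's actual layout). The function names are made up for illustration.

```python
BYTES_PER_FLOAT = 4      # assumption: 32-bit floats in the vertex outputs
VERTS_PER_TRIANGLE = 3   # assumption: no vertex sharing between triangles


def batch_bytes(floats_per_vertex, triangles):
    """Raw, uncompressed size of one batch of vertex-shaded triangles."""
    return floats_per_vertex * BYTES_PER_FLOAT * VERTS_PER_TRIANGLE * triangles


def tile_bytes(width_px, height_px, bits_per_pixel):
    """Framebuffer footprint of one screen region (colour only, no depth/MSAA)."""
    return width_px * height_px * bits_per_pixel // 8


# (floats per vertex, approximate triangles per pass) as reported in the quote
observations = [(21, 128), (20, 256), (17, 256), (16, 384)]

for floats_per_vertex, triangles in observations:
    size_kb = batch_bytes(floats_per_vertex, triangles) / 1024
    print(f"{floats_per_vertex} floats/vertex, ~{triangles} triangles: {size_kb:.1f} KB")

tile_mb = tile_bytes(256, 512, 32) / (1024 * 1024)
print(f"256x512 region at 32bpp: {tile_mb} MB")
```

Under those assumptions the batches come out at roughly 31.5 KB, 60 KB, 51 KB and 72 KB, which brackets the guessed 64KB without matching it exactly, just as the quoted post says. The 256x512 region at 32bpp is exactly 0.5MB.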