By: Montaray Jack (none.delete@this.none.org), August 5, 2016 2:26 pm
Room: Moderated Discussions
Philip Taylor (philip.delete@this.zaynar.co.uk) on August 4, 2016 4:06 pm wrote:
> Regarding the vertex buffer: what you described is the same as how I was interpreting
> it already (but probably not expressing clearly), so I agree :-)
>
> > I'll suggest that you submit triangles scattered around the whole
> > screen so that you don't have a single consistent submission pattern
> > that covers the entire screen.
>
> I've tried scattering small triangles with "x += 1.0 + sin(VertexID / 3); y -= 1.0
> + sin(1.7 * (VertexID / 3));" and the behaviour is essentially the same as before.
>
> If I set it to 21 floats per vertex, it first draws the first approximately 128 triangles: It
> starts by drawing all those triangles in order, clipped to the top-left 256x512 px region, then
> it moves onto the next region and draws them all again, etc, until it's filled the screen. Then
> it starts again with the next ~128 triangles in the top-left region and repeats.
>
> If I set it to 17-20 floats per vertex, it's similar but draws ~256 triangles in each iteration.
>
> If I set it to 16 floats per vertex, it's similar but draws ~384 triangles in each iteration.
>
> The numbers don't match up exactly, but I think that indicates there's an approximately 64KB buffer
> for vertex-shaded primitives. Once that buffer is nearly full (or at the end of a draw call), the
> rasteriser starts processing all the triangles in that buffer (multiple times, once per 256x512 region),
> and when it's finished it waits for another 64KB of data before starting the next pass.
>
> (Those numbers are from code that puts unique values in every vertex output. If there
> are duplicate values then it draws more triangles in each pass, so I believe the
> buffer contains compressed data, which makes it more confusing to analyse.)
>
> The hypothesised 0.5MB tile cache/buffer/etc comes from those 256x512 regions (at 32bpp, no MSAA, no depth):
> it's reading and writing the framebuffer in those regions many times as it iterates over the few hundred
> triangles, but it's careful not to access two regions at once, which makes sense if they have 0.5MB of dedicated
> memory for it (though I suppose it could still make sense if it's just sharing L2 or something).
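For what it's worth, the arithmetic behind those estimates can be sketched as below. This is only a back-of-the-envelope check, and it rests on assumptions that are mine rather than anything confirmed about the hardware: 4-byte floats, three unshared vertices per triangle, and no compression (the quoted post suspects the buffer is compressed, so treat the results as rough upper bounds). The function names are hypothetical.

```python
# Back-of-the-envelope sizes for the observations quoted above.
# Assumptions (not confirmed hardware facts): 4-byte floats, three
# unshared vertices per triangle, uncompressed vertex outputs.
BYTES_PER_FLOAT = 4
VERTS_PER_TRI = 3


def batch_bytes(floats_per_vertex, triangles_per_batch):
    """Uncompressed size of one batch of vertex-shaded triangles."""
    return floats_per_vertex * BYTES_PER_FLOAT * VERTS_PER_TRI * triangles_per_batch


def tile_bytes(width_px, height_px, bits_per_pixel=32):
    """Colour-only footprint of one screen region (no MSAA, no depth)."""
    return width_px * height_px * bits_per_pixel // 8


# (floats per vertex, approximate triangles per pass) from the quoted post
observations = [(21, 128), (17, 256), (20, 256), (16, 384)]

for fpv, tris in observations:
    kib = batch_bytes(fpv, tris) / 1024
    print(f"{fpv} floats/vertex, ~{tris} triangles: {kib:.1f} KiB")

mib = tile_bytes(256, 512) / (1024 * 1024)
print(f"256x512 region at 32bpp: {mib:.2f} MiB")
```

Under those assumptions the batches come out between roughly 32 and 72 KiB, which is in the neighbourhood of the ~64KB guess without matching it exactly, and the 256x512 region comes out at exactly 0.5 MiB.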
Welcome to the world of black-box reverse engineering. This is just the kind of stuff we had to do in order to write the Nouveau driver.
While you're at it, take a look at the machine code being sent to the GPU, figure out all the MMIO locations, caches, registers et cetera, and put sensible human-readable assembler tags on the machine code. Meanwhile you'll have thousands of people whining at you that you're taking too long, or that it will never be as fast as the closed binary driver. The Slashdot crowd might even take up a funding campaign, as if money can help you figure things out faster.
Laugh, I'm trying to be funny behind the bitterness.
Even with AMD's documentation, very little is publicly known about what's going on inside the rasterizer of any GPU. Starting points for guesses, the basics: Bresenham's line algorithm, Juan Pineda's 1988 "A Parallel Algorithm for Polygon Rasterization", and the digital differential analyzer (DDA) algorithm.
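To make the Pineda reference concrete, here is a minimal sketch of the edge-function idea: every pixel centre is tested against three signed edge functions, and since each test is independent it parallelises naturally. This is a textbook illustration only, not a claim about what any real GPU does. The function names are hypothetical, and it leaves out fill rules (such as top-left), fixed-point snapping and hierarchical traversal.

```python
import math


def edge(ax, ay, bx, by, px, py):
    """Signed edge function: which side of the line A->B the point P lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(tri, width, height):
    """Return the (x, y) pixels whose centres fall inside the triangle.

    Simplified: pixels exactly on an edge count as inside, so shared
    edges would be drawn twice (real rasterizers use a fill rule).
    """
    (x0, y0), (x1, y1), (x2, y2) = tri
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return []  # degenerate triangle covers nothing

    # Only visit the bounding box, clamped to the render target.
    min_x = max(math.floor(min(x0, x1, x2)), 0)
    max_x = min(math.ceil(max(x0, x1, x2)), width - 1)
    min_y = max(math.floor(min(y0, y1, y2)), 0)
    max_y = min(math.ceil(max(y0, y1, y2)), height - 1)

    covered = []
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            px, py = x + 0.5, y + 0.5  # sample at the pixel centre
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            # Accept either winding: all three signs must agree with the area.
            if area > 0:
                inside = w0 >= 0 and w1 >= 0 and w2 >= 0
            else:
                inside = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside:
                covered.append((x, y))
    return covered


pixels = rasterize(((0, 0), (4, 0), (0, 4)), 4, 4)
print(len(pixels), "pixels covered")
```

Restricting the same loop to one screen region at a time would give the clipped, region-by-region behaviour described in the quoted post, though whether the hardware works that way is exactly the open question.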
We kind of have a double-slit, quantum-type problem: the act of trying to investigate what's happening is affecting the outcome. On NVIDIA, the attributes you're assigning unique values to are handled by a different part of the PolyMorph Engine, so that could possibly be changing the outcome the same way a pixel shader does.
It might be interesting to throw some geometry at it with even less spatial locality, something like a dense TIN. I think most modern hardware chokes on triangulated irregular networks, a shame really since they're great for terrain.
Does Microsoft still have a software implementation of DirectX? I haven't touched a Windows machine this decade. If they still do, what does it do with the program?