By: Simon Farnsworth (simon.delete@this.farnz.org.uk), August 2, 2016 8:12 am
Room: Moderated Discussions
wumpus (lost.delete@this.in-a.cave.net) on August 2, 2016 7:57 am wrote:
> Simon Farnsworth (simon.delete@this.farnz.org.uk) on August 1, 2016 1:01 pm wrote:
> > wumpus (lost.delete@this.in-a.cave.net) on August 1, 2016 11:29 am wrote:
> > > Rob Clark (robdclark.delete@this.gmail.com) on August 1, 2016 9:49 am wrote:
> > > > vvid (no.delete@this.thanks.com) on August 1, 2016 9:45 am wrote:
> > > > > Nvidia uses tiles since ~NV20.
> > > > >
> > > > > These small rectangles on video are ROP tiles (collection of pixels placed at
> > > > > adjacent location in the same memory bank) and can be compressed (nv40+).
> > > > >
> > > > > http://www.google.ch/patents/US7545382
> > > > > http://www.freepatentsonline.com/y2015/0154733.html
> > > > > https://kernel.googlesource.com/pub/scm/linux/kernel/git/mchehab/linux-media/+/media/v4.7-2/drivers/gpu/drm/nouveau/nvkm/subdev/fb/nv40.c
> > > > >
> > > > > Specific ordering pattern is likely a result of non-linear (swizzled)
> > > > > memory layout of ROP tiles grouped in a second level structure.
> > > > >
> > > > > AMD uses 8x8 tiles. It is highly integrated with the HSR system.
> > > > >
> > > >
> > > > "tile" is a bit of an overloaded term. What you are describing above is tiled format (ie. layout
> > > > of pixels in memory), which is a different thing from an internal tile buffer (ie. tiler gpu)
> > > >
> > >
> > > I'll have to watch the video, but it seems to me that "tiling" is largely a means of increasing
> > > cache hits while rendering (if not Nvidia's method, at least it can be used that way). Note
> > > that even when not deferred, unless the API/engine is specifically designed to spit out tiles
> > > (and likely even then) it is going to add roughly one frame of latency (because you presumably
> > > have to collect enough polygons to bother with each tile). This isn't a terribly good long
> > > term thing to do with VR on the horizon (which appears to want latency above all else).
> >
> > I don't see how you get the added frame of latency; both OpenGL
> > and Vulkan have concepts that effectively delimit
> > individual frames, and even a full-frame IMR is allowed to
> > batch the drawing up until you hit the "end of rendering"
> > command (be it glFlush(), glSwapBuffers(), or the more powerful Vulkan synchronization primitives).
>
> I don't see how "allowed" == "required". From the demo it appears that the ATI
> board simply draws the triangles as they appear, no latency involved (of course
> they could be waiting to receive all the triangles first, but that seems weird).
>
At least with AMD cards, I've seen the driver batch up rendering commands until a glFlush(), glFinish() or eglSwapBuffers(). However, that adds very little latency (microseconds at most), because the hardware still renders extremely quickly; the driver just waits for the application to indicate "you have all the data for this frame" before it sends the batched commands to the hardware command queue, and then schedules presentation of the frame for the first VSync interrupt after the hardware completes rendering.
There's thus no need for an entire frame of latency: you buffer until you get the "all rendering for this frame is complete" marker, then render at full speed, ready to present at the earliest possible opportunity.
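To make that concrete, here's a minimal sketch (my own illustration, not anything out of a real driver) of the render-loop pattern I mean, assuming an already-initialised EGL/GLES2 context and a made-up "struct mesh" standing in for whatever the application draws. The point is only that the driver is free to record the draw calls and submit them when the frame-end marker arrives, which costs far less than a frame:

#include <EGL/egl.h>
#include <GLES2/gl2.h>

/* Hypothetical per-mesh data; stands in for whatever the application draws. */
struct mesh {
    GLsizei index_count;
    const GLushort *indices;   /* client-side index array, for brevity */
};

void render_one_frame(EGLDisplay dpy, EGLSurface surf,
                      const struct mesh *meshes, int num_meshes)
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    /* The driver may simply record these draws into its own command
     * buffer rather than submitting each one to the hardware at once. */
    for (int i = 0; i < num_meshes; i++)
        glDrawElements(GL_TRIANGLES, meshes[i].index_count,
                       GL_UNSIGNED_SHORT, meshes[i].indices);

    /* Frame-end marker: "you have all data for this frame".  Only now do
     * the batched commands need to reach the hardware queue; presentation
     * is then scheduled for the first VSync after rendering completes, so
     * the buffering costs microseconds, not a whole frame. */
    eglSwapBuffers(dpy, surf);
}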