By: Chester (lamchester.delete@this.gmail.com), September 29, 2022 2:22 pm
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on September 28, 2022 6:57 pm wrote:
> Chester, I have no idea what's wrong with your setup but when I try to comment (either
> via Safari or via Chrome) I always get the complaint "Nonce verification failed".
That's weird, I've seen other people post comments. Cheese seemed to have no trouble approving those.
> Since I can't post there, and since it took some time to write this and it may
> be of interest, I'll post here. Even David may find the essential point of interest
> insofar as it touches on ML (tech) vs "ML" (advertising buzzword).
>
> ---------------------
>
> Do we have any details as to EXACTLY how ML/AI is used in this sort of upscaling?
Nope, Nvidia has been tight-lipped as usual about exactly how DLSS works. From what they've said, it's a machine learning model. Inputs are an optical flow field, a previous full-res frame, and the current frame rendered at reduced resolution. Outputs are a full-res current frame and, for DLSS3, an intermediate frame as well.
Idk exactly how they trained it. Probably against reference full-resolution frames.
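Just to make that claimed data flow concrete, here's a rough sketch of what the interface would look like based only on those stated inputs and outputs. Everything in it (the dlss_step function, the model methods) is made up for illustration; Nvidia hasn't published how the model actually works.

```python
# Purely illustrative sketch of the stated DLSS inputs/outputs.
# Nvidia hasn't disclosed internals, so every name here is hypothetical.
import numpy as np

def dlss_step(model, low_res_frame: np.ndarray, prev_full_frame: np.ndarray,
              optical_flow: np.ndarray):
    """Return the upscaled current frame and, for DLSS3, a generated intermediate frame."""
    # inputs: current frame at reduced resolution, previous full-res frame,
    # and a per-pixel optical flow field describing motion between frames
    full_res = model.upscale(low_res_frame, prev_full_frame, optical_flow)
    # DLSS3 additionally synthesizes a frame in between, using the same motion data
    intermediate = model.interpolate(prev_full_frame, full_res, optical_flow)
    return full_res, intermediate
```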
> I'd like to compare with the one case I do know something about, namely Apple.
> Apple makes a big deal about image fusion in their camera, for example introducing
> the brand name Photonic Engine for this operation in the iPhone14 class phones.
I think DLSS and Photonic Engine are different enough that no comparison makes sense. Nvidia is trying to upscale and generate new frames in real time; Apple is post-processing camera images.
> I want to contrast the journalist claims about Photonic Engine with Apple's claims:
> Journalist: "Photonic Engine leverages hardware inside the iPhone 14, iPhone 14 Plus, iPhone
> 14 Pro, and iPhone 14 Pro Max and applies some machine learning and iOS 16 software magic."
> Apple: "Then we added the all-new Photonic Engine, our game-changing image pipeline.
> It allows Deep Fusion — which merges the best pixels from multiple exposures into
> one phenomenal photo — to happen earlier in the process on uncompressed images.
> This preserves much more data to deliver brighter, more lifelike colors
> and beautifully detailed textures in less light than ever."
Looks like a lot of marketing speak about a decades-old technique: stacking images to reduce noise and increase dynamic range. Working "on uncompressed images" (raw files) is typical too, because you lose dynamic range once the image is processed to JPEG.
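For a sense of what plain exposure stacking buys you, here's a minimal numpy sketch of the generic technique (not Apple's pipeline): average several already-aligned frames in linear, uncompressed space, and random noise falls off roughly as 1/sqrt(N).

```python
# Minimal sketch of generic exposure stacking; not Apple's actual pipeline.
# Assumes the frames are already aligned and still in linear (raw-like) space.
import numpy as np

def stack_frames(frames):
    """Average N aligned linear frames; random noise falls off roughly as 1/sqrt(N)."""
    acc = np.zeros_like(frames[0], dtype=np.float64)
    for f in frames:
        acc += f
    return acc / len(frames)  # tone-map / compress to JPEG only after stacking

# e.g. eight noisy captures of the same static scene
rng = np.random.default_rng(0)
scene = rng.uniform(0.0, 1.0, (480, 640))
frames = [scene + rng.normal(0.0, 0.05, scene.shape) for _ in range(8)]
print(np.std(frames[0] - scene))             # noise of a single frame (~0.05)
print(np.std(stack_frames(frames) - scene))  # roughly 0.05 / sqrt(8)
```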
> But what do the patents actually say? If you look at them, they are more or less "traditional"
> image processing, though in the wavelet rather than the fourier domain. The most interesting
> aspect for our purposes is that if we want to fuse two images, we do so by:
> - dividing the image into tiles
> - finding equivalent keypoints in each tile
> - finding a warp that maps each tile to its correspondent such that the keypoints align
> - fusing the warped image2 with image1.
Yeah, if you want to stack images, you want to find corresponding keypoints and transform the images so they line up. Nearly every modern cell phone seems to have some sort of image stacking mode, and they all seem to work well in most situations. Maybe they're using machine learning and the NPU to do alignment. Or maybe not. I don't know.
Maybe doing it in tiles saves processing power or improves cache efficiency. Probably relevant to a phone where battery life is a high priority.
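If you want to see what the generic align-then-fuse step looks like outside of Apple's patented tile/wavelet approach, a bare-bones OpenCV version is something like this: ORB keypoints, a RANSAC homography, a warp, then a naive average. It's only a sketch of the general idea, not what any phone vendor ships.

```python
# Generic align-and-fuse sketch with OpenCV; not the patented tile/wavelet approach.
import cv2
import numpy as np

def align_and_fuse(img1, img2):
    """Warp img2 onto img1 using matched keypoints, then average the two."""
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(img1, None)
    k2, d2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d2, d1)
    src = np.float32([k2[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)   # warp mapping img2 -> img1
    warped = cv2.warpPerspective(img2, H, (img1.shape[1], img1.shape[0]))
    return cv2.addWeighted(img1, 0.5, warped, 0.5, 0)       # naive fuse: average
```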
> The reason I go through all this is because
> (a) it's pretty damn interesting, isn't it? :-)
It is, but it's also completely unrelated to what Nvidia is doing with DLSS.
> (b) it's very different, IMHO, from what we think of as AI/ML. It's not exactly a lie to say that
> AI/ML is involved, but it's also somewhat misleading, at least compared to what we think of as
> AI/ML. The important point is finding good, informative matching key features, not any sort of
> "recognition" or "aesthetic judgement based on scanning trillions of photos" or whatever.
> (c) my GUESS is that the sort of spatial and temporal upscaling described in the article essentially
> operates in the same way: find keypoints, find a warping function (now called optical flow;
If you stretch it, they're kind of related in that both optical flow and matching keypoints involve finding matching features across images. But beyond that, the two are very different. If you're stacking images and matching keypoints, you don't care about creating velocity vectors to predict where an image feature will move to next. All you care about is aligning the images so you can stack them without obvious ghosting.
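To make that contrast concrete: dense optical flow gives you a per-pixel motion field you can use to push a frame forward in time, not just a single alignment warp. Here's a tiny OpenCV Farneback sketch of the idea. It's only an illustration; Nvidia uses a dedicated hardware optical flow accelerator, and I have no idea how DLSS3 actually consumes the flow field.

```python
# Dense optical flow sketch (Farneback). Only an illustration of the concept;
# DLSS3 relies on a hardware optical flow accelerator, not this CPU code.
import cv2
import numpy as np

def extrapolate_frame(prev_gray, curr_gray):
    """Estimate per-pixel motion from prev -> curr, then push curr forward by it."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = curr_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # backward-warp: for each output pixel, sample where it would have come from
    # if the motion continued for one more frame (a rough approximation)
    map_x = (grid_x - flow[..., 0]).astype(np.float32)
    map_y = (grid_y - flow[..., 1]).astype(np.float32)
    return cv2.remap(curr_gray, map_x, map_y, cv2.INTER_LINEAR)
```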
In cell phones, your output image has to stand up to close scrutiny and only needs to be generated fast enough to keep users from complaining. Obvious artifacts and ghosting need to be avoided at all costs, because they could trash an entire image. With Nvidia's DLSS, you need to get the image out in a matter of milliseconds, because time spent generating it adds latency and nobody likes that when gaming. Artifacts should be minimized, but a generated frame only stays on screen for a few milliseconds before being replaced by an actually rendered frame. So occasional artifacts (and early DLSS3 videos do show them) are probably acceptable as long as they don't happen too often.
If you want to compare Apple's Photonic Engine to something, I think Google's Pixel lineup is a good start. They definitely have several image stacking modes (Night Sight, HDR+) and do a very good job of aligning images even if you aren't holding the phone steady.