Pre-populating anonymous pages

By: Travis Downs (travis.downs.delete@this.gmail.com), June 8, 2019 8:18 am
Room: Moderated Discussions
Brendan (btrotter.delete@this.gmail.com) on June 8, 2019 2:55 am wrote:
> Hi,
>
> Travis Downs (travis.downs.delete@this.gmail.com) on June 7, 2019 2:16 pm wrote:
> > Here's a totally artificial benchmark, but I get 0.33s for THP, 1.0s
> > for MAP_POPULATE 4k pages and 1.9s w/o MAP_POPULATE 4k pages.
>
> Let me see if I understand this correctly. My assumptions are:
>
> a) When a process calls "mmap()" the kernel looks at various things (primarily how much physical
> RAM is currently free, but also things like whether meltdown mitigation are present, how fast/slow
> swap space is, etc) to try to estimate the optimum number of pages to pre-populate.

No, as far as I know (and have observed) it does not populate any pages.


> b) The MAP_POPULATE is merely a hint that influences the kernel's "optimum number of pages to pre-populate"
> estimation; partly because it's unreasonable to expect user-space to take everything into account
> itself, and partly because you can't shove "how many percent" into a 1-bit flag anyway.

No, as far as I know (and have observed) it populates all pages. Of course, if there is not enough RAM, something has to give: maybe it's smart and populates only a fraction in that case, or maybe it just keeps populating the whole thing while earlier pages start getting swapped out. In any case, doing a giant MAP_POPULATE seems like a bad idea.

>
> c) The kernel may opportunistically use large/huge pages when pre-populating
> however many pages it estimated as the optimum amount.

Yes, although as mentioned there is not really an estimation as you suggest.

The behavior during population probably depends on the /sys/kernel/mm/transparent_hugepage/enabled setting. Originally the trend seemed to be to set this to "always", which means huge pages could be used at mmap time for a suitably aligned region, if any were available. Recently, though, many distros (at least Debian and its derivatives, which is a lot) have switched to "madvise", which I'm pretty sure means huge pages will never be used at mmap time: you need to madvise the region with MADV_HUGEPAGE first to get huge pages. In practice this works reliably if you do it before populating the region, at least on my system with a lot of free RAM; how likely it is to change the status of existing pages if you madvise on top of an already-populated region, I'm not sure.

In case you haven't read it yet, this is basically the manual for THP.

>
> d) The kernel may "de-populate" later (e.g. in an attempt to avoid using swap space when there's
> a large increase in physical memory usage after an area was already pre-populated).

Kind of, although I think this is just part of the normal "swapping out" algorithm. If there is memory pressure (as determined by a complex heuristic), the kernel will find some pages to "swap out". That might not actually involve swapping (writing anything to disk) if the pages are unmodified and backed by a file, so I think those pages are preferred. This would also be the case for pages mapped to the zero page (which happens when you read from a mapped-but-unpopulated page: it first gets mapped to the shared zero page, and only on a write do you get your real, unique anonymous page), so I guess those pages could be reclaimed without writing anything.

For pages populated with MAP_POPULATE, or pages actually written to, I don't see how the kernel could avoid writing them out, so it's going to use swap space if memory pressure hits: once a writable page has been mapped, I think it counts as in-use from the kernel's point of view. How would it know the page doesn't hold data that needs to be preserved? Maybe there is a special check for all-zero pages that can optimize out the write, I dunno.

>
> e) When a "not populated yet" page is modified (causing a page fault) the page fault handler will populate
> that page, but may also pre-populate other pages in an attempt to avoid "likely future page faults",

No, not today for anonymous pages. This feature is called fault-around and works for file-backed pages, not anonymous pages (anonymous pages being MAP_ANONYMOUS mappings and the kind malloc uses to get memory from the kernel).

> possibly including tracking history (e.g. some kind of "consecutive write detector") and using it (in
> conjunction with other information, like how much physical RAM is free now and if meltdown mitigations
> are being done) to determine how much extra to pre-populate during that page fault.

AFAIK the fault-around logic is simple: it just maps in a certain fixed number of pages around the faulting location, if they are available. The number is adjustable with a debugfs tunable, fault_around_order. If you google "Linux fault around" you'll find good info.

>
> f) The results you've shown (" 0.33s for THP, 1.0s for MAP_POPULATE 4k pages and 1.9s w/o MAP_POPULATE
> 4k pages") are a strong indicator that Linux failed to implement some or all of the above correctly;
> because the performance difference between these cases should be significantly smaller.

No, because as noted above, most of your assumptions about how it works are wrong. At least, I understood you to be asking how it works, but maybe that was all how you think it should work?

I'm pretty sure Linus (even new Linus?) will go apoplectic if anyone suggests putting complicated predictors into the kernel fault path... but that said, maybe there is room for an adaptive-but-very-dumb fault-around for anonymous pages?

I think the concern is that mapping in nearby pages for anonymous mappings is very different from the file-backed case. In the file-backed case, fault-around maps in pages that already exist in the page cache, i.e., are already in RAM, so there is essentially no additional memory use (beyond a few bytes for extra PTE entries). There is no equivalent for anonymous pages, so in a sparse case, where the process touches only a few pages out of a large region, a 16x fault-around could increase memory use by 16x.

However, you could track the actual behavior of the application and capture simple cases like linearly touching one page after another, and be like: "Listen, the application has just faulted in 100 consecutive pages, maybe some fault-around (ahead) is OK." If you make the heuristic conservative, you could put a strong bound on the amount of wasted memory it incurs.