By: Maynard Handley (name99.delete@this.name99.org), May 14, 2013 4:49 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on May 6, 2013 9:27 pm wrote:
> none (none.delete@this.none.com) on May 6, 2013 3:32 pm wrote:
> [snip]
> > I'm also surprised to find 16 entries for 2 MB pages. Are
> > such pages that often in use? Vram and kernel mapping?
>
> There is, of course, the highly unlikely possibility that the huge page TLB entries are also used to cache
> PDEs (as I have often suggested). If that was the case, the number of entries might be even larger and
> still be reasonably useful. Since neither Intel nor AMD have ever done this (but instead used separate
> caching structures for PDEs et al.), I am extremely skeptical that such flexibility is supported.
>
> For servers, 2MiB pages might find significant use. I would also not be surprised if a unified
> VM (Davlik??) or a web browser could exploit 2MiB pages even with only 1GiB of memory.
There is easy scope for wider use of 2MiB pages. For example, OSX uses a magazine malloc with three zones, for small, medium, and large allocations. The small and medium zones are initially allocated at 2MiB in size, and grow by 2MiB a pop. (Large allocations are handed out as some number of 4KiB pages.)
It would (IMHO) be an appropriate optimization to back these small and medium zones with 2MiB pages right off the bat. (For all I know, the latest OSX already does this, but Apple is so secretive that WTF knows.)
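Purely as a sketch of what "right off the bat" could look like: OSX has (since 10.7) exposed anonymous 2MiB superpages through mmap, with the VM_FLAGS_SUPERPAGE_SIZE_2MB flag from mach/vm_statistics.h passed where the fd normally goes. The helper name and the fallback policy here are my invention, not anything from the real magazine malloc:

#include <sys/mman.h>
#include <mach/vm_statistics.h>  /* VM_FLAGS_SUPERPAGE_SIZE_2MB */
#include <stddef.h>

#define ZONE_CHUNK (2UL * 1024 * 1024)  /* zones grow 2MiB a pop */

/* Hypothetical zone-growth helper: back the new 2MiB slab with one
   2MiB superpage instead of 512 separate 4KiB pages. */
static void *grow_zone_2mib(void)
{
    /* On OSX, anonymous superpages are requested by passing the
       superpage flag in place of the fd argument. */
    void *p = mmap(NULL, ZONE_CHUNK, PROT_READ | PROT_WRITE,
                   MAP_ANON | MAP_PRIVATE,
                   VM_FLAGS_SUPERPAGE_SIZE_2MB, 0);
    if (p == MAP_FAILED) {
        /* Superpages can be refused (pool exhausted, unsupported
           hardware); degrade gracefully to ordinary 4KiB pages. */
        p = mmap(NULL, ZONE_CHUNK, PROT_READ | PROT_WRITE,
                 MAP_ANON | MAP_PRIVATE, -1, 0);
    }
    return p == MAP_FAILED ? NULL : p;
}

The nice property is that nothing above the allocator has to care whether the request succeeded; the zone just gets slower, not broken, when superpages aren't available.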
As I see it, even without worrying about auto-consolidating and fragmenting 4KiB pages, there is scope for:
- use of 1GiB pages to cover VRAM (not everyone has 1GiB of VRAM yet, but we're headed there)
- use of 1GiB pages to cover wired OS data (again, this maybe doesn't make sense below 16 or 24 GiB of RAM, but we're headed there)
- use of 2MiB pages for certain common heap structures, as I've described above (see the sketch after this list for the Linux flavor of the request)
- use of 2MiB pages for at least the code sections of common DLLs (the idea being that enough clients are touching different bits of them that there's little waste in mapping the parts that wouldn't otherwise be paged in)
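As a point of comparison (I can't see inside Apple's kernel, but Linux shows both mechanisms in shipping form today), here is what the heap-structures bullet looks like there: explicit huge pages via MAP_HUGETLB (2.6.32+), with the transparent-hugepage hint MADV_HUGEPAGE (2.6.38+) as fallback. alloc_huge is my own name:

#include <sys/mman.h>
#include <stddef.h>

/* Allocate sz bytes (ideally a multiple of 2MiB) backed by 2MiB
   pages if the kernel will give them to us. */
void *alloc_huge(size_t sz)
{
    /* Explicit huge pages: needs a preallocated hugetlb pool
       (vm.nr_hugepages); fails cleanly otherwise. */
    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;
    /* Fallback: ordinary mapping, plus a hint that this region is a
       good candidate for transparent promotion to 2MiB pages. */
    p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    madvise(p, sz, MADV_HUGEPAGE);
    return p;
}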
My guess is that the statistics would show 2MiB to be suboptimal for most stacks, with 4KiB better. For app code, BSS, and the various other globals, I have no idea what the stats would say.
I think that Apple and MS are both in a position to instrument their apps and libraries to get this right for their own code, and to gather stats on running 3rd-party apps and make the appropriate decision. It seems that with LLVM, link-time optimizations are finally being taken seriously again by Apple (Apple had great tools for this towards the end of the MacOS era, which were abandoned with OSX). LLVM also has (more experimental) facilities for the kind of run-time profiling I'm suggesting.
So... the delay is frustrating, but I live in hope that in two or three years we'll see a 15%-or-so across-the-board speed boost here, coming from a combination of optimally link-time-packed binaries and substantially smarter use of page sizes for loading those binaries and creating their in-memory sections.
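And for anyone who'd rather gather their own stats than wait for Apple or MS: a quick-and-dirty microbenchmark along the following lines (my own sketch; the buffer size and rep count are arbitrary) touches one byte per 4KiB page across a buffer far larger than any DTLB's reach, once backed by 4KiB pages and once by 2MiB pages, so the runtime gap is mostly TLB behavior:

#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#define SZ   (512UL << 20)   /* 512MiB: well past any DTLB's reach */
#define REPS 32

/* Touch one byte per 4KiB page, repeatedly. With 4KiB pages this is
   roughly a TLB miss per access; with 2MiB pages, 512 consecutive
   accesses share a single entry. */
static double walk(volatile char *buf)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        for (size_t off = 0; off < SZ; off += 4096)
            buf[off]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    char *small = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    char *huge  = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB, -1, 0);
    if (small == MAP_FAILED || huge == MAP_FAILED) {
        perror("mmap");  /* MAP_HUGETLB needs vm.nr_hugepages set */
        return 1;
    }
    printf("4KiB pages: %.3f s\n", walk(small));
    printf("2MiB pages: %.3f s\n", walk(huge));
    return 0;
}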