By: rwessel (robertwessel.delete@this.yahoo.com), June 3, 2013 10:45 am
Room: Moderated Discussions
Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on June 3, 2013 9:51 am wrote:
> rwessel (robertwessel.delete@this.yahoo.com) on May 31, 2013 3:11 pm wrote:
> > Obviously the OS needs to support larger pages in order to make use of them, but in most implementations,
> > an OS not wanting to use them can just ignore them. These are often called superpages.
> >
> > x86 CPUs have supported 2MB pages with the 8 byte (PAE style) page tables for quite some
> > time. 4MB pages existed along with a different set of address space extensions, known
> > as PSE, and these are largely irrelevant now - most OSs, even in 32 bit mode, use the
> > PAE format page tables (because it provides access to the "NX" page access bit).
> >
> > On x86 (and many other systems, but certainly not all),
> > page tables are basically arranged in a tree. Non-PAE
> > page tables in x86 use a 4K "page directory" pointing to
> > up to 1024 (4KB) page tables, each of which contains 1024
> > four-byte page table entries, each mapping one virtual page to a
> > physical page (or saying the mapping is invalid). That
> > 1024 x 1024 leads to the million page table entries needed to map the entire 32 bit address space.
> >
> > A separate copy of that tree exists for each address space
> > (roughly synonymous with "process" in this context).
> >
> > An important factor is that not all 1024 entries in the page directory need be populated (for example, if
> > an entire aligned 4MB region is unmapped, that can be handled
> > by marking the entry in the page directory invalid,
> > which avoids the need to provide an entire 4KB page table with nothing but invalid slots in it). This
> > considerably compresses the size of the page table tree (otherwise
> > each address space would need 4MB of translation
> > table). Shared areas are created by having two page tables map the same physical page(s).
> >
> > With PAE, page table entries are lengthened to eight bytes
> > each (which leaves room for more physical address
> > bits, as well as other control bits, like the NX bit). But since the page tables remain 4KB each, each
> > level of page table only holds 512 entries (rather than 1024 in the non-PAE case), so that a single page
> > directory entry only maps a 2MB region of memory (since it points to a page table that now has only 512
> > entries). So there are more levels of page table (three, if you use PAE in 32 bit mode, up to four* total
> > in 64 bit mode). The third level of page table maps 1GB per entry, the fourth 512GB per entry.
> >
> > The first (and smaller) advantage of large pages is that
> > you can map large amounts of storage without having
> > to construct page tables down to the 4KB page level. For example, with 2MB pages, you can map the entire
> > 4GB address space with four page directory pages (totaling
> > 16KB, plus 32 bytes for the PDPTEs). Newer processors
> > have allowed pages to be defined at the third page table level as well, hence the 1GB pages.
> >
> > Logically, after the CPU generates the virtual address, it walks the page table until it finds the
> > page table entry needed to determine the physical page assignment. In practice, doing that would be
> > horrible (even with two level page tables, you've now tripled the number of memory accesses needed
> > for any memory reference). That's why we have TLBs. TLBs are special caches that remember frequently
> > used translations. So if you accessed virtual page 12345, the first time the CPU would have to walk
> > the page tables to find out that that was actually on physical page 9876. The second time, however,
> > that information will be available in the TLB, where it can be accessed much, much faster.
> >
> > Unfortunately there are two conflicting design goals for TLBs - large size and speed. A large TLB
> > (or really, any cache) will necessarily be slow. The big advantage of large pages is that it can allow
> > the TLB to map much more storage with the same number of
> > entries. There have, of course, been implementation
> > details that mess things up some; for example, some implementations have used separate large and small
> > page TLBs, with the large page TLBs being very small - indiscriminately making lots of large pages
> > with a very small large page TLB will quickly cause that to thrash. So an OS needs to somewhat carefully
> > manage large pages. The other downside to large pages is that they are, in fact, units - the OS is
> > stuck dealing with the large page as a whole (IOW it cannot assign only a partial large page to an
> > address space), which may lead to wasted memory if the OS is not careful.
> >
> >
> >
> > *That's not enough to map the entire 64 bit address space, but extending that
> > to five or six levels would be straightforward enough, although there will
> > be additional issues to deal with when address spaces get that large.
> >
>
> Thanks for your informative post!
>
> - If an OS does not support super pages, does that mean it must use MANY 4KB entries
> for one large file? Also, must a superpage be used for one item? I don't quite understand
> what can be mapped under a 4 KB, 2 MB, 4 MB, and 1 GB page. Must every entry be no more
> than one file or a piece of file? Or can a 1 GB page contain multiple items?
When used for file mapping, you'd expect a page (of whatever size) to be used to map a single aligned page of a single file. Similarly, for general use, a 1GB page is equivalent to a complete set of 512 2MB pages or 262,144 4KB pages (again aligned to the larger page boundary). Using large pages means that you have to map the *entire* area of the large page. That's fine if you're actually using the whole area mapped by the page, but if you're not, using a number of smaller pages instead can significantly reduce the commitment of physical memory.

For example, I just looked at Outlook in Task Manager, and it was using ~87MB of virtual memory and ~70MB of physical memory. If the OS had attempted to map that address space with 1GB pages, several GB of physical memory (you'd need at least two pages - one marked NX for data, one for code) would be dedicated to Outlook (and unusable for any other process).

For the most part, the internal fragmentation caused by large pages significantly limits where they can be applied.
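To make the arithmetic concrete, here's a small sketch (my own, not from the original posts) of how a 32-bit virtual address splits up under non-PAE x86 paging, and what the internal fragmentation looks like for an ~87MB working set like the Outlook example above:

```python
# Sketch of non-PAE x86 address splitting and large-page fragmentation.
# The field widths below are the real non-PAE x86 layout; the 87MB
# working-set figure is just the Outlook example for illustration.

def split_va_4k(va):
    """4KB pages: 10-bit directory index, 10-bit table index, 12-bit offset."""
    return (va >> 22) & 0x3FF, (va >> 12) & 0x3FF, va & 0xFFF

def split_va_4m(va):
    """4MB page: one directory entry covers the whole region (22-bit
    offset), so no leaf page table is needed at all."""
    return (va >> 22) & 0x3FF, va & 0x3FFFFF

print(split_va_4k(0x12345678))   # (directory slot, table slot, byte offset)
print(split_va_4m(0x12345678))   # one fewer lookup level

# Internal fragmentation: pages are all-or-nothing units, so an ~87MB
# process wastes more memory as the page size grows.
used = 87 * 2**20
for size in (4 * 2**10, 2 * 2**20, 2**30):      # 4KB, 2MB, 1GB
    pages = -(-used // size)                    # ceiling division
    print(f"{size:>10} B pages: {pages:>6} needed, {pages * size - used} B wasted")
```

With 1GB pages, even a single page already commits nearly a gigabyte for ~87MB of actual data, which is exactly the fragmentation problem described above.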
> - So the TLB is not ONLY a look-up board, but also a cache for data on
> its own? Doesn't this kind of reduce the point of an L1 data cache?
The TLB is a cache, but only for translations (IOW data from the translation tables, often cooked a fair bit). General data does not go in there.
TLBs tend to be rather smaller, but faster, than L1Ds, and also tend not to be memory coherent (and thus need some explicit management by the OS).
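As a toy illustration (my own sketch, not anything from the thread): the TLB holds only virtual-to-physical translations, with the slow multi-level walk standing behind it on a miss. Here a dict stands in for both the hardware TLB and the page tables:

```python
# Toy TLB: caches virtual-page -> physical-page translations only.
# The actual data still lives in ordinary caches/memory; the dict
# "page_table" stands in for the multi-level table walk.

PAGE = 4096
page_table = {12345: 9876}          # virtual page number -> physical page number
tlb = {}                            # the small, fast translation cache
walks = 0                           # how many slow table walks we had to do

def translate(va):
    global walks
    vpn, offset = divmod(va, PAGE)
    if vpn not in tlb:              # TLB miss: walk the tables (slow)
        walks += 1
        tlb[vpn] = page_table[vpn]
    return tlb[vpn] * PAGE + offset # TLB hit: a single fast lookup

first = translate(12345 * PAGE + 40)    # miss: performs the walk
second = translate(12345 * PAGE + 80)   # hit: served from the TLB
print(first, second, walks)             # walks stays at 1
```

The point of the second access is that no walk happens: the translation is already cached, even though none of the data at those addresses is.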