By: , June 3, 2013 9:51 am
Room: Moderated Discussions
rwessel (robertwessel.delete@this.yahoo.com) on May 31, 2013 3:11 pm wrote:
> Obviously the OS needs to support larger pages in order to make use of them, but in most implementations,
> an OS not wanting to use them can just ignore them. These are often called superpages.
>
> x86 CPUs have supported 2MB pages with the 8 byte (PAE style) page tables for quite some
> time. 4MB pages existed along with a different set of address space extensions, known
> as PSE, and these are largely irrelevant now - most OSs, even in 32 bit mode, use the
> PAE format page tables because they provide access to the "NX" page access bit.
>
> On x86 (and many other systems, but certainly not all), page tables are basically arranged in a tree. Non-PAE
> page tables in x86 use a 4K "page directory" pointing to 1024 (4KB) page tables, each of which contains 1024
> four-byte page table entries, each mapping one virtual page to a physical page (or marking the mapping invalid). That
> 1024 x 1024 leads to the million page table entries needed to map the entire 32 bit address space.
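If I'm following the 1024 x 1024 structure correctly, the address split would look something like this (my own C sketch; the example address and variable names are made up):

    /* Splitting a 32-bit virtual address under the non-PAE format:
       10 bits of directory index, 10 bits of table index, 12 bits of offset. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t va = 0xC0103ABCu;               /* made-up example address */
        uint32_t pd_index = va >> 22;            /* top 10 bits: 1 of 1024 directory entries */
        uint32_t pt_index = (va >> 12) & 0x3FFu; /* next 10 bits: 1 of 1024 table entries */
        uint32_t offset   = va & 0xFFFu;         /* low 12 bits: byte within the 4KB page */
        printf("PD=%u PT=%u offset=0x%03X\n",
               (unsigned)pd_index, (unsigned)pt_index, (unsigned)offset);
        return 0;
    }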
>
> A separate copy of that tree exists for each address space
> (roughly synonymous with "process" in this context).
>
> An important factor is that not all 1024 entries in the page directory need be populated (for example, if
> an entire aligned 4MB region is unmapped, that can be handled by marking the entry in the page directory invalid,
> which avoids the need to provide an entire 4KB page table with nothing but invalid slots in it). This
> considerably compresses the size of the page table tree (otherwise each address space would need 4MB of translation
> table). Shared areas are created by having two page tables map the same physical page(s).
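So an invalid directory entry prunes a whole 4MB at once. A rough sketch of the walk as I understand it (names and the return-0-on-fault convention are mine; it also pretends physical addresses are directly addressable, which only works with an identity mapping):

    /* Two-level software walk; bit 0 is the x86 "present" bit. */
    #include <stdint.h>

    #define PTE_PRESENT 0x1u

    uint32_t translate(const uint32_t *page_dir, uint32_t va) {
        uint32_t pde = page_dir[va >> 22];
        if (!(pde & PTE_PRESENT))
            return 0;  /* whole aligned 4MB region unmapped: no page table exists */
        /* Simplification: treats the physical page table address as a usable pointer. */
        const uint32_t *page_table = (const uint32_t *)(uintptr_t)(pde & ~0xFFFu);
        uint32_t pte = page_table[(va >> 12) & 0x3FFu];
        if (!(pte & PTE_PRESENT))
            return 0;  /* individual 4KB page unmapped */
        return (pte & ~0xFFFu) | (va & 0xFFFu);  /* physical frame + page offset */
    }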
>
> With PAE, page table entries are lengthened to eight bytes each (which leaves room for more physical address
> bits, as well as other control bits, like the NX bit). But since the page tables remain 4KB each, each
> level of page table only holds 512 entries (rather than 1024 in the non-PAE case), so that a single page
> directory entry only maps a 2MB region of memory (since it points to a page table that now has only 512
> entries). So there are more levels of page table (three, if you use PAE in 32 bit mode, up to four* total
> in 64 bit mode). The third level of page table maps 1GB per entry, the fourth 512GB per entry.
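So in 64 bit mode the split becomes 9 bits per level plus the 12-bit offset, which (if I have it right) is where those 2MB/1GB/512GB figures come from. My sketch of the index extraction, with a made-up address:

    /* Four-level x86-64 index split: 9+9+9+9 index bits + 12 offset bits = 48 bits. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t va = 0x00007F1234567ABCull;  /* made-up example address */
        unsigned l4  = (va >> 39) & 0x1FF;    /* level 4: 512GB per entry */
        unsigned l3  = (va >> 30) & 0x1FF;    /* level 3: 1GB per entry   */
        unsigned l2  = (va >> 21) & 0x1FF;    /* level 2: 2MB per entry   */
        unsigned l1  = (va >> 12) & 0x1FF;    /* level 1: 4KB per entry   */
        unsigned off = va & 0xFFF;
        printf("%u/%u/%u/%u + 0x%03X\n", l4, l3, l2, l1, off);
        return 0;
    }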
>
> The first (and smaller) advantage of large pages is that you can map large amounts of storage without having
> to construct page tables down to the 4KB page level. For example, with 2MB pages, you can map the entire
> 4GB address space with four page directory pages (totaling 16KB, plus 32 bytes for the PDPTEs). Newer processors
> have allowed pages to be defined at the third page table level as well, hence the 1GB pages.
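The arithmetic checks out for me: 4GB / 2MB = 2048 directory entries, and at 512 entries per 4KB directory page that's 4 pages, i.e. the 16KB (plus 32 bytes for the four 8-byte PDPTEs):

    #include <stdio.h>

    int main(void) {
        unsigned long long space   = 4ull << 30;     /* 4GB address space */
        unsigned long long page    = 2ull << 20;     /* 2MB large pages   */
        unsigned long long entries = space / page;   /* 2048 PDEs         */
        unsigned long long pdpages = entries / 512;  /* 4 directory pages */
        printf("%llu PDEs, %llu directory pages, %lluKB\n",
               entries, pdpages, pdpages * 4);
        return 0;
    }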
>
> Logically, after the CPU generates the virtual address it walks the page tables until it finds the
> page table entry needed to determine the physical page assignment. In practice doing that would be
> horrible (even with two level page tables, you've now tripled the number of memory accesses needed
> for any memory reference). That's why we have TLBs. TLBs are special caches that remember frequently
> used translations. So if you accessed virtual page 12345, the first time the CPU would have to walk
> the page tables to find out that it was actually on physical page 9876. The second time, however,
> that information will be available in the TLB, where it can be accessed much, much faster.
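So conceptually the TLB sits in front of the walk. Something like this toy direct-mapped version, if I have the idea right (real TLBs are associative hardware and far more complex; the stub walk and all names here are my invention):

    /* Toy direct-mapped TLB caching virtual->physical page numbers. */
    #include <stdint.h>

    #define TLB_ENTRIES 64

    struct tlb_entry { uint64_t vpn; uint64_t pfn; int valid; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Stand-in for the slow multi-level walk described above. */
    static uint64_t walk_page_tables(uint64_t vpn) {
        return vpn + 1000;  /* fake mapping, just so the sketch runs */
    }

    uint64_t lookup(uint64_t vpn) {
        struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
        if (e->valid && e->vpn == vpn)
            return e->pfn;                     /* hit: no extra memory accesses */
        uint64_t pfn = walk_page_tables(vpn);  /* miss: walk, then remember     */
        e->vpn = vpn;
        e->pfn = pfn;
        e->valid = 1;
        return pfn;
    }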
>
> Unfortunately there are two conflicting design goals for TLBs - large size and speed. A large TLB
> (or really, any cache) will necessarily be slow. The big advantage of large pages is that they can allow
> the TLB to map much more storage with the same number of entries. There have, of course, been implementation
> details that mess things up some; for example, some implementations have used separate large and small
> page TLBs, with the large page TLBs being very small - indiscriminately making lots of large pages
> with a very small large page TLB will quickly cause that to thrash. So an OS needs to somewhat carefully
> manage large pages. The other downside to large pages is that they are, in fact, units - the OS is
> stuck dealing with the large page as a whole (IOW it cannot assign only a partial large page to an
> address space), which may lead to wasted memory if the OS is not careful.
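The "reach" difference seems dramatic: with a hypothetical 64-entry TLB, 4KB pages cover only 256KB while 2MB pages cover 128MB:

    #include <stdio.h>

    int main(void) {
        unsigned entries = 64;  /* hypothetical TLB entry count */
        printf("4KB pages: %u KB of reach\n", entries * 4);
        printf("2MB pages: %u MB of reach\n", entries * 2);
        return 0;
    }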
>
>
>
> *That's not enough to map the entire 64 bit address space, but extending that
> to five or six levels would be straightforward enough, although there will
> be additional issues to deal with when address spaces get that large.
>
Thanks for your informative post!
- If an OS does not support superpages, does that mean it must use MANY 4KB entries for one large file? Also, must a superpage be used for one item? I don't quite understand what can be mapped under a 4KB, 2MB, 4MB, or 1GB page. Must every entry hold no more than one file (or a piece of one file), or can a 1GB page contain multiple items?
- So the TLB is not ONLY a look-up table, but also a cache of data in its own right? Doesn't that kind of reduce the point of an L1 data cache?
Thank you for your answers!