By: rwessel (robertwessel.delete@this.yahoo.com), May 31, 2013 3:11 pm
Room: Moderated Discussions
Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on May 31, 2013 2:26 pm wrote:
> rwessel (robertwessel.delete@this.yahoo.com) on May 31, 2013 10:59 am wrote:
> > Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on May 31, 2013 8:01 am wrote:
> > > Klimax (danklima.delete@this.gmail.com) on May 31, 2013 7:15 am wrote:
> > > > Simple example: Due to bug (like unchecked end of array) it will write past allocated range. (Or just
> > > > randomly all over memory) And in process it will destroy whatever it can including kernel structure.
> > > > For crazy reading about crazy programs I recommend http://ptgmedia.pearsoncmg.com/images/9780321440303/samplechapter/Chen_bonus_ch02.pdf
> > > > Bonus chapter for book "The Old New Thing" by Raymond Chen.
> > > >
> > > > As for contiguous memory - arrays. (Enables simple and fast way to access
> > > > any element while knowing only starting address and size of structure)
> > >
> > > Thank you for your reply;
> > >
> > > Now I understand how virtual addresses can help arrays through contiguous addresses. I
> > > suppose where the physical addresses is the DTLB's problem; but the ALU math is probably
> > > vastly simplified by having contiguous virtual memory to calculate like (array+2) instead
> > > of (YMM15*14/2^3), and there are probably other benefits that I'm not even seeing.
> > >
> > > However, one question. Say an array has an error and just continues writing more and more and more
> > > elements to the array, filling memory. How would having contiguous virtual addresses prevent this
> > > from happening? I can see how it'd make it much easier to "clear" the memory of these errored entries,
> > > but is there a mechanism in place to detect if this error array is happening to stop it?
> >
> >
> > By itself, virtual address translation won't stop it, but eventually the bad application
> > will attempt to write to an address that is not valid in that process's (virtual) address
> > space, and the CPU will generate an exception, which the OS will catch, and then (usually)
> > terminate the offending process. When the offending process is terminated, all of its resources,
> > including memory allocations, are released by the OS, which tracks such things.
> >
> > Note that on most processors virtual addresses and physical address are very similar, occupying a flat
> > series of consecutive addresses. For example, on most 32 bit processors (supporting VM), addresses, both
> > virtual and physical, range from 0 to 4,294,967,295 (4GiB). Those address spaces are commonly divided
> > into 1,048,576 consecutive pages of 4096 bytes (4KiB). Many processors can turn off address translation,
> > and any running code would then just access physical addresses directly. Address translation basically sets
> > up a table to translate those million virtual page numbers for a process to physical page numbers. With
> > address translation on, a process wanting (say) 12KB of memory might get that assigned to virtual pages
> > 100, 101 and 102, and thus would access those as 12K consecutive addresses starting at (virtual) address
> > 409,600. Those three virtual pages might be mapped to physical pages 1000, 9876, and 123467, at the OS's
> > whim. Absent a specific request to share them, those three physical pages would *not* be mapped into any
> > other process's virtual address space, and so would only
> > be accessible to the one process. And while actual
> > virtual memory is less relevant these days, the OS might need more physical pages for one process and it
> > might page one of those virtual pages to disk, freeing up the associated physical page. When the process
> > tries to access that paged-out page, the OS will catch the associated exception, read the needed physical
> > page back into memory, fix up the translation, and let the application continue.
> >
> > In some situations, guard pages are generated adjacent to certain areas of (virtual) memory,
> > which allows running off the end of arrays to be caught fairly quickly. It's a limited solution
> > at best, since pages are too large to waste on small objects in most cases.
> >
> > But having a separate virtual address space prevents the errant process from trashing the
> > memory assigned to *other* processes. So while the buggy word processor dies a horrible
> > death, taking your document with it, the spreadsheet you were editing at the same time continues
> > running as if nothing had happened. Of course some processes cooperate and communicate with
> > each other, and will often react badly to one of the partners suddenly dying.
> >
>
> Ah, that all makes sense! So the OS can detect when things go
> rogue and kills off buggy processes! That's quite logical!
>
> Everything you said makes sense, and I definitely am learning
> quite a bit from it. But I have one outstanding question;
>
> You said that the OS usually uses 4KB pages, right? Well, how come Haswell has support for 4MB and
> 1GB pages? What benefit do these extra page sizes grant if the OS can only assign 4KB pages?
>
> Thank you for your very informative reply!
Obviously the OS needs to support larger pages in order to make use of them, but in most implementations an OS not wanting to use them can just ignore them. These are often called superpages.
x86 CPUs have supported 2MB pages with the 8 byte (PAE style) page tables for quite some time. 4MB pages existed along with a different set of address space extensions, known as PSE, but these are largely irrelevant now - most OSs, even in 32 bit mode, use the PAE format page tables because they provide access to the "NX" page protection bit.
On x86 (and many other systems, but certainly not all), page tables are basically arranged in a tree. Non-PAE page tables in x86 use a 4KB "page directory" pointing to up to 1024 (4KB) page tables, each of which contains 1024 four-byte page table entries; each entry maps one virtual page to a physical page (or says the mapping is invalid). That 1024 x 1024 leads to the million page table entries needed to map the entire 32 bit address space.
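As a sketch of that two-level lookup (the field widths are the standard non-PAE x86 ones), translation just slices the 32 bit virtual address into a directory index, a table index, and a byte offset:

```python
# Non-PAE x86: 10-bit page directory index, 10-bit page table
# index, 12-bit offset within the 4KB page.
def split_va(va):
    pde_index = (va >> 22) & 0x3FF   # selects one of 1024 page tables
    pte_index = (va >> 12) & 0x3FF   # selects one of 1024 entries in it
    offset    = va & 0xFFF           # byte within the 4KB page
    return pde_index, pte_index, offset

# Virtual address 409,600 is the start of virtual page 100,
# matching the 12KB example earlier in the thread.
print(split_va(409_600))  # (0, 100, 0)
```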
A separate copy of that tree exists for each address space (roughly synonymous with "process" in this context).
An important factor is that not all 1024 entries in the page directory need be populated - for example, if an entire aligned 4MB region is unmapped, that can be handled by marking the entry in the page directory invalid, which avoids the need to provide an entire 4KB page table with nothing but invalid slots in it. This considerably compresses the size of the page table tree (otherwise each address space would need 4MB of translation tables). Shared areas are created by having the page tables of two address spaces map the same physical page(s).
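A toy way to see that compression (Python dictionaries standing in for the hardware tables - this is a model, not how any real MMU stores them): a page table only gets allocated for a 4MB region once something in that region is actually mapped.

```python
# Sparse two-level page table: a dict of page-directory slots, each
# holding a dict of page-table entries. Unmapped 4MB regions cost
# nothing, since their directory slot is simply absent.
class AddressSpace:
    def __init__(self):
        self.directory = {}  # pde_index -> {pte_index: physical_page}

    def map_page(self, virtual_page, physical_page):
        pde, pte = virtual_page >> 10, virtual_page & 0x3FF
        self.directory.setdefault(pde, {})[pte] = physical_page

    def translate(self, virtual_page):
        pde, pte = virtual_page >> 10, virtual_page & 0x3FF
        table = self.directory.get(pde)
        if table is None or pte not in table:
            raise MemoryError("page fault")  # invalid mapping
        return table[pte]

# The 12KB example from earlier: virtual pages 100-102 mapped to
# scattered physical pages.
aspace = AddressSpace()
for vp, pp in [(100, 1000), (101, 9876), (102, 123467)]:
    aspace.map_page(vp, pp)
print(aspace.translate(101))   # 9876
print(len(aspace.directory))   # 1 -- only one of 1024 slots populated
```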
With PAE, page table entries are lengthened to eight bytes each (which leaves room for more physical address bits, as well as other control bits, like the NX bit). But since the page tables remain 4KB each, each level of page table holds only 512 entries (rather than 1024 in the non-PAE case), so that a single page directory entry only maps a 2MB region of memory (since it points to a page table that now has only 512 entries). So there are more levels of page table (three, if you use PAE in 32 bit mode, up to four* total in 64 bit mode). The third level of page table maps 1GB per entry, the fourth 512GB per entry.
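The arithmetic behind those per-entry sizes: with eight-byte entries, a 4KB table holds 512 of them, so each level up the tree multiplies the mapped region by 512, starting from the 4KB base page.

```python
PAGE = 4096                 # 4KB base page
ENTRIES = 4096 // 8         # 512 eight-byte entries per 4KB table

pte_maps   = PAGE                  # lowest level: one 4KB page per entry
pde_maps   = pte_maps * ENTRIES    # second level: 2MB per entry
pdpte_maps = pde_maps * ENTRIES    # third level: 1GB per entry
pml4e_maps = pdpte_maps * ENTRIES  # fourth level: 512GB per entry

print(pde_maps // 2**20, pdpte_maps // 2**30, pml4e_maps // 2**30)
# 2 1 512  (2MB, 1GB, 512GB)
```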
The first (and smaller) advantage of large pages is that you can map large amounts of storage without having to construct page tables down to the 4KB page level. For example, with 2MB pages, you can map the entire 4GB address space with four page directory pages (totaling 16KB, plus 32 bytes for the PDPTEs). Newer processors have allowed pages to be defined at the third page table level as well, hence the 1GB pages.
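Checking that "four page directory pages" claim is simple division: one PDE per 2MB page, 512 PDEs per 4KB directory page.

```python
GIB = 2**30
large_page = 2 * 2**20                  # 2MB page
pdes_needed = 4 * GIB // large_page     # one PDE per 2MB page
dir_pages = pdes_needed // 512          # 512 PDEs per 4KB directory

print(pdes_needed, dir_pages, dir_pages * 4096)  # 2048 4 16384
```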
Logically, after the CPU generates the virtual address it walks the page tables until it finds the page table entry needed to determine the physical page assignment. In practice doing that on every access would be horrible (even with two level page tables, you've now tripled the number of memory accesses needed for any memory reference). That's why we have TLBs. TLBs are special caches that remember frequently used translations. So if you accessed virtual page 12345, the first time the CPU would have to walk the page tables to find out that that was actually on physical page 9876. The second time, however, that information will be available in the TLB, where it can be accessed much, much faster.
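A minimal model of that behaviour (again a sketch, with a dict standing in for both the page tables and the TLB, and the walk counted so you can see it happen only once per page):

```python
walks = 0
page_table = {12345: 9876}   # virtual page -> physical page

tlb = {}                     # cache of recently used translations
def translate(virtual_page):
    global walks
    if virtual_page in tlb:           # TLB hit: fast path
        return tlb[virtual_page]
    walks += 1                        # TLB miss: walk the page tables
    physical = page_table[virtual_page]
    tlb[virtual_page] = physical      # remember the translation
    return physical

print(translate(12345), walks)  # 9876 1  (first access walks the tables)
print(translate(12345), walks)  # 9876 1  (second access hits the TLB)
```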
Unfortunately there are two conflicting design goals for TLBs - large size and speed. A large TLB (or really, any cache) will necessarily be slow. The big advantage of large pages is that they allow the TLB to map much more storage with the same number of entries. There have, of course, been implementation details that mess things up some - for example, some implementations have used separate large and small page TLBs, with the large page TLBs being very small. Indiscriminately creating lots of large pages with a very small large page TLB will quickly cause that TLB to thrash. So an OS needs to manage large pages somewhat carefully. The other downside to large pages is that they are, in fact, units - the OS is stuck dealing with a large page as a whole (IOW it cannot assign only part of a large page to an address space), which may lead to wasted memory if the OS is not careful.
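The "maps much more storage" point is just multiplication. Taking a hypothetical 64-entry data TLB (the entry count here is an illustrative assumption, not any particular chip's):

```python
entries = 64
reach_4k = entries * 4 * 2**10    # TLB reach with 4KB pages: 256 KB
reach_2m = entries * 2 * 2**20    # TLB reach with 2MB pages: 128 MB

print(reach_4k // 2**10, reach_2m // 2**20)  # 256 128
print(reach_2m // reach_4k)                  # 512x more reach
```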
*That's not enough to map the entire 64 bit address space, but extending that to five or six levels would be straightforward enough, although there will be additional issues to deal with when address spaces get that large.