By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), May 7, 2013 6:51 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on May 6, 2013 9:35 pm wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on May 6, 2013 9:27 pm wrote:
[snip]
>> There is, of course, the highly unlikely possibility that the huge page TLB entries are also used to cache
>> PDEs (as I have often suggested). If that was the case, the number of entries might be even larger and
>> still be reasonably useful. Since neither Intel nor AMD have ever done this (but instead used separate
>> caching structures for PDEs et al.), I am extremely skeptical that such flexibility is supported.
>
> What kind of caching structures do they use? Do you have any data about this?
I do not know whether any size information has been made public (discovering it with test programs would be difficult--e.g., one would need to prevent ordinary cache hits for page table cache blocks--and interest in such information is probably limited enough that I doubt it is publicly available). The basic structural information for Intel's mechanism is in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1, sections 4.10.3 and 4.10.4.
Intel's mechanism seems to use separate structures for caching PDEs (PML2), PDPTEs (PML3), and PML4 entries. Like a TLB entry, each entry contains permission information as well as a pointer to the page associated with that part of the page table. (For the highest level [PML4] the permissions are simply the value from the in-memory entry, but for lower-level entries some fields must be combined with those of the higher-level entries: ANDed for the Read/Write and User/Supervisor flags, ORed for the eXecute Disable flag.) Like TLBs, these structures are not snooped for changes in the backing store (i.e., main memory) but must be explicitly invalidated.
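The permission combining described above can be sketched in a few lines; this is an illustrative model (the field names are mine, not Intel's), assuming one dict per in-memory entry on the walk path, highest level first:

```python
# Hedged sketch: deriving a paging-structure cache entry's effective
# permissions from the entries of a walk. Read/Write and User/Supervisor
# are ANDed across levels (most restrictive wins); eXecute Disable is
# ORed (any level can forbid execution). Field names are illustrative.

def effective_permissions(entries):
    """entries: dicts for PML4E, PDPTE, PDE, ... (highest level first)."""
    return {
        "rw":   all(e["rw"] for e in entries),    # writable only if every level allows it
        "user": all(e["user"] for e in entries),  # user-accessible only if every level allows it
        "xd":   any(e["xd"] for e in entries),    # no-execute if any level sets XD
    }
```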
When there is a TLB miss, the hardware page table walker can look in the PDE (PML2) cache for the 2MiB region containing the virtual address. On a hit, only a single memory access (which might be a hit in the ordinary cache) is needed to fill the TLB. On a miss, the PDPTE (PML3) cache would be checked. On a hit there, one memory access would find the appropriate PDE (PML2)--presumably loading its information into the PDE cache--and the information from the PDE would satisfy the TLB miss. (Obviously, such structures could be probed in parallel to reduce latency. [Exploiting a desire for such parallel access, one might use dictionary-based compression. E.g., a PDE cache entry might replace the most significant bits of its tag with a pointer into the PDPTE cache. With 128 PDPTE cache entries, this could save more than 20 tag bits per PDE cache entry [the 12 PCID bits + the 18 virtual address bits (47:30) shared with the PDPTE tag - the 7 bits of the PDPTE cache entry number = 23 bits]; but such would require inclusion and increase management complexity.])
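The lookup order described above can be sketched as a toy walker; this is a minimal model assuming dict-based caches keyed by region number, memory keyed by (table base, index) pairs, and illustrative names throughout (the full four-level fallback walk is elided):

```python
# Hedged sketch: on a TLB miss, probe the PDE (PML2) cache first, then the
# PDPTE (PML3) cache, falling back toward a fuller walk. Each memory[...]
# access stands in for one memory reference by the hardware walker.

def walk(vaddr, pde_cache, pdpte_cache, memory):
    pde_tag = vaddr >> 21      # 2MiB region containing vaddr
    pdpte_tag = vaddr >> 30    # 1GiB region containing vaddr
    if pde_tag in pde_cache:
        # PDE cache hit: one access fetches the PTE itself.
        pt_base = pde_cache[pde_tag]
        return memory[(pt_base, (vaddr >> 12) & 0x1FF)]
    if pdpte_tag in pdpte_cache:
        # PDPTE cache hit: one access fetches the PDE (cached for next
        # time), then one more fetches the PTE.
        pd_base = pdpte_cache[pdpte_tag]
        pt_base = memory[(pd_base, (vaddr >> 21) & 0x1FF)]
        pde_cache[pde_tag] = pt_base
        return memory[(pt_base, (vaddr >> 12) & 0x1FF)]
    # Both caches missed: full walk from CR3 (elided in this sketch).
    return None
```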
Thomas W. Barr et al.'s "Translation Caching: Skip, Don't Walk (the Page Table)" provides an overview of AMD's mechanism (which may have changed since), as well as Intel's. AMD's mechanism simply provided a separate data cache for paging-structure information; i.e., the information is accessed by physical memory address and the page table must still be fully walked as normal. (Effectively this "only" avoids conflicts with use of the ordinary cache and gives the hardware page table walker fast [and conflict-free] access to such storage.)
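The AMD-style scheme, as described, amounts to a small physically addressed cache dedicated to page-table data; a minimal sketch under that assumption (class and field names are mine, and the FIFO eviction is a placeholder for whatever policy the hardware actually uses):

```python
# Hedged sketch: a dedicated, physically addressed page walk cache. The
# walker still performs every level of the walk, but each reference may
# hit this small structure instead of contending for the ordinary caches.

class PageWalkCache:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.lines = {}            # physical address -> page-table entry

    def read(self, paddr, memory):
        if paddr in self.lines:    # fast, conflict-free hit
            return self.lines[paddr]
        value = memory[paddr]      # miss: fetch from the memory hierarchy
        if len(self.lines) >= self.capacity:
            # Crude FIFO eviction (dicts preserve insertion order).
            self.lines.pop(next(iter(self.lines)))
        self.lines[paddr] = value
        return value
```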