By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), July 30, 2013 2:59 pm
Room: Moderated Discussions
Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on July 30, 2013 1:18 pm wrote:
[snip]
> So page tables are stored in memory and there are TLBs to manage those tables currently being used it seems.
Strictly speaking, the term "Translation Lookaside Buffer" refers only to the cache of translation and permission information. The term "Memory Management Unit" includes the TLB plus, in some cases, hardware to load TLB entries from the in-memory page table (or a software TLB)--this is called a hardware page table walker--and hardware to update "accessed" and "dirty" bits in the Page Table Entry in memory (again, not all architectures handle these tasks in hardware).
In modern high-performance processors, memory accesses by the hardware page table walker are cached, so such accesses do not necessarily have to read or write the actual main memory directly. This data may be cached in specialized structures (which Intel calls Paging-Structure Caches) and/or in the caches used by ordinary processor memory accesses. (Avoiding the use of L1 caches by the hardware page table walker reduces contention with the processing core for its more timing-critical L1 cache resources.)
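To make the walker's job concrete, here is a minimal sketch in Python of what a hardware page table walker does, modeled as software. The two-level layout, the 10-bit index fields, and the dictionary-based tables are all invented for illustration; real walkers read radix-tree nodes from (cached) memory.

```python
# Illustrative sketch of a hardware page table walk: a hypothetical
# two-level table with 4 KiB pages and 10-bit indices per level.
# All structures and field names here are invented for illustration.

PAGE_SHIFT = 12          # 4 KiB pages
INDEX_BITS = 10          # 10-bit index per table level

def walk(root, vaddr):
    """Translate a virtual address via a two-level table, or return None."""
    l1_index = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
    l2_index = (vaddr >> PAGE_SHIFT) & ((1 << INDEX_BITS) - 1)
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)

    l2_table = root.get(l1_index)          # first memory access of the walk
    if l2_table is None:
        return None                        # would raise a page fault
    pte = l2_table.get(l2_index)           # second memory access of the walk
    if pte is None or not pte["valid"]:
        return None
    pte["accessed"] = True                 # hardware sets the accessed bit
    return (pte["frame"] << PAGE_SHIFT) | offset

# A tiny table mapping one page: virtual page 0x00401 -> frame 0x0BEEF.
root = {0x1: {0x1: {"valid": True, "frame": 0x0BEEF, "accessed": False}}}
paddr = walk(root, 0x00401ABC)             # -> 0xBEEFABC
```

Note that each level of the walk is a separate memory access; this is why caching intermediate table nodes (as in Intel's Paging-Structure Caches) pays off.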
Software (the operating system or hypervisor) fills the page table with data. Typically software will also clear "accessed" bits occasionally to provide a measure of how recently (and frequently) a page of memory is used. (This information can be used to choose a good victim if a page needs to be swapped out.) Software may also clear "dirty" bits, e.g., after a page's contents have been written back to backing store.
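As a sketch of how clearing "accessed" bits supports victim selection, the snippet below implements a crude not-recently-used scan. It assumes a hypothetical flat list of PTE-like dictionaries where hardware would set "accessed" on each touch; real kernels use more elaborate schemes (clock hands, aging counters).

```python
# Crude "not recently used" victim selection. Assumes a hypothetical page
# list where hardware sets pte["accessed"] on each touch; the OS clears
# the bits on each scan, so a page still clear next time is a good victim.

def scan_and_pick_victim(pages):
    """Return a page whose accessed bit stayed clear since the last scan."""
    victim = None
    for pte in pages:
        if not pte["accessed"] and victim is None:
            victim = pte                   # not touched since the last scan
        pte["accessed"] = False            # clear for the next interval
    return victim

pages = [
    {"vpn": 0x10, "accessed": True},       # touched recently
    {"vpn": 0x20, "accessed": False},      # idle since last scan
    {"vpn": 0x30, "accessed": True},
]
victim = scan_and_pick_victim(pages)       # picks the page with vpn 0x20
```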
> Just to clarify, this explanation is only for the TLB for the main memory, right?
I am not certain what you mean by "the TLB for the main memory". The TLB is a cache of the page table(s). A page table provides translations (and permissions) for a virtual address space; caching this information reduces the cost of look-ups which would ordinarily occur for every memory access.
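The "TLB is a cache of the page table" idea can be shown with a toy model: look up the virtual page number in the TLB and, on a miss, consult the page table and fill the TLB. The dictionary-based structures and unbounded TLB size are simplifications for illustration.

```python
# Minimal model of a TLB as a cache in front of the page table.
# Sizes and structures are illustrative, not any real design.

PAGE_SHIFT = 12

def translate(tlb, page_table, vaddr, stats):
    """Translate vaddr, filling the TLB from the page table on a miss."""
    vpn = vaddr >> PAGE_SHIFT
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    if vpn in tlb:
        stats["hits"] += 1
    else:
        stats["misses"] += 1
        tlb[vpn] = page_table[vpn]         # the (expensive) table walk
    return (tlb[vpn] << PAGE_SHIFT) | offset

page_table = {0x5: 0xAA, 0x6: 0xBB}        # vpn -> physical frame
tlb, stats = {}, {"hits": 0, "misses": 0}
p1 = translate(tlb, page_table, 0x5123, stats)   # miss: fills the TLB
p2 = translate(tlb, page_table, 0x5456, stats)   # hit: same virtual page
```

Because accesses cluster within pages, even a small TLB converts most of the per-access look-ups into cheap hits.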
> The L1DTLB or L3 TLB have no knowledge of whats going on in main memory, correct?
TLBs are traditionally not coherent (or, more accurately, coherence is managed by software). This means that if a page table entry is changed by software (or by a memory management unit updating its "accessed" or "dirty" bit), the system's TLBs can contain stale data until the old entry is either invalidated by software or is naturally evicted from the TLB by other PTEs being loaded into the TLB. (In the case of "accessed" and "dirty" bit updating, this is not a major problem because the hardware only updates in one direction--so there is no way to inconsistently order the actions of different MMUs--and other than updating these bits the hardware does not act on this information--so at worst a few extraneous updates might occur.)
(This non-coherence is becoming more of a concern with the increasing prevalence of multiprocessor systems. Forcing every processor in the system to take an interrupt to run a software routine to invalidate a (possibly not present) TLB entry introduces more overhead as the number of processors increases.)
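The staleness problem above can be demonstrated with the same kind of toy model: after software changes a page table entry, a TLB that cached the old entry keeps returning the stale translation until software explicitly invalidates it. Everything here is a purely illustrative model, not any real interface.

```python
# TLB non-coherence in miniature: a remapped page table entry is not
# visible through the TLB until the cached entry is invalidated.

page_table = {0x5: 0xAA}                   # vpn -> physical frame
tlb = {}

def lookup(vpn):
    """Return the frame for vpn, filling the TLB on a miss."""
    if vpn not in tlb:
        tlb[vpn] = page_table[vpn]         # fill from the page table
    return tlb[vpn]

lookup(0x5)                                # caches frame 0xAA
page_table[0x5] = 0xCC                     # software remaps the page...
stale = lookup(0x5)                        # ...but the TLB still says 0xAA
del tlb[0x5]                               # software invalidates the entry
fresh = lookup(0x5)                        # now the new frame is visible
```

The `del tlb[0x5]` step is what a TLB shootdown accomplishes across every core's TLB, which is why its cost grows with the processor count.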
> Also, if the TLB is on-die; where is it? Is it integrated into the IMC or another location?
In a typical processor, a TLB access is needed for every memory access to provide translation and permission information (caches are typically tagged based on physical memory addresses rather than virtual addresses, so translation information would be needed before a cache hit or miss could be determined). This means that the L1 TLB tends to be tightly coupled to the L1 caches. (In some cases, a very small TLB--usually called a microTLB--for instruction pages is provided with a single L1 TLB for both instruction and data accesses. Instruction accesses have very high locality of reference, so even a two-entry microTLB can greatly reduce access contention for a shared L1 TLB.)
L2 TLBs are often less tightly connected to the processing core (Itanium implementations being something of an exception; these access the L2 TLBs for every store and for all floating-point memory operations). A more typical L2 TLB is only accessed on a miss in the L1 TLB, so its connection to the core is more indirect. (Note that a processor could be designed to use L2 TLBs to provide translations for prefetch engines.)
There are many variations on how (primarily non-L1) TLB resources can be shared. Some implementations provide separate instruction and data L2 TLBs while others use a unified L2 TLB. TLB sharing across multiple cores has been proposed. (This is very similar to the sharing considerations for ordinary caches. Sharing reduces the commonness of underutilized resources but increases contention for those resources and reduces optimization opportunities from tighter binding or specialized use.)
(Unlike ordinary caches, TLBs also have issues of different page sizes. This introduces another area where separate vs. shared trade-offs must be considered. For multi-level page tables and linear page tables, the caching of page table node entries is another concern that ordinary caches do not need to consider with respect to sharing vs. specializing.)
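The page-size issue can be made concrete: because the virtual-page-number field depends on the page size, a lookup must effectively be tried once per supported size (or against a per-size structure). The sizes, per-size tables, and entries below are invented for illustration.

```python
# Why multiple page sizes complicate TLBs: the virtual-page-number bits
# differ per size, so each supported size needs its own probe (or its
# own structure). Sizes and mappings here are invented for illustration.

PAGE_SIZES = [12, 21]                      # 4 KiB and 2 MiB page shifts

def lookup(tlbs, vaddr):
    """Probe a hypothetical per-size TLB for each supported page size."""
    for shift in PAGE_SIZES:
        vpn = vaddr >> shift
        frame = tlbs[shift].get(vpn)
        if frame is not None:
            return (frame << shift) | (vaddr & ((1 << shift) - 1))
    return None                            # miss in every size: walk needed

tlbs = {
    12: {0x00403: 0x111},                  # a 4 KiB mapping
    21: {0x004: 0x00A},                    # a 2 MiB mapping
}
small = lookup(tlbs, 0x00403ABC)           # hits the 4 KiB entry
big = lookup(tlbs, 0x00812345)             # misses 4 KiB, hits 2 MiB
```

A set-associative TLB has the further problem that the page size determines which address bits index the set, so either sizes get separate structures or the lookup probes each size in turn, as above.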
This subject can get very complex, but the basic principles are fairly accessible (if the presenter does not confuse the reader with digressions and extraneous detail!).