By: rwessel (robertwessel.delete@this.yahoo.com), July 30, 2013 7:01 pm
Room: Moderated Discussions
Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on July 30, 2013 5:27 pm wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on July 30, 2013 3:59 pm wrote:
> > Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on July 30, 2013 1:18 pm wrote:
> > [snip]
> > > So page tables are stored in memory and there are TLBs to manage those tables currently being used it seems.
> >
> > The term "Translation Lookaside Buffer" is used in a more technical sense as only referring to the
> > cache of translation and permission information with the term "Memory Management Unit" including
> > the TLB and (in some cases) hardware to load TLB entries from the in-memory page table (or a software
> > TLB)--this is called a hardware page table walker--and to update "accessed" and "dirty" bits in
> > the Page Table Entry in memory (again not all architectures handle such in hardware).
> >
> > In modern high performance processors, memory accesses by the hardware page table walker are cached,
> > so such accesses do not necessarily have to read or write the actual main memory directly. This data
> > may be cached in specialized structures (which Intel calls Paging-Structure Caches) and/or in the caches
> > used by processor memory accesses. (Avoiding the use of L1 caches by the hardware page table walker
> > reduces contention with the processing core for its more timing critical L1 cache resources.)
> >
> > Software (the operating system or hypervisor) fills the page table with data. Typically
> > software will also clear "accessed" bits occasionally to provide a measure of how recently
> > (and frequently) a page of memory is used. (This information can be used to choose a good
> > victim if a page needs to be swapped out.) Software may also clear "dirty" bits if
> >
> > > Just to clarify, this explanation is only for the TLB for the main memory, right?
> >
> > I am not certain what you mean by "the TLB for the main memory". The TLB is a cache of the page table(s).
> > A page table provides translations (and permissions) for a virtual address space; caching this information
> > reduces the cost of look-ups which would ordinarily occur for every memory access.
> >
> > > The L1DTLB or L3 TLB have no knowledge of whats going on in main memory, correct?
> >
> > TLBs are traditionally not coherent (or, more accurately, coherence is managed by software). This means
> > that if a page table entry is changed by software (or by a memory management unit updating its "accessed"
> > or "dirty" bit), the system's TLBs can contain stale data
> > until the old entry is either invalidated by software
> > or is naturally evicted from the TLB by other PTEs being
> > loaded into the TLB. (In the case of "accessed" and
> > "dirty" bit updating, this is not a major problem because
> > the hardware only updates in one direction--so there
> > is no way to inconsistently order the actions of different
> > MMUs--and other than updating these bits the hardware
> > does not act on this information--so at worst a few extraneous updates might occur.)
> >
> > (This non-coherence is becoming more of a concern with the increasing commonness of multiprocessor systems.
> > Forcing every processor in the system to take an interrupt
> > to run a software routine to invalidate a (possibly
> > not present) TLB entry introduces more overhead as the number of processors increases.)
> >
> > > Also, if the TLB is on-die; where is it? Is it integrated into the IMC or another location?
> >
> > In a typical processor, a TLB access is needed for every memory access to provide translation and
> > permission information (caches are typically tagged based on physical memory addresses rather than
> > virtual addresses, so translation information would be needed before a cache hit or miss could be
> > determined). This means that the L1 TLB tends to be tightly coupled to the L1 caches. (In some cases,
> > a very small TLB--usually called a microTLB--for instruction pages is provided with a single L1 TLB
> > for both instruction and data accesses. Instruction accesses have very high locality of reference,
> > so even a two-entry microTLB can greatly reduce access contention for a shared L1 TLB.)
> >
> > L2 TLBs are often less tightly connected to the processing core (Itanium implementations being something of
> > an exception; these access the L2 TLBs for every store and
> > for all floating-point memory operations). A more
> > typical L2 TLB is only accessed on a miss in the L1 TLB,
> > so its connection to the core is more indirect. (Note
> > that a processor could be designed to use L2 TLBs to provide translations for prefetch engines.)
> >
> > There are many variations on how (primarily non-L1) TLB resources can be shared. Some implementations
> > provide separate instruction and data L2 TLBs while others use a unified L2 TLB. TLB sharing across
> > multiple cores has been proposed. (This is very similar to the sharing considerations for ordinary
> > caches. Sharing reduces the commonness of underutilized resources but increases contention for those
> > resources and reduces optimization opportunities from tighter binding or specialized use.)
> >
> > (Unlike ordinary caches, TLBs also have issues of different page sizes. This introduces another
> > area where separate vs. shared trade-offs must be considered. For multi-level page tables
> > and linear page tables, the caching of page table node entries is another concern that ordinary
> > caches do not need to consider with respect to sharing vs. specializing.)
> >
> > This subject can get very complex, but the basic principles are fairly accessible (if
> > the presenter does not confuse the reader with digressions and extraneous detail!).
>
> Thanks again for your answer! I find it funny how this is basic to you; definitely impressive
> as I'm trying very hard to wrap my head around it, and the more I uncover, it seems like I'm
> taking one step forward and two steps back. There's just so much to learn. Thanks again.
>
> So accessing, reading, and writing to the TLB is done by the table walker, correct?
> The store/load units of the core itself do not interact directly with the TLB (in this
> context, meaning, the table of translations and permissions) itself, correct?
>
> By "TLB for the main memory", I am referring to the TLB for the RAM. I refer to RAM by the
> common term "memory" and to the caches as "caches." Sorry for the confusion. Though what I
> meant by the question was: do multiple page tables only apply to the main memory (RAM) due to
> the sheer size of it; or do all caches (L1,L2,L3,etc.) have multiple page tables? I would
> think that either L1 or L2 cache would be too small to make use of multiple page tables.
>
> By the "The L1DTLB or L3 TLB have no knowledge of whats going on in main memory, correct?"
> question, I meant; the L1DTLB or L1DTLB page walker will never have a situation where
> either of those two units themselves will access the main memory (RAM), correct?
>
> I dont believe I understood the answer to my last question. By "location", I meant
> that if the L1DTLB is geographically close to the L1 Dcache, and the L2 TLB is geographically
> close to the L2 cache, and since the TLB for the main memory (RAM) CANNOT be geographically
> close to the RAM as it has to be on-die; where is it located?
>
> Thank you again for your informative answers!
You're making this too complicated. In the generic form, there's a single TLB for a CPU, and it's logically located on the path between the load/store units and the *entire* memory subsystem (caches, and RAM). The CPU, in the process of executing instructions, generates an address for accessing memory. That address is logically looked up in the page tables, and is then used to check the cache(s) and/or access memory as needed. The TLB is a cache of translations which speeds up the process of producing that translation.
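To make that concrete, here's a minimal C sketch of what the hardware does conceptually - the page size, TLB size, replacement choice, and the page_table_lookup() helper are illustrative assumptions, not any particular processor's design:

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT  12                       /* 4 KiB pages (assumed) */
#define TLB_ENTRIES 64                       /* tiny TLB (assumed) */

struct tlb_entry { uint64_t vpn, pfn; bool valid; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Stand-in for the slow path: walk the in-memory page tables. */
extern uint64_t page_table_lookup(uint64_t vpn);

uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1ULL << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)    /* hit: skip the page tables */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].pfn << PAGE_SHIFT) | offset;

    uint64_t pfn = page_table_lookup(vpn);   /* miss: do the full translation */
    tlb[vpn % TLB_ENTRIES] = (struct tlb_entry){ vpn, pfn, true };  /* cache it */
    return (pfn << PAGE_SHIFT) | offset;
}

Every load, store, and instruction fetch conceptually goes through something like translate() before (or in parallel with) the cache lookup.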
The most common design has only a single TLB for the core, and it's very tightly integrated with the load/store units. Some systems have a second level of TLB, to increase the amount of TLB data that can be cached, but that's just an "L2 TLB" and is not in any way associated with (say) the L2 cache. Some processors have multiple TLBs because of different access paths for memory items (for example, you might have one for instruction fetches, one for integer memory accesses, and one for FP memory accesses if those nominally go to a different cache level). Those can be combined in various ways too - for example a core might have a TLB on the L1I, and a TLB on the L1D, and then have those two TLBs share access to a single L2-TLB. Or there could be two L2-TLBs.
Performance considerations often dictate that the TLB is accessed in parallel with the L1 cache (which leads to the common restriction that the L1 cache size not exceed the associativity times the page size).
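As a worked example (the numbers below are just the common 4 KiB page / 8-way case, not a rule about any specific chip):

#include <stdio.h>

int main(void)
{
    unsigned page_size = 4096;   /* 4 KiB pages (assumed) */
    unsigned ways      = 8;      /* 8-way set associative (assumed) */

    /* With a virtually-indexed, physically-tagged L1, the index bits must
     * fit inside the page offset, so the cache can only grow by adding ways:
     * max size = ways * page size. */
    printf("max L1 size = %u KiB\n", ways * page_size / 1024);   /* prints 32 */
    return 0;
}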
Other designs actually use virtual addresses in some levels of the cache hierarchy, which allows some cache lookups to proceed without needing to generate a translation. That complicates things a bit (namely page table invalidations and page sharing get more complicated), but doesn't really change the nature of the TLB.
But you don't generally need TLBs past the innermost cache levels, because the further out caches tend to use physical addresses (as does RAM), and once you've done the translation (which you'd normally have to do before you can finish the L1 cache lookup), you've got, by definition, the physical address, so you just keep using that. So there's no TLB associated with the L3 cache or with main memory, rather the TLBs are associated with the address generation process, and have to exist in the path between the generation of virtual addresses, and the top of the memory hierarchy that each address generator accesses.
How TLBs are managed, and seen by the OS, varies widely. In some systems, there's a defined format of the page tables, and when the required translation is not in the TLB, the hardware walks the page tables looking for the correct entry (which is then placed in the TLB). Other systems delegate much of the translation process to software, and the hardware provides some mechanism to allow the software to place a translation into the TLB. Various combinations and modifications of those schemes exist. Exactly how visible the actual implementation of the TLB is also varies widely. On some systems (x86, for example), the visible model of the TLB is fairly simple, and the OS can mostly* ignore the internal details (IOW, the OS doesn't really care if there are multiple TLBs or multiple levels of TLBs). On x86, an OS that correctly managed the TLB on a 386 has an excellent chance of working on the latest core from Intel or AMD. On other systems (IPF, in some modes, for example), the OS may need to know about each separate TLB, and details about each TLB (number of entries, etc.), and changes in those structures may require modifications to the OS's TLB handling. And even if there are multiple TLBs, there's often hidden structure underneath the architecturally visible ones.
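For the hardware-walked case, the walk itself is just a radix-tree descent through the page tables. A rough sketch using an x86-64-like four-level layout; phys_read64() and the constants are illustrative placeholders, and large pages are ignored:

#include <stdint.h>

#define PTE_PRESENT 0x1ULL
#define ADDR_MASK   0x000FFFFFFFFFF000ULL    /* physical address bits in a PTE */

extern uint64_t phys_read64(uint64_t paddr); /* read 8 bytes of physical memory */

/* Returns the physical frame address for vaddr, or 0 if it is not mapped. */
uint64_t walk(uint64_t root, uint64_t vaddr)
{
    uint64_t table = root;
    for (int level = 3; level >= 0; level--) {
        unsigned index = (vaddr >> (12 + 9 * level)) & 0x1FF;  /* 9 bits/level */
        uint64_t pte   = phys_read64(table + index * 8);
        if (!(pte & PTE_PRESENT))
            return 0;                        /* not mapped: raise a page fault */
        table = pte & ADDR_MASK;             /* next level, or the final frame */
    }
    return table;                            /* physical frame of the 4 KiB page */
}

In the software-managed case, roughly the same logic runs in a TLB-miss exception handler, and the result is inserted into the TLB with a special instruction instead of by the hardware.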
In a sense, you can think about the TLBs imposing two requirements on the rest of the system. First, entries need to get put into the TLB in order to allow memory accesses to happen (note that many systems that support address translation allow address translation to be turned off, which usually eliminates the need for the TLBs to be doing anything - but we're ignoring that mode here). As mentioned above, how entries get put in varies quite a bit, from the hardware automatically walking the page tables, to the OS handling an exception for a missing translation and modifying a TLB entry itself. Second, and in many ways much more interesting, is the utterly critical process of getting stale entries *out* of the TLB, especially across multiple cores sharing memory - doing that efficiently can get very complicated.
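That multiprocessor invalidation problem is usually handled with some variant of a "TLB shootdown". A very rough sketch of the idea; smp_processor_id(), send_ipi(), and invalidate_local_tlb() are placeholders, and real kernels batch and optimize this heavily:

#include <stdatomic.h>

#define NCPUS 8                                  /* assumed core count */

extern int  smp_processor_id(void);              /* which core am I? (placeholder) */
extern void send_ipi(int cpu);                   /* interrupt another core (placeholder) */
extern void invalidate_local_tlb(void *vaddr);   /* single-page invalidate (placeholder) */

static atomic_int pending;                       /* cores that still have to ack */
static void *shootdown_vaddr;

/* Initiator: change the PTE in memory first, then make every core drop
 * any stale copy it may have cached. */
void shootdown(void *vaddr)
{
    int me = smp_processor_id();
    shootdown_vaddr = vaddr;
    atomic_store(&pending, NCPUS - 1);
    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (cpu != me)
            send_ipi(cpu);                       /* ask every other core */
    invalidate_local_tlb(vaddr);                 /* drop our own copy */
    while (atomic_load(&pending) > 0)
        ;                                        /* spin until everyone acks */
}

/* Each remote core runs this in its interrupt handler. */
void shootdown_ipi_handler(void)
{
    invalidate_local_tlb(shootdown_vaddr);
    atomic_fetch_sub(&pending, 1);               /* acknowledge */
}

The cost of interrupting every core like this is exactly the scaling concern mentioned in the quoted text above.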
Many schemes have been proposed, and implemented, to get all of that done.
Things like multiple levels of address translation (as introduced by virtualization) also complicate things.
Wait, did I say *you* were making things too complicated? ;-)
*There may well be performance implications for ignoring the internals. For example, some systems have different capacities (or even different TLBs) for normal sized pages and large pages. On a system whose TLBs can only cache a small number of large page entries, over-aggressively using large pages can result in a substantial performance hit as the large page TLB thrashes. Similarly, understanding the TLB purge process may allow an OS to do substantially less work than required by the generic architecture, or to optimize the process considerably. For example, there is often an instruction to purge TLB entries associated with a particular address (in addition to an instruction to purge the entire TLB), although those may purge more entries than strictly required. Depending on how big the TLB is, and how the specific TLB-entry purge instruction works (performance, actual specificity), there's some crossover point where it's faster to just purge the entire TLB (which is usually pretty quick, although you pay later as everything reloads) than to purge a bunch of individual entries.
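To illustrate that crossover decision (the cost numbers are made-up placeholders; a real OS would measure them or follow vendor guidance rather than hard-code them):

extern void flush_entire_tlb(void);              /* placeholder */
extern void flush_one_page(void *vaddr);         /* placeholder */

#define FULL_FLUSH_COST 100u   /* assumed: cost of a full flush plus later refills */
#define PER_PAGE_COST    30u   /* assumed: cost of one single-entry purge */

void flush_range(void **pages, unsigned count)
{
    if (count * PER_PAGE_COST > FULL_FLUSH_COST) {
        flush_entire_tlb();                      /* cheaper past the crossover */
    } else {
        for (unsigned i = 0; i < count; i++)
            flush_one_page(pages[i]);            /* targeted purges below it */
    }
}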