By: , July 31, 2013 1:15 pm
Room: Moderated Discussions
rwessel (robertwessel.delete@this.yahoo.com) on July 30, 2013 7:01 pm wrote:
> Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on July 30, 2013 5:27 pm wrote:
> > Paul A. Clayton (paaronclayton.delete@this.gmail.com) on July 30, 2013 3:59 pm wrote:
> > > Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on July 30, 2013 1:18 pm wrote:
> > > [snip]
> > > > So page tables are stored in memory and there are TLBs to manage those tables currently being used it seems.
> > >
> > > The term "Translation Lookaside Buffer" is used in a more technical sense as only referring to the
> > > cache of translation and permission information with the term "Memory Management Unit" including
> > > the TLB and (in some cases) hardware to load TLB entries from the in-memory page table (or a software
> > > TLB)--this is called a hardware page table walker--and to update "accessed" and "dirty" bits in
> > > the Page Table Entry in memory (again not all architectures handle such in hardware).
> > >
> > > In modern high performance processors, memory accesses by the hardware page table walker are cached,
> > > so such accesses do not necessarily have to read or write the actual main memory directly. This data
> > > may be cached in specialized structures (which Intel calls Paging-Structure Caches) and/or in the caches
> > > used by processor memory accesses. (Avoiding the use of L1 caches by the hardware page table walker
> > > reduces contention with the processing core for its more timing critical L1 cache resources.)
> > >
> > > Software (the operating system or hypervisor) fills the page table with data. Typically
> > > software will also clear "accessed" bits occasionally to provide a measure of how recently
> > > (and frequently) a page of memory is used. (This information can be used to choose a good
> > > victim if a page needs to be swapped out.) Software may also clear "dirty" bits if
> > >
> > > > Just to clarify, this explanation is only for the TLB for the main memory, right?
> > >
> > > I am not certain what you mean by "the TLB for the main memory". The TLB is a cache of the page table(s).
> > > A page table provides translations (and permissions) for a virtual address space; caching this information
> > > reduces the cost of look-ups which would ordinarily occur for every memory access.
> > >
> > > > The L1DTLB or L3 TLB have no knowledge of what's going on in main memory, correct?
> > >
> > > TLBs are traditionally not coherent (or, more accurately, coherence is managed by software). This means
> > > that if a page table entry is changed by software (or by a memory management unit updating its "accessed"
> > > or "dirty" bit), the system's TLBs can contain stale data
> > > until the old entry is either invalidated by software
> > > or is naturally evicted from the TLB by other PTEs being
> > > loaded into the TLB. (In the case of "accessed" and
> > > "dirty" bit updating, this is not a major problem because
> > > the hardware only updates in one direction--so there
> > > is no way to inconsistently order the actions of different
> > > MMUs--and other than updating these bits the hardware
> > > does not act on this information--so at worst a few extraneous updates might occur.)
> > >
> > > (This non-coherence is becoming more of a concern with the increasing commonness of multiprocessor systems.
> > > Forcing every processor in the system to take an interrupt
> > > to run a software routine to invalidate a (possibly
> > > not present) TLB entry introduces more overhead as the number of processors increases.)
> > >
> > > > Also, if the TLB is on-die; where is it? Is it integrated into the IMC or another location?
> > >
> > > In a typical processor, a TLB access is needed for every memory access to provide translation and
> > > permission information (caches are typically tagged based on physical memory addresses rather than
> > > virtual addresses, so translation information would be needed before a cache hit or miss could be
> > > determined). This means that the L1 TLB tends to be tightly coupled to the L1 caches. (In some cases,
> > > a very small TLB--usually called a microTLB--for instruction pages is provided with a single L1 TLB
> > > for both instruction and data accesses. Instruction accesses have very high locality of reference,
> > > so even a two-entry microTLB can greatly reduce access contention for a shared L1 TLB.)
> > >
> > > L2 TLBs are often less tightly connected to the processing core (Itanium implementations being something of
> > > an exception; these access the L2 TLBs for every store and
> > > for all floating-point memory operations.). A more
> > > typical L2 TLB is only accessed on a miss in the L1 TLB,
> > > so its connection to the core is more indirect. (Note
> > > that a processor could be designed to use L2 TLBs to provide translations for prefetch engines.)
> > >
> > > There are many variations on how (primarily non-L1) TLB resources can be shared. Some implementations
> > > provide separate instruction and data L2 TLBs while others use a unified L2 TLB. TLB sharing across
> > > multiple cores has been proposed. (This is very similar to the sharing considerations for ordinary
> > > caches. Sharing reduces the commonness of underutilized resources but increases contention for those
> > > resources and reduces optimization opportunities from tighter binding or specialized use.)
> > >
> > > (Unlike ordinary caches, TLBs also have issues of different page sizes. This introduces another
> > > area where separate vs. shared trade-offs must be considered. For multi-level page tables
> > > and linear page tables, the caching of page table node entries is another concern that ordinary
> > > caches do not need to consider with respect to sharing vs. specializing.)
> > >
> > > This subject can get very complex, but the basic principles are fairly accessible (if
> > > the presenter does not confuse the reader with digressions and extraneous detail!).
> >
> > Thanks again for your answer! I find it funny how this is basic to you; definitely impressive
> > as I'm trying very hard to wrap my head around it, and the more I uncover, it seems like I'm
> > taking one step forward and two steps back. There's just so much to learn. Thanks again.
> >
> > So accessing, reading, and writing to the TLB is done by the table walker, correct?
> > The store/load units of the core itself do not interact directly with the TLB (in this
> > context, meaning, the table of translations and permissions) itself, correct?
> >
> > By "TLB for the main memory", I am referring to the TLB for the RAM. I refer to RAM by the
> > common term "memory" and to the caches as "caches." Sorry for the confusion. Though what I
> > meant by the question was; do multiple page tables only apply to the main memory (RAM) due to
> > the sheer size of it; or do all caches (L1,L2,L3,etc.) have multiple page tables? I would
> > think that either L1 or L2 cache would be too small to make use of multiple page tables.
> >
> > By the "The L1DTLB or L3 TLB have no knowledge of what's going on in main memory, correct?"
> > question, I meant; the L1DTLB or L1DTLB page walker will never have a situation where
> > either of those two units themselves will access the main memory (RAM), correct?
> >
> > I don't believe I understood the answer to my last question. By "location", I meant
> > that if the L1DTLB is geographically close to the L1 Dcache, and the L2 TLB is geographically
> > close to the L2 cache, and since the TLB for the main memory (RAM) CANNOT be geographically
> > close to the RAM as it has to be on-die; where is it located?
> >
> > Thank you again for your informative answers!
>
>
> You're making this too complicated. In the generic form, there's a single TLB for a CPU, and it's logically
> located on the path between the load/store units and the *entire* memory subsystem (caches, and RAM). The
> CPU, in the process of executing instructions, generates an address for accessing memory. That address is
> logically looked up in the page tables, and is then used to check the cache(s) and/or access memory as needed.
> The TLB is a cache of translations which speeds up the process of producing that translation.
>
> The most common design has only a single TLB for the core, and it's very tightly integrated to the
> load/store units. Some systems have a second level of TLB, to increase the amount of TLB data that
> can be cached, but that's just an "L2 TLB" and is not in any way associated with (say) the L2 cache.
> Some processors have multiple TLBs, because of different access paths for memory items (for example,
> you might have one for instruction fetches, one for integer memory accesses, and one for FP memory
> accesses if those nominally go to a different cache level). Those can be combined in various ways
> too - for example a core might have a TLB on the L1I, and a TLB on the L1D, and then have those
> two TLBs share access to a single L2-TLB. Or there could be two L2-TLBs.
>
> Performance considerations often dictate that the TLB is accessed in parallel with the L1 cache (which
> leads to the common restriction on L1 cache size being the associativity times the page size).
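
To make the size restriction above concrete for myself: if the L1 is indexed in parallel with the TLB lookup, the index bits have to come from the page offset (the part of the address that translation does not change), so each way of the cache can be at most one page. A rough sketch of that arithmetic follows. The 4 KiB page size and 8-way associativity are illustrative assumptions of mine, not figures from this thread, and the function names are made up.

```python
def max_parallel_indexed_l1_bytes(page_size: int, associativity: int) -> int:
    """Largest L1 whose set index fits entirely within the page offset bits."""
    return page_size * associativity


def index_fits_in_page_offset(cache_bytes: int, associativity: int, page_size: int) -> bool:
    """True if one way of the cache is no larger than one page."""
    way_bytes = cache_bytes // associativity
    return way_bytes <= page_size


# Assumed example values: 4 KiB pages, 8-way set-associative L1.
limit = max_parallel_indexed_l1_bytes(4096, 8)
print(limit // 1024, "KiB")
```

With those assumed numbers the limit works out to 32 KiB; a 64 KiB 8-way cache would need index bits from above the page offset, which is where the extra complications come in.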
>
> Other designs actually use virtual addresses in some levels of the cache hierarchy, which allows some cache
> lookups to proceed without needing to generate a translation. That complicates things a bit (namely page table
> invalidations and page sharing get more complicated), but doesn't really change the nature of the TLB.
>
> But you don't generally need TLBs past the innermost cache levels, because the further out caches tend
> to use physical addresses (as does RAM), and once you've done the translation (which you'd normally have
> to do before you can finish the L1 cache lookup), you've got, by definition, the physical address, so you
> just keep using that. So there's no TLB associated with the L3 cache or with main memory, rather the TLBs
> are associated with the address generation process, and have to exist in the path between the generation
> of virtual addresses, and the top of the memory hierarchy that each address generator accesses.
>
> How TLBs are managed, and seen by the OS, varies widely. In some systems, there's a defined format of the page
> tables, and when the required translation is not in the TLB, the hardware walks the page tables looking for
> the correct entry (which is then placed in the TLB). Other systems delegate much of the translation process
> to software, and the hardware provides some mechanism to allow the software to place a translation into the
> TLB. Various combinations and modifications of those schemes exist. Exactly how visible the actual implementation
> of the TLB is, also varies widely. On some systems (x86, for example), the visible model of the TLB is fairly
> simple, and the OS can mostly* ignore the internal details (IOW, the OS doesn't really care if there are multiple
> TLBs or multiple levels of TLBs). On x86, an OS that correctly managed the TLB on a 386 has an excellent chance
> of working on the latest core from Intel or AMD. On other systems (IPF, in some modes, for example), the OS
> may need to know about each separate TLB, and details about each TLB (number of entries, etc.), and changes
> in those structures may require modifications to the OS's TLB handling. And even if there are multiple TLBs,
> there’s often hidden structure underneath the architecturally visible stuff.
>
> In a sense, you can think about the TLBs imposing two requirements on the rest of the system. First, entries
> need to get put into the TLB in order to allow memory accesses to happen (note that many systems that support
> address translation allow address translation to be turned off, which usually eliminates the need for the
> TLBs to be doing anything - but we're ignoring that mode here). As mentioned above, how entries get put
> in varies quite a bit, from the hardware automatically walking the page tables, to the OS handling an exception
> for a missing translation and modifying a TLB entry itself. Second, and in many ways much more interesting,
> is the utterly critical process of getting stale entries *out* of the TLB, especially across multiple cores
> sharing memory - doing that efficiently can get very complicated.
>
> Many schemes have been proposed, and implemented, to get all of that done.
>
> Things like multiple levels of virtual memory (as introduced by virtualization), also complicate things.
>
> Wait, did I say *you* were making things too complicated? ;-)
>
>
>
> *There may well be performance implications for ignoring the internals. For example, some systems have
> different capacities (or even different TLBs) for normal sized pages and large pages. On a system whose
> TLBs can only cache a small number of large page entries, over-aggressively using large pages can result
> in a substantial performance hit as the large page TLB thrashes. Similarly, understanding the TLB purge
> process may allow an OS to do substantially less work than required by the generic architecture, or to
> optimize the process considerably. For example, there is often an instruction to purge TLB entries associated
> with a particular address (in addition to an instruction to purge the entire TLB), although those often
> (may) purge more entries than are strictly required. Depending on how big the TLB is, and how the specific
> TLB-entry purge instruction works (performance, actual specificity), there’s some crossover point where
> it’s faster to just purge the entire TLB (which is usually pretty quick, but you pay when all the reloads
> happen), vs. purging a bunch of individual entries.
>
Wow thanks! That is definitely a lot of information, I wonder if I can wrap my head around that...
So if I got this correctly...
- The TLB lies in the store components of a core. (I know you said one TLB per CPU... but then you said that the TLB lies in the store components, so you must've meant core, correct?)
- There are not multiple TLBs (like I thought) unless they are tertiary TLBs to the primary TLB, or a microTLB to help sort out the primary TLB.
- The TLB is just a cache of the last accesses to information; their translations and permissions.
- NO caches have a TLB solely for them; though all caches have a page table or multiple in them.
- The TLB uses these page tables to track what is inside a cache/RAM and uses the table to translate virtual addresses to physical addresses.
Did I... Get it right? I really hope so, this is going in a completely different direction than I thought it would.
Nonetheless, thank you again for all your help guys!
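
For what it's worth, here is how the translation flow described in the quoted replies might be sketched: a small TLB that caches entries from a page table held in memory, walks the table on a miss, and can hold stale data until software invalidates it. This is only a toy model under my own assumptions. The `TinyTLB` class, the dict standing in for the page table, the 4 KiB page size, the 4-entry capacity and the LRU replacement are all hypothetical choices for illustration, not how any particular processor does it.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TinyTLB:
    """Toy model: a small cache of page-table translations with LRU eviction."""

    def __init__(self, page_table, capacity=4):
        # page_table maps virtual page number -> physical frame number and
        # stands in for the in-memory page table filled in by the OS.
        self.page_table = page_table
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)  # mark as most recently used
        else:
            self.misses += 1
            if vpn not in self.page_table:
                raise KeyError(f"page fault: no mapping for virtual page {vpn}")
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            # "Walk" the page table and cache the result in the TLB.
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset

    def invalidate(self, vpn):
        """Software-managed coherence: drop one possibly stale entry."""
        self.entries.pop(vpn, None)


page_table = {0: 7, 1: 3}
tlb = TinyTLB(page_table)
tlb.translate(0x0010)   # miss: entry loaded from the page table
tlb.translate(0x0020)   # hit: same virtual page
page_table[0] = 9       # the OS changes the mapping...
tlb.translate(0x0030)   # ...but the TLB still uses the old frame
tlb.invalidate(0)       # until the stale entry is invalidated
tlb.translate(0x0030)   # now the new mapping is picked up
```

Note that nothing in the sketch is tied to a cache level or to RAM: the lookup happens on the virtual address, and everything after it works with the resulting physical address.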