By: rwessel (robertwessel.delete@this.yahoo.com), August 1, 2013 12:21 am
Room: Moderated Discussions
Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on July 31, 2013 9:12 pm wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on July 31, 2013 4:11 pm wrote:
> > Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on July 31, 2013 2:15 pm wrote:
> > [snip]
> >
> > > - The TLB lies in the store components of a core. (I know you said one TLB per CPU... but then
> > > you said that the TLB lies in the store components, so you must've meant core, correct?)
> >
> > To get an idea of where TLBs are physically located, you can take a look at CPU floorplans
> > (doing a web image search for "CPU core floorplan" should provide some interesting viewing).
> > Hans de Vries has a number of such images (and some articles with analysis). The following
> > is probably a helpful example: http://www.chip-architect.com/news/K8L_floorplan.jpg
> >
> > > - There are not multiple TLBs (like I thought) unless they are tertiary TLBs
> > > to the primary TLB, or a microTLB to help sort out the primary TLB.
> >
> > TLBs are not associated with levels of the (ordinary) cache,
> > but like cache can be divided into multiple levels
> > and be shared or private between instruction and data accesses (and even between different cores).
> >
> > > - The TLB is just a cache of the last accesses to information; their translations and permissions.
> >
> > Yep! The "authoritative" source for such information is the page table(s) and TLBs are caches of page table
> > information. (Intel calls the structure which stores internal
> > node entries in its multi-level page table Paging-Structure
> > Caches, so Intel might view a TLB as only holding Page Table Entries and not any Page Directory Entries
> > and other parts of the hierarchical page table. Since Intel
> > has not implemented a structure which caches both
> > types of page table components, it is not obvious what
> > Intel would call such a hardware structure. I seem to
> > recall Robert Wessel [on the comp.arch newsgroup, I think]
> > mentioning some time ago that some implementation
> > of IBM mainframes did have TLBs that cached more than just PTEs--and called them TLBs.)
> >
> > > - NO caches have a TLB solely for them; though all caches have a page table or multiple in them
> >
> > The L1 TLBs are somewhat tightly bound to the L1 caches in typical processors
> > (that have TLBs) to allow minimal latency in L1 cache accesses. However, there
> > is typically not a special TLB for the L2 cache nor another for the L3 cache.
> >
> > Page tables are conceptually stored in main memory, but like other parts of main memory can be found
> > in caches. (Theoretically, pages of a page table can even be paged out to swap like other memory.)
> >
> > In a typical system, each software process has a separate virtual address space requiring a separate
> > page table. (Threads within a process share address space/page table.) The OS typically takes a
> > portion of each process' total address space as a virtual memory region shared across the system.
> > Mechanisms exist to avoid duplicating all of the information in this region (e.g., the ARM ISA
> > defines a global address space with a mask determining what accesses are within the per-process
> > address space and which are in the global address space). (Eep! Excess digression!)
> >
> > > - The TLB uses these page tables to track what is inside a cache/RAM and
> > > uses the table to translate virtual addresses to physical addresses.
> >
> > The TLBs are (typically) not concerned with what is present
> > in the caches and only provide information to translate
> > virtual addresses to physical addresses and some other metadata
> > like permissions, accessed and dirty bits, cache
> > write-through/write-back behavior, etc. Each virtual address
> > page (typically an aligned chunk of memory about
> > 4KiB in size) is associated with a Page Table Entry (typically
> > 64-bits in size for 64-bit address spaces). The
> > page table entries are stored in the page table and the information is cached in the TLBs.
> >
> > A PTE can have its validity bit cleared to indicate that the page associated with the virtual
> > address is not in memory, but it is the operating system software that handles such. (Some
> > specifics of how the TLB handles invalid PTEs depend on the choices made by the developers
> > of the architecture, but obviously any non-speculative access must generate an exception
> > so that the OS can load the appropriate page and set the valid bit of the PTE.)
> >
> > > Did I... Get it right? I really hope so, this is going in
> > > a completely different direction than I thought it would.
> >
> > It looks like you have got it mostly right.
> >
> > As with any area of knowledge, the more one learns, the more one discovers
> > how much one does not know (and how interconnected knowledge is).
> >
> > (In my reading about computer architecture, I have concentrated somewhat on caches and TLBs because
> > these areas seem to be somewhat more accessible--lack of circuit design or programming knowledge is
> > not a major issue for reading most of the academic papers--,
> > the way they impact performance is relatively
> > straightforward, and they have a significant impact on performance and power efficiency.)
>
> Ohhh so this is starting to make a bit more sense now; thanks for your explanations!
>
> - So it really does look like the TLBs are physically located extremely
> close to the back-end of the CPU architecture. Interesting!
Logically, and assuming all physically addressed caches, you can think of the TLBs as being between the load/store units and the L1 cache. And then the process is: the load/store unit generates a load or store to a virtual address, that address is translated by the TLB to a physical address, and that physical address is then used to check the L1 cache to see if the data is there. If the data is not there, the physical address is used to check the L2, L3, L4..., and then finally actually access RAM if the data item is not found in any of the caches. So the translation needs to happen only once.
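As a rough sketch of that flow (a toy C model with invented names and numbers - real hardware does this in pipelined logic, not code):

    #include <stdio.h>
    #include <stdint.h>

    /* Toy model: one hypothetical TLB entry mapping a 4KB virtual
       page to a physical page. Translation happens once; the
       resulting physical address would then be reused for the L1,
       L2, L3... probes and finally RAM. */
    #define PAGE_SHIFT 12
    static uint64_t tlb_vpn = 0x12345, tlb_pfn = 0x00042;

    static uint64_t translate(uint64_t vaddr) {
        if ((vaddr >> PAGE_SHIFT) == tlb_vpn)   /* TLB hit */
            return (tlb_pfn << PAGE_SHIFT) | (vaddr & 0xFFF);
        return UINT64_MAX;                      /* TLB miss */
    }

    int main(void) {
        uint64_t vaddr = (0x12345ULL << PAGE_SHIFT) | 0x9A0;
        uint64_t paddr = translate(vaddr);      /* translate once */
        /* paddr now checks the L1; on a miss, the *same* paddr
           checks L2, L3, ... and finally RAM. */
        printf("vaddr %#llx -> paddr %#llx\n",
               (unsigned long long)vaddr, (unsigned long long)paddr);
        return 0;
    }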
If the translation is not in the TLB, the process stops until the data needed for the translation *does* get to the TLB. On some processors, like x86, the hardware will walk the page tables in memory looking for the translation, and generate an exception if it can't find one. But if it does find it, the translation gets loaded into the TLB, and the executing program sees no effect from the walk process (except for the slowdown). On other processors, there's no automatic page table walking: there's always an exception on a TLB miss, and the OS has to load the required TLB entry. In either case, the OS can restart the instruction that was aborted because of a missing translation (for example, that is how virtual memory is handled - after the required page is paged back in, the translation is put into the page tables, and you re-execute the instruction).
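A similarly hedged sketch of the miss path, using a made-up single-level page table (x86's real tables are multi-level, and a real walker is hardware, not C):

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PTE_VALID  1u
    #define NUM_PAGES  1024

    /* Toy page table living in ordinary memory. */
    static uint64_t page_table[NUM_PAGES];

    /* On a TLB miss the walker fetches the PTE; an invalid PTE
       raises a page fault so the OS can page the data in, set the
       valid bit, and let the instruction re-execute. */
    uint64_t walk(uint64_t vaddr, void (*page_fault)(uint64_t)) {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        uint64_t pte = page_table[vpn];
        if (!(pte & PTE_VALID)) {
            page_fault(vaddr);      /* OS fixes up the page table */
            pte = page_table[vpn];  /* then the walk is retried   */
        }
        /* the translation would also be installed in the TLB here */
        uint64_t pfn = pte >> PAGE_SHIFT;
        return (pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
    }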
Some processors have virtually addressed (indexed) caches, at least at the lower levels (nearer the CPU), so that the cache lookup is done via virtual, rather than physical, addresses. This has the advantage that you can access those levels of cache *without* doing the translation, although it introduces a number of unpleasant problems managing pages with multiple translations (virtual address aliases occur fairly often - for example, an object mapped into more than one address space). So virtually addressed caches are greatly liked by the hardware folks (as they greatly simplify a very performance-sensitive portion of the CPU), and detested by OS folks (as they introduce huge complexities and inefficiencies in the management of virtual memory). In any event, there is still a TLB, but the point where it needs to be accessed just moves to behind the last level of virtually indexed cache (usually only the first level is virtually indexed).
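To make the aliasing problem concrete, here's a small illustration (cache geometry invented: 64-byte lines, 512 sets, so the index needs bits 6..14 - three bits above the 4KB page offset):

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_SHIFT 6
    #define NUM_SETS   512

    /* Set index of a virtually indexed cache: uses bits 6..14. */
    static unsigned set_of(uint64_t addr) {
        return (addr >> LINE_SHIFT) & (NUM_SETS - 1);
    }

    int main(void) {
        /* Two virtual aliases of the same physical page (e.g. a
           file mapped at two addresses). The low 12 bits match,
           but bits 12..14 differ, so the same data can end up
           cached in two different sets. */
        uint64_t va1 = 0x0000A000, va2 = 0x0001D000;
        printf("alias 1 -> set %u\n", set_of(va1 + 0x40));  /* 129 */
        printf("alias 2 -> set %u\n", set_of(va2 + 0x40));  /* 321 */
        return 0;
    }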
> - So if the L1 TLB is tightly bound to the L1 Cache, does that mean that there is a separate TLB that does
> the translation/permission caching for the memory structures L2 cache through main memory? Or does the L1
> Cache do double (quadruple?) duty and store info for ALL memory structures... The core floorplan seems to
> have two TLBs; or at least the core floorplan that you provided previously (thanks again for that)
TLBs and L1 caches are commonly closely tied together because that allows an implementation trick that provides a significant performance benefit. Namely, you can overlap a significant part of the TLB access and the check of the L1, rather than doing the purely sequential process described above. If the L1 is small enough, the bits of the address used to index into the cache come solely from the low-order bits of the address (the page offset), which are not affected by translation. IOW, if you can compute the cache index with only the low 12 bits (and you've got the usual 4KB pages), you can read the cache line *before* doing the translation, as those 12 bits are *not* changed by the translation.
So what you do is the translation (via the TLB) in parallel with the read of the cache line. Once *both* of those are done, you can compare the tags on each entry in the cache line with the *translated* address, and finish the cache lookup process.
So that effectively gets you the best of both worlds - the ease of use of a physically indexed cache with the performance of a virtually indexed cache (assuming you can complete the TLB lookup during the cache line read).
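Here's a sketch of that overlap (again a toy model: 4KB pages, 64-byte lines, 64 sets, 4 ways for a 16KB L1 - all the structures are invented):

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define LINE_SHIFT  6                /* 64-byte lines           */
    #define NUM_SETS   64                /* 64 sets * 64B = 4KB/way */
    #define NUM_WAYS    4                /* 4 ways -> 16KB total    */

    struct line { uint64_t tag; int valid; };
    static struct line cache[NUM_SETS][NUM_WAYS];

    /* The set index uses only bits 6..11 - inside the page offset -
       so the set can be read out *before* the TLB finishes. Once
       the physical address arrives, its tag is compared against
       every way in the set. */
    int l1_hit(uint64_t vaddr, uint64_t paddr_from_tlb) {
        unsigned set = (vaddr >> LINE_SHIFT) & (NUM_SETS - 1);
        uint64_t tag = paddr_from_tlb >> PAGE_SHIFT;  /* physical tag */
        for (int w = 0; w < NUM_WAYS; w++)
            if (cache[set][w].valid && cache[set][w].tag == tag)
                return 1;    /* hit */
        return 0;            /* miss */
    }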
The downside is that there are not that many bits available to do that - just 12 with 4KB pages (all the other bits are modified during translation, and so are not available early). So you cannot address all that much cache. In fact, with a direct mapped cache, you could only have a 4KB L1 if you limited the index to the low 12 bits. This can be improved by increasing the associativity of the cache. For example, if you have a four-way set associative cache, so that each set can hold four cache lines, you can have an L1 of 16KB (4 ways times 4KB). Increasing the associativity, at least up to about 4- or 8-way, also significantly increases the hit rate of the cache (by reducing the number of conflict misses), so a 4- or 8-way L1 is pretty common. Which explains why 16 and 32KB are common L1 cache sizes.
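The size arithmetic from that paragraph, as a trivial worked example (4KB pages assumed):

    #include <stdio.h>

    /* Largest L1 indexable with only page-offset bits:
       page_size * ways. */
    int main(void) {
        unsigned page = 4096;
        for (unsigned ways = 1; ways <= 8; ways *= 2)
            printf("%u-way -> %2u KB\n", ways, page * ways / 1024);
        return 0;   /* prints 4, 8, 16, 32 KB */
    }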
Note that the number of ways in a set is not tied to binary addressing (it need not be a power of two), so it's not uncommon (or at least it was not uncommon in the past) to see, for example, 6-way set associative caches, if that better meets die size constraints.
Some processors have used unusually large degrees of associativity to avoid a multi-level cache structure. The IBM 3081, for example, had 64KB caches (single level only, so you'd not usually call it an L1, although that would not be wrong). They did that with 16-way set associative caches, despite the added complexity that caused, and for little or no improvement in hit rate from the extra associativity itself. *But* the designers decided that a more complex single-level cache of 64KB was more suited to their goals than a 16KB L1 backed with a (say) 256KB L2 (which would probably provide similar performance).
> - Page tables are conceptually stored in main memory? Does that mean that the L2 and L3 cache do
> not have their own page table and must be physically walked through EVERY time something wants
> to be accessed in them? That seems needlessly inefficient; I gotta be missing something here.
The page tables are typically in memory. The page tables are *conceptually* walked for every memory access, before the L1 cache probe. The TLB short-circuits that for (hopefully!) the vast majority of translations. And once the translation is done, you don't need to do it again - so if you do it as part of the L1 probe, you can reuse it for the L2, L3, L4... probes, and, if necessary, the access to actual RAM.
As we discussed earlier, this can be complicated by the CPU having different access paths to memory. Separate instruction and data L1s are common, and it's common to provide separate TLBs for those as well. That has the added benefit of allowing you to increase the effective size of the TLB (which is very performance critical) without increasing its physical size (which *will* make it slower). Basically, by having two TLBs (instruction and data), each can be the smaller, faster "size", yet you have twice as many entries (assuming equal sizes). Also, having separate TLBs allows both to operate in parallel, improving throughput.
Page tables can also typically be cached by the ordinary caches. When the TLB misses, whatever process walks the page tables is just doing fairly ordinary memory accesses (typically via real/physical addresses, and not subject to translation), and if those happen to hit in cache, great.
That's also one of the reasons why there has been less pressure to go to complex multi-level TLBs. In a practical sense the cache subsystem *does* act like a last-level TLB, in that it can hold the page table entries (in raw form, anyway). So fetching a page table entry from the (say) L3 will be much faster than fetching it from main memory, and the L3 will usually be massively larger than your TLB structures (the biggest multi-level TLBs out there hold on the order of thousands of entries, while there are L3s that can hold millions of page table entries - although those compete with other data to be cached).