By: Stubabe (Stubabe.delete@this.nospam.com), August 2, 2013 6:49 pm
Room: Moderated Discussions
Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on August 2, 2013 12:58 pm wrote:
> Thank you very much for your explanation! However, there still seems to be some
> unclear things that you didnt answer or I don't understand the answer to;
>
> - It is still unclear as to whether there are multiple TLBs in CPUs or not; with one TLB being
> tightly tied to the L1 cache and the other one being used as a "general purpose" TLB.
>
> - If ALL page tables are in memory, does that mean the CPU must go to the RAM to check the
> page table of the L2 cache? That seems INCREDIBLY inefficient... Shouldnt the L2 cache's
> page table be close to the L2 cache itself? That seems most logical and efficient...
>
> Looking forward to your answer as always!
You seem to be confused by the interaction between caching and TLBs. While many modern CPUs have virtually indexed L1 data caches (as an optimisation), it is probably not helpful to your understanding here. So consider a hypothetical CPU with only physically indexed caches, a single level of TLB, a 32-bit address space, 4KiB pages and a hardware pagewalker:
The CPU decodes an instruction LOAD [0x12345 (74565 decimal)]. Paging is enabled, so this is a virtual address, i.e. the load unit must translate it to a physical address first. It queries the TLB (let's say it's a miss), so the hardware pagewalker kicks in. It subdivides the address into three parts:
The top 10 bits (0000000000) are all zero -> Page Directory Index
The next 10 bits (0000010010) equal 0x12 (18 decimal) -> Page Table Index
The low 12 bits (001101000101) equal 0x345 (837 decimal) -> Page offset
The system register "SYSEXAMPLE1" points to the 4KiB-aligned page directory (let's say it's in page 5), so to find the page table we need to load the 4-byte (32-bit) entry from address:
(4096*5) + (4*0), i.e. the first entry (index zero, our Page Directory Index from above) in the 4KiB page directory in page 5.
This (like all subsequent memory accesses) may or may not be cached, so it could miss to RAM - but that is irrelevant to our discussion, so for this example just assume all cache levels have just been flushed and we are always accessing RAM. Next we extract the relevant 20 bits from this PDE (the other 12 can be used for access/permission flags etc.) to obtain the 4KiB-aligned page table (let's say it's in page 50 decimal), so we then read the 4-byte entry starting from address (given by Page Table Index 18 from above):
(4096*50) + (4*18), and again use 20 bits of that entry as a page address (this time for the data page itself). Let's say the PTE says the data page is at page 100 decimal. So the final physical address is (4096*100) + 837 (i.e. the page offset) = 410437.
So virtual address 0x12345 (74565 decimal) is actually at physical address 0x64345 (410437).
We had to hit memory twice JUST TO CALCULATE THE ADDRESS (and a third time for the data itself)! This is very slow, so to avoid doing it again we cache the result in our TLB (Entry 1: 0x12000 -> 0x64000; note the bottom 12 bits are not needed as we are working with 4KiB-aligned, 4KiB-sized pages).
Only now can we actually issue the read to address 0x64345 to the cache/memory hierarchy.
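To make the walk concrete, here is a minimal C sketch of the same two-level walk. The helper names (read_phys_u32, tlb_insert) and the page_dir_base parameter are invented for illustration; they stand in for the hardware's memory port, its TLB fill port and the SYSEXAMPLE1 register.

#include <stdint.h>

/* Hypothetical helpers standing in for hardware behaviour (not a real API):
 * read_phys_u32() reads a 32-bit word at a physical address and
 * tlb_insert() records a virtual->physical page mapping in the TLB. */
uint32_t read_phys_u32(uint32_t phys_addr);
void tlb_insert(uint32_t virt_page, uint32_t phys_page);

/* Two-level walk for a 32-bit virtual address with 4KiB pages (10/10/12 split). */
uint32_t page_walk(uint32_t page_dir_base, uint32_t vaddr)
{
    uint32_t pde_index = (vaddr >> 22) & 0x3FF;  /* top 10 bits  -> Page Directory Index */
    uint32_t pte_index = (vaddr >> 12) & 0x3FF;  /* next 10 bits -> Page Table Index     */
    uint32_t offset    =  vaddr        & 0xFFF;  /* low 12 bits  -> Page offset          */

    /* 1st memory access: the Page Directory Entry, e.g. (4096*5) + (4*0). */
    uint32_t pde = read_phys_u32(page_dir_base + 4 * pde_index);
    uint32_t page_table_base = pde & 0xFFFFF000;    /* keep the top 20 bits */

    /* 2nd memory access: the Page Table Entry, e.g. (4096*50) + (4*18). */
    uint32_t pte = read_phys_u32(page_table_base + 4 * pte_index);
    uint32_t data_page_base = pte & 0xFFFFF000;     /* e.g. page 100 -> 0x64000 */

    /* Cache the translation so we never have to walk this page again. */
    tlb_insert(vaddr & 0xFFFFF000, data_page_base);

    return data_page_base | offset;                 /* e.g. 0x64000 | 0x345 = 0x64345 */
}

Called as page_walk(5*4096, 0x12345) with the PDE/PTE contents from the example, it returns 0x64345, matching the hand calculation above.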
What did you notice about the role of the caches in this process? Apart from possibly caching a few page tables, they had no role whatsoever! Also, until the AGU/load unit has completed the translation, the caches are not even aware of what we are asking for (they are indexed by the physical address, just like RAM), so the TLBs have had no direct bearing on the caches either (although the page walker issues its PTE loads through them, i.e. it is itself a client of the caches). The TLBs and caches are two totally unrelated structures on our chip. So if you are asking which TLB cache level 1/2/whatever has, the answer is ALWAYS NONE, because the caches don't own the TLBs; the load/store unit does. In other words, TLB access is upstream of cache access in the memory pipeline. All the caches in this chip are physically indexed, so they do not care what goes into the TLB at all; only the resulting physical address is used for cache indexing, memory addressing etc. That means the caches have no need of the TLB data to track their contents or to read from/flush to RAM.
------------------------------------------------------------------------------------
Now if you actually get that (and yes, page walking is incredibly inefficient - that's why we have TLBs), then we can consider a virtually indexed, physically tagged L1 cache. Caches are composed of fixed-size cache lines arranged in sets of ways. Consider a virtually indexed, 64-byte-line, 32KiB, 8-way cache:
There are 32768/64 = 512 cache lines in our cache
There are 512/8 = 64 sets in our cache of 8 ways each
So the low 6 bits (0-5) of an address are just the byte offset within the 64-byte cache line.
The next 6 bits (6-11) give us our set. Each set has 8 possible cache lines (ways) for our data to go in, and these are managed with a pseudo-LRU type algorithm. Notice we need only the low 12 bits of the address (the same as the 4KiB page offset above - this is not an accident) to find our place in the cache. To determine the correct way (since many different addresses share the same bits 6-11) we need to check the TAG held against each cache line, as this holds the rest of the physical address (minus the 12 bits covered above), so we can see which way (if any) holds our data by checking the 8 TAGs in our set in parallel.
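To make the arithmetic concrete, here is a small C sketch of how those fields could be carved out of an address for this 32KiB, 8-way, 64-byte-line cache, and what the parallel tag check amounts to. The structure and helper names are invented for illustration, not any real CPU's implementation.

#include <stdint.h>

/* Geometry of the example cache: 32KiB, 8-way set associative, 64-byte lines. */
#define LINE_SIZE  64u
#define NUM_WAYS   8u
#define NUM_SETS   ((32u * 1024u) / LINE_SIZE / NUM_WAYS)   /* = 64 */

static inline uint32_t line_offset(uint32_t addr) { return addr & (LINE_SIZE - 1); }       /* bits 0-5  */
static inline uint32_t set_index(uint32_t addr)   { return (addr >> 6) & (NUM_SETS - 1); } /* bits 6-11 */
static inline uint32_t tag_bits(uint32_t addr)    { return addr >> 12; }                   /* bits 12-31 */

/* One set holds 8 ways; a hit means one of the 8 stored tags matches. */
struct cache_set { uint32_t tag[NUM_WAYS]; uint8_t valid[NUM_WAYS]; };

int lookup_way(const struct cache_set *set, uint32_t phys_addr)
{
    uint32_t tag = tag_bits(phys_addr);
    for (uint32_t way = 0; way < NUM_WAYS; way++)   /* hardware checks all 8 in parallel */
        if (set->valid[way] && set->tag[way] == tag)
            return (int)way;                        /* hit: this way holds our line */
    return -1;                                      /* miss */
}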
TLBs are also often arranged either as an N-way CAM-type memory or in sets and ways like a cache, and can often take nearly as long to look up as an L1 cache fetch. So rather than index the cache with the physical address from the TLB, we can index the L1 with the pre-translation virtual address, so that the cache lookup can start in parallel with the TLB lookup. Note we still need the physical address back to do the final compare with the TAG, but this is an optimisation that can save a few clocks of load-to-use latency (if we get a TLB hit). However, it only makes sense for the L1, since we already have the physical address by the time we need to query the L2 (i.e. on a miss in the L1), so there is no advantage in making the L2/L3 virtually indexed. So please don't confuse this clever "cheat" with some kind of TLB/cache association; it isn't one.
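The trick works because the set index (bits 6-11) lies entirely inside the 12-bit page offset, which translation never changes, so the virtual and physical addresses pick the same set. A rough sketch of the overlap, reusing the helpers above (tlb_lookup is again a hypothetical stand-in for the L1 TLB):

/* Hypothetical TLB lookup: returns the 4KiB-aligned physical page for vaddr
 * (walking the page tables on a miss). */
uint32_t tlb_lookup(uint32_t vaddr);

int vipt_l1_hit(const struct cache_set *sets, uint32_t vaddr)
{
    /* In hardware this starts in parallel with the TLB lookup: the set index
     * only uses bits 6-11, which sit inside the page offset, so the
     * untranslated virtual address is enough to pick the set. */
    const struct cache_set *set = &sets[set_index(vaddr)];

    /* The TLB delivers the physical page just in time for the tag compare,
     * which is the only step that needs it. */
    uint32_t paddr = tlb_lookup(vaddr) | (vaddr & 0xFFF);

    return lookup_way(set, paddr) >= 0;   /* true means the L1 has the line */
}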
Ideally a CPU should have enough TLB entries to completely cover all its caches, but this is not always the case, since as you make the TLB bigger it gets slower. So often you have a second, larger and slower (L2) TLB that is queried on a miss in the L1 TLB (probably in parallel with the page walker starting up). Also you may choose to cache PDE entries (i.e. page table addresses) in your TLB, or even in a special cache for the page walker. But all of this is implementation detail and doesn't really change the core algorithm.
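As a final sketch of that lookup order, tying the earlier pieces together (the l1_tlb_*/l2_tlb_* helpers are invented for illustration):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical TLB helpers: each lookup returns true on a hit and writes the
 * 4KiB-aligned physical page into *phys_page. */
bool l1_tlb_lookup(uint32_t vaddr, uint32_t *phys_page);
bool l2_tlb_lookup(uint32_t vaddr, uint32_t *phys_page);
void l1_tlb_insert(uint32_t virt_page, uint32_t phys_page);
uint32_t read_page_dir_base(void);   /* the SYSEXAMPLE1 register from above */

uint32_t translate(uint32_t vaddr)
{
    uint32_t phys_page;
    if (l1_tlb_lookup(vaddr, &phys_page))       /* small and fast: the common case */
        return phys_page | (vaddr & 0xFFF);
    if (l2_tlb_lookup(vaddr, &phys_page)) {     /* bigger and slower, but no walk */
        l1_tlb_insert(vaddr & 0xFFFFF000, phys_page);
        return phys_page | (vaddr & 0xFFF);
    }
    /* Last resort: the full page walk (see page_walk() earlier), which
     * refills the TLBs itself. */
    return page_walk(read_page_dir_base(), vaddr);
}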