By: Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com), August 3, 2013 8:23 am
Room: Moderated Discussions
Stubabe (Stubabe.delete@this.nospam.com) on August 2, 2013 7:49 pm wrote:
> Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on August 2, 2013 12:58 pm wrote:
>
> > Thank you very much for your explanation! However, there still seem to be some
> > unclear things that you didn't answer or I don't understand the answer to:
> >
> > - It is still unclear whether there are multiple TLBs in CPUs or not, with one TLB being
> > tightly tied to the L1 cache and the other one being used as a "general purpose" TLB.
> >
> > - If ALL page tables are in memory, does that mean the CPU must go to the RAM to check the
> > page table of the L2 cache? That seems INCREDIBLY inefficient... Shouldn't the L2 cache's
> > page table be close to the L2 cache itself? That seems most logical and efficient...
> >
> > Looking forward to your answer as always!
>
> You seem to be confused by the interaction between caching and TLBs. While many modern CPUs
> have virtually indexed L1 data caches (as an optimisation), that detail is probably not helpful to your
> understanding here. So consider a hypothetical CPU with only physically indexed caches, a
> single level of TLB, a 32-bit address space, 4KiB pages and a hardware pagewalker:
>
> The CPU decodes an instruction LOAD [0x12345 (74565 decimal)]. Paging is enabled, so this is a virtual
> address, i.e. the load unit must translate it to physical first. It queries the TLB (let's say it's
> a miss), so the hardware pagewalker kicks in. It subdivides the address into three parts:
>
> The top 10 bits (0000000000) are all zero -> Page Directory Index
> The next 10 bits (0000010010) equal 0x12 (18 decimal) -> Page Table Index
> The low 12 bits (001101000101) equal 0x345 (837 decimal) -> Page offset
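>
> In C the split is just shifts and masks (a quick sketch, assuming <stdint.h>; the variable names are made up):
>
>     uint32_t va = 0x12345;
>     uint32_t pd_index = (va >> 22) & 0x3FF; /* top 10 bits  -> 0   */
>     uint32_t pt_index = (va >> 12) & 0x3FF; /* next 10 bits -> 18  */
>     uint32_t offset   =  va        & 0xFFF; /* low 12 bits  -> 837 */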
>
> The system register "SYSEXAMPLE1" points to the 4KiB-aligned page directory (let's say it's in
> page 5), so to find the page table we need to load the 4-byte (32-bit) entry from address:
> (4096*5) + (4*0), i.e. the first (index zero) Page Directory Entry (see above) in the 4KiB page directory in page 5.
>
> This (like all subsequent memory accesses) may or may not be cached, so it could miss to RAM - but that is irrelevant
> to our discussion, so for this example just assume all cache levels have just been flushed and we are always
> accessing RAM. Next we extract the relevant 20 bits from this PDE (the other 12 can be used for access/permission
> flags etc.) to obtain the 4KiB-aligned page table (let's say it's in page 50 decimal), so we then read the
> 4-byte entry starting from the address (given by Page Table Index 18 from above):
> (4096*50) + (4*18), and again use 20 bits as a page address (this time for the data
> page itself). Let's say the PTE says the data page is at page 100 decimal. So the
> final physical address is (4096*100) + 837 (the page offset) = 410437.
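>
> As a minimal C sketch of that whole two-level walk (hypothetical: read_phys32() stands in for a
> raw physical-memory read, and present/permission bits are ignored):
>
>     uint32_t read_phys32(uint32_t paddr); /* hypothetical physical read */
>
>     uint32_t walk(uint32_t page_dir_base, uint32_t va)
>     {
>         uint32_t pde = read_phys32(page_dir_base + 4 * (va >> 22));                 /* 1st memory hit */
>         uint32_t pte = read_phys32((pde & 0xFFFFF000) + 4 * ((va >> 12) & 0x3FF));  /* 2nd memory hit */
>         return (pte & 0xFFFFF000) | (va & 0xFFF);
>     }
>
>     /* walk(5 * 4096, 0x12345) reads (4096*5)+(4*0), then (4096*50)+(4*18),
>        and returns (4096*100) | 0x345 = 0x64345 (410437). */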
>
> So virtual address 0x12345 (74565 decimal) is actually at physical address 0x64345 (410437).
> We had to hit memory 3 times in total - 2 of those JUST TO CALCULATE THE ADDRESS! This is
> very slow, so to avoid doing it again we cache the result in our TLB (Entry 1: 0x12000 -> 0x64000;
> note the bottom 12 bits are not needed as we are working with 4KiB-aligned, 4KiB-sized pages).
> Now we can actually issue the read to address 0x64345 to the cache/memory hierarchy.
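>
> For a picture of the TLB entry we just filled, here is a C sketch of a lookup (a real TLB is a
> parallel hardware CAM, not a loop; assumes <stdint.h> and <stdbool.h>):
>
>     struct tlb_entry { uint32_t vpn, pfn; bool valid; }; /* page numbers, not byte addresses */
>
>     bool tlb_lookup(const struct tlb_entry *tlb, int n, uint32_t va, uint32_t *pa)
>     {
>         uint32_t vpn = va >> 12;                          /* 0x12345 >> 12 = 0x12 */
>         for (int i = 0; i < n; i++)
>             if (tlb[i].valid && tlb[i].vpn == vpn) {
>                 *pa = (tlb[i].pfn << 12) | (va & 0xFFF);  /* hit: no page walk */
>                 return true;
>             }
>         return false;                                     /* miss: start the page walker */
>     }
>
> With our entry {vpn = 0x12, pfn = 0x64, valid = true}, looking up 0x12345 returns 0x64345 directly.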
>
> What did you notice about the role of the caches in this process? Apart from possibly caching a few page
> tables, they had no role whatsoever! Also, until the AGU/load unit has completed the translation, the caches
> are not even aware of what we are asking for (they are indexed by the physical address just like RAM), so
> the TLBs have not had a direct bearing on the caches either (although the page walker issues PTE loads
> via them, i.e. it is a client of the caches itself). The TLBs and caches are two totally unrelated structures
> on our chip. So if you are asking which TLB cache level 1/2/whatever has, the answer is ALWAYS NONE, as caches
> don't own the TLBs; the load/store unit does. I.e. TLB access is upstream of cache access in the memory
> pipeline. All the caches in this chip are physically indexed, so they do not care what goes into the TLB at all;
> only the resulting physical address is used in cache indexing, memory addressing etc. That means the caches
> have no need of the TLB data to track their contents or read/flush to RAM.
>
> ------------------------------------------------------------------------------------
> Now, if you actually get that (and yes, page walking is incredibly inefficient; that's why we have TLBs), then
> we can consider a virtually indexed, physically tagged L1 cache. Caches are composed of fixed-size cache
> lines arranged in sets of ways. Consider a virtually indexed, 64-byte-line, 32KiB, 8-way cache.
>
> There are 32768/64 = 512 cache lines in our cache
> There are 512/8 = 64 sets in our cache, of 8 ways each
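>
> (The same arithmetic in C, assuming <stdint.h>:)
>
>     uint32_t lines = 32768 / 64; /* = 512 cache lines        */
>     uint32_t sets  = lines / 8;  /* = 64 sets of 8 ways each */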
>
> So the low 6 bits (0-5) of an address are just the byte offset in the 64-byte cache line.
> The next 6 bits (6-11) give us our set. Each set has 8 possible cache lines for our data to go in, and these
> are managed by a pseudo-LRU-type algorithm. Notice we need only the low 12 bits of the address (the same
> as the 4KiB page offset above - this is not an accident) to find our place in the cache. To determine
> the correct way (since many different addresses have the same bits 6-11) we need to check the TAG held
> against the cache line, as this holds the full physical address (minus the 12 bits covered above), so
> we can see which way (if any) holds our data by checking the 8 TAGs in our set in parallel.
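>
> In C, carving up our example physical address for this 32KiB/8-way/64-byte-line cache looks
> like this (a sketch; the "tag" is just the address with the low 12 bits dropped):
>
>     uint32_t pa   = 0x64345;          /* physical address from the walk above  */
>     uint32_t byte =  pa       & 0x3F; /* bits 0-5:  offset within the 64B line */
>     uint32_t set  = (pa >> 6) & 0x3F; /* bits 6-11: which of the 64 sets       */
>     uint32_t tag  =  pa >> 12;        /* compared against the set's 8 TAGs     */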
>
> TLBs are also often arranged either as an N-way CAM-type memory or in sets and ways like a cache, and can
> often take nearly as long to look up as an L1 cache fetch. So rather than index the cache with the physical
> address from the TLB, we can index the L1 with the pre-translated virtual address, so we can start it in
> parallel with the TLB lookup. Note we still need to get the physical address back to do the final compare
> with the TAG, but this is an optimisation that can save a few clocks of load-to-use latency (if we get a TLB
> hit). However, it only makes sense for the L1, since we already have the physical address before we need
> to query the L2 (i.e. on a miss in L1), so there is no advantage in making the L2/L3 virtually indexed. So
> please don't confuse this clever "cheat" with some kind of TLB-cache association; it isn't one.
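>
> The trick works because 64 sets * 64 bytes = 4096 bytes, so the index bits sit entirely inside
> the 4KiB page offset, and the virtual and physical set index are guaranteed to match (sketch):
>
>     uint32_t va = 0x12345, pa = 0x64345;
>     uint32_t vset = (va >> 6) & 0x3F; /* = 13 */
>     uint32_t pset = (pa >> 6) & 0x3F; /* = 13: identical, because bits 6-11
>                                          never change under 4KiB translation */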
>
> Ideally a CPU should have enough TLB entries to completely cover all its caches, but this is not always the case,
> since as you make the TLB bigger it gets slower. So often you have a second, larger and slower (L2) TLB that
> is queried on a miss in the L1 TLB (probably in parallel with the page walker starting up). Also, you may choose
> to cache PDEs (i.e. page table addresses) in your TLB, or even in a special cache for the page walker.
> But all of this is implementation detail and doesn't really change the core algorithm.
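>
> Putting the pieces together, the translation flow is roughly (reusing the tlb_lookup() and
> walk() sketches from above; TLB fills omitted for brevity):
>
>     uint32_t translate(const struct tlb_entry *l1, int n1,
>                        const struct tlb_entry *l2, int n2,
>                        uint32_t page_dir_base, uint32_t va)
>     {
>         uint32_t pa;
>         if (tlb_lookup(l1, n1, va, &pa)) return pa; /* small, fast L1 TLB    */
>         if (tlb_lookup(l2, n2, va, &pa)) return pa; /* larger, slower L2 TLB */
>         return walk(page_dir_base, va);             /* full page walk        */
>     }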
>
Oh boy... Well, that is certainly a LOT to take in... Let's see if I can recite what you guys said in my own words and understanding. There are still some "blanks", though... Perhaps you guys can help fill them in with me?
So an instruction asks the LSU to get ahold of "#12345678", and the LSU has to translate that virtual address into a physical address (here is my first blank: how does it do this? What unit translates the virtual address into a physical address, and how?). At the same time, the LSU can also try loading a line (a line? Here's another of my blanks - I have 4 in total. A lot, I know, sorry; I'm not the sharpest tool in the shed, but I try to be ':) - why is taking one line and comparing it to the translated physical address the equivalent of comparing it to the entire L1 cache? Isn't the L1 cache larger than one line?). It then compares the physical address to the cache line, and if they match, the data is found and the translation goes into the TLB.
If it is not found, it moves to the L2 cache and uses the physical address there (here's where two more of my blanks come in: what does it use to look through the L2 and L3 caches? The page walker? Also, if the physical address is already known, why does the LSU have to bother going L1 > L2 > L3 > RAM when it can just go directly to where the physical address is, since it already has it? Or does the physical address carry no notion of where the data sits in the physical memory hierarchy?). If it finds the info, the data is found; if not, it moves on to the L3.
The process is repeated, and if the data is not found, it moves to RAM.
And here's my last blank: it doesn't do the same process in RAM, does it? Does it use the page tables to look up what's in RAM? I'm now confused as to what the page tables do...
Sorry for probably missing a lot of what you guys said; I'm a little slow, so please bear with me here. Thanks for the great help!