By: Antti-Ville Tuunainen (avtuunainen.delete@this.gmail.com), August 2, 2013 1:09 pm
Room: Moderated Discussions
Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on August 2, 2013 12:58 pm wrote:
> Thank you very much for your explanation! However, there still seems to be some
> unclear things that you didnt answer or I don't understand the answer to;
>
> - It is still unclear as to whether there are multiple TLBs in CPUs or not; with one TLB being
> tightly tied to the L1 cache and the other one being used as a "general purpose" TLB.
>
> - If ALL page tables are in memory, does that mean the CPU must go to the RAM to check the
> page table of the L2 cache? That seems INCREDIBLY inefficient... Shouldnt the L2 cache's
> page table be close to the L2 cache itself? That seems most logical and efficient...
>
> Looking forward to your answer as always!
You are confused about how memory accesses work. I'm going to walk you through a single memory access that hits the TLB and misses all the caches on an Intel SNB CPU.
The instruction starts from the LSU, carrying a virtual address. To fetch from memory or the caches, this virtual address needs to be translated into a physical address. However, the L1 uses a neat trick, explained earlier in this thread, that lets the fetch *begin* before the address is resolved.
So, what happens is that the data TLB is asked "what is the physical address corresponding to #12345678?", and it returns an answer. That answer is then compared against the tag of the L1 line to see whether it matches. It doesn't, so we move on to the L2 cache.
Now, to fetch from L2, we need a physical address, *but we already have one*. Since it's impossible to access the L2 cache without going through the L1, there is no need for a TLB for the L2 cache -- by the time you are asking the L2, you already have a physical address. The L2 receives the address and reports a miss. The same happens at the L3, and we finally look the value up in main RAM.
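The L1 trick works because, with 4 KiB pages, the low 12 bits of an address are identical virtually and physically, so the cache can use them before translation finishes. A minimal sketch of how the address splits apart (the cache geometry here is an illustrative assumption, chosen so the set index lands entirely inside the page offset):

```python
# Illustrative parameters, not taken from any datasheet:
# 4 KiB pages, 64 B lines, 32 KiB 8-way L1 -> 64 sets.
PAGE_SHIFT = 12
LINE_SHIFT = 6
L1_SETS = 64  # 32 KiB / (64 B * 8 ways)

def split_va(va):
    """Split a virtual address the way the parallel L1/TLB lookup sees it."""
    page_offset = va & ((1 << PAGE_SHIFT) - 1)  # untranslated low bits
    vpn = va >> PAGE_SHIFT                      # this part goes to the TLB
    # Set index = bits 6..11, entirely inside the page offset, so the
    # cache can pick its set before the TLB returns the translation;
    # the TLB's answer is only needed for the final tag compare.
    set_index = (va >> LINE_SHIFT) % L1_SETS
    return page_offset, vpn, set_index

po, vpn, s = split_va(0x12345678)
print(hex(po), hex(vpn), s)  # 0x678 0x12345 25
```

With a larger or lower-associativity L1, some index bits would fall above bit 11 and the trick would no longer work without extra aliasing machinery.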
Most modern CPUs have two distinct TLB "stacks", one for instructions and one for data, as this allows them to be tightly coupled to their respective L1 caches.
Now, what if there's a TLB miss and an L1 cache hit?
We start again from the LSU. The L1 access and TLB access are fired in parallel. The TLB returns no hit. Then the second-level TLB is probed; again no hit. At this point, the page table walker is fired up. It starts by fetching the relevant entry of the top-level page table, using the table's base address held in a special CPU register (CR3 on x86), indexed by the relevant bits of the virtual address we want to fetch. This is a physical address, so it doesn't go through the TLB. It's crucial to understand that page tables are just data in RAM, like everything else on your machine. If a table is touched often, like the top-level page table, it's most likely in a nearby cache. So in this case, we find the relevant entry of the top-level page table in the L1d cache.
This entry tells us, among other things, whether the rough area of memory we want is mapped and whether we are allowed to access it. Most importantly, it gives us the base of the next page table, again indexed by the relevant bits of the virtual address. So we fetch the next page table entry. Again, it's just a physical address into RAM; it could be anywhere. It could be in our cache, in main RAM, in the cache of another CPU in the system, and if your OS writer is particularly insane, it might even be an exposed register in a memory-mapped PCI device. (All real operating systems only use ordinary memory for page tables.) In this case, we find it in main RAM.
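The walk described above can be sketched as a short loop. This is a toy model, not any CPU's exact format: the layout (9 index bits per level, 8-byte entries, 4 KiB pages) is an assumption for the sketch, and real entries also carry the permission and present bits mentioned above.

```python
# Toy 3-level page walk. Physical memory is modeled as a dict
# mapping physical address -> 8-byte table entry.
PAGE_SHIFT = 12
LEVEL_BITS = 9   # 512 entries per table (assumed)
LEVELS = 3

def walk(memory, top_base, va):
    """Translate va to a physical address by walking the tables."""
    base = top_base  # from the special CPU register (CR3 on x86)
    for level in range(LEVELS):
        shift = PAGE_SHIFT + LEVEL_BITS * (LEVELS - 1 - level)
        index = (va >> shift) & ((1 << LEVEL_BITS) - 1)
        # Each fetch is a plain physical-memory read: it may hit the
        # L1d, a far cache, or go all the way to RAM, as in the example.
        base = memory[base + index * 8]
    # Last entry is the physical page frame; glue the offset back on.
    return base | (va & ((1 << PAGE_SHIFT) - 1))

# For va = 0x12345678 the per-level indices are 0, 145, 325.
memory = {
    0x1000 + 0 * 8:   0x2000,       # top-level entry -> mid table
    0x2000 + 145 * 8: 0x3000,       # mid-level entry -> leaf table
    0x3000 + 325 * 8: 0xABCDE000,   # leaf entry -> physical frame
}
print(hex(walk(memory, 0x1000, 0x12345678)))  # 0xabcde678
```

Each iteration of that loop is one dependent memory access, which is exactly why a walk whose entries miss the caches is so expensive.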
For this example, we are using 3-level page tables, and the last level is also in main RAM, so we had to wait for an L1 access + a main RAM access + a main RAM access = €#%ing forever to touch a piece of data that was sitting in the L1 cache. This is why having good TLBs with great hit rates is critical for performance on modern machines.