By: Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com), August 4, 2013 5:42 pm
Room: Moderated Discussions
Melody Speer (pseudo.delete@this.nym.net) on August 4, 2013 4:11 am wrote:
> Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on August 3, 2013 8:23 am wrote:
> > So an instruction asks the LSU to get ahold of "#12345678",
> > and it asks the LSU to translate the virtual address
> > into a physical address (here is a blank. How does it do
> > this? What unit translates the virtual address into
> > a physical address, and how?)
>
> Er... no. The LSU is part of the CPU core proper, and it has two jobs:
>
> 1. Compute the address #12345678
> 2. Stall the processor until the result is ready.
>
> The address #12345678 is then sent to two places. The lower 12 bits (#678)
> are sent to the L1 cache and used to look up several possible cache lines.
>
> At the same time, the upper 20 bits (#12345) are sent to the TLB. I'm going to assume a
> processor with a 64 GB physical address space (like 32-bit x86 processors with PAE), so
> if we get lucky and get a TLB hit, we discover that this corresponds to the physical address
> #abcdef. (If we get a TLB miss, we need to do a page table walk and try again.)
>
> Then we take those upper 24 physical address bits and compare them to the set of candidates found in the
> L1 cache. If we are lucky, one of them matches (#abcdef678 is found in the L1 cache) and we're done.
>
> If the L1 cache misses, we have the full physical address #abcdef678 to
> send to higher levels in the cache hierarchy: L2, L3, and main memory.
>
>
> A typical L1 cache is made up of 64-byte cache lines. In addition to the data,
> each line has a "cache tag": upper address bits (24 bits in my example), a validity
> bit, a dirty bit, and possibly some LRU bits to control cache replacement.
>
> 64 of those lines (4096 bytes) make up a cache way. These are direct-mapped: line 0 starts
> at address 0, line 1 starts at address 64, line 2 starts at address 128, and so on. Line
> 63 starts at address 4032. Thus, for any given low address, there's only one possible cache
> line that could contain that data. (Our example #678 is line 25, which covers addresses #640
> through #67f.) You can fetch that line and check the high-order bits for a match.
>
> Then (again, I'm using typical numbers) 8 of those ways make up the full 32K L1 cache. When
> given an address, the 8 ways find a candidate cache line in parallel. When they're done,
> the TLB has the upper address bits, and those are checked against the 8 candidates to find
> the correct one. Then the lowest-order bits choose the correct part of that line.
>
> L1 cache latency is critical and there are a huge number of tricks that have been tried
> to speed it up. For example, the L1 cache may return some data to the LSU saying "I think
> this is the data you want, but I may be wrong". The CPU can then start processing with
> that data, but back up and retry if it turns out that the cache's first guess was wrong.
Wow, those were definitely some amazing reads. I may have bitten off a little more than I can chew. I read through the explanations a handful of times each and think I have a fair grip on it, but I still have some questions (as usual; I apologize for missing what is probably very, very obvious information).
So, if I understand correctly (this is with address translation ON as well as with a TLB, since it makes the most sense to learn with these intricacies included, as most modern CPUs have them):
An instruction requests a line of data from the LSU (I suppose we'll use data for the example?), and the request is handed over along with a "challenge" made up in order to calculate the virtual address.
At this point, the LSU does two things simultaneously: it searches the TLB for an entry recording that the translation has already been done in the past, and it also searches the L1 cache itself for the tag corresponding to the line of data needed; exactly how can vary depending on whether the cache is direct-mapped, fully associative, or set-associative. (The explanations of those things were top-notch, might I add. Thank you!)
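(To check that I have the bit-slicing part right, I tried writing it out as a tiny C program, using the numbers from your example: 4 KB pages, 64-byte lines, 64 lines per way. This is just my own illustration, of course, not anything the hardware literally runs:)

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t vaddr = 0x12345678;           /* the running example address */

        /* The low 12 bits are identical in the virtual and physical address
           (4 KB pages), so the L1 can start using them before the TLB answers. */
        uint32_t page_offset = vaddr & 0xFFF;        /* 0x678, sent to the L1    */
        uint32_t vpn         = vaddr >> 12;          /* 0x12345, sent to the TLB */

        /* Within those 12 bits: which 64-byte line, and which byte inside it. */
        uint32_t set_index   = (vaddr >> 6) & 0x3F;  /* bits 11..6 -> line 25 */
        uint32_t byte_offset = vaddr & 0x3F;         /* bits 5..0  -> byte 56 */

        printf("offset %#x, vpn %#x, set %u, byte %u\n",
               page_offset, vpn, set_index, byte_offset);
        return 0;
    }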
If there is a TLB hit, then the translation is brought in along with the cache line (if the tag is found) and they are compared; if they match, the data is found. If there is a cache miss, the translation found in the TLB is used to search the other caches until the data is found. If there is a TLB miss, it won't matter whether there is an L1 hit: the challenge is given to the AGU, which calculates it into a physical address, and that physical address is used to find the data entry.
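(And here is the overall flow the way I currently picture it, as a toy C model. The one-entry TLB, the walk_page_tables stand-in, and so on are all things I made up for illustration, not real hardware interfaces:)

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SETS 64             /* 64 lines x 64 bytes = one 4 KB way */
    #define WAYS 8              /* 8 ways = 32 KB total               */

    struct line { bool valid; uint32_t tag; uint8_t data[64]; };

    static struct line l1[SETS][WAYS];

    /* One-entry "TLB", just to show the idea. */
    static bool tlb_valid;
    static uint32_t tlb_vpn, tlb_pfn;

    static uint32_t walk_page_tables(uint32_t vpn) {
        return vpn ^ 0xABCDE;   /* fake translation, just for the demo */
    }

    static bool load_byte(uint32_t vaddr, uint8_t *out) {
        uint32_t vpn = vaddr >> 12;
        uint32_t set = (vaddr >> 6) & (SETS - 1);  /* from untranslated bits */
        uint32_t off = vaddr & 63;

        /* TLB lookup; in hardware this happens in parallel with reading
           the 8 candidate lines of the chosen set. */
        if (!tlb_valid || tlb_vpn != vpn) {
            tlb_pfn = walk_page_tables(vpn);       /* TLB miss: page walk */
            tlb_vpn = vpn;
            tlb_valid = true;
        }

        for (int w = 0; w < WAYS; w++) {           /* compare the 8 tags */
            struct line *ln = &l1[set][w];
            if (ln->valid && ln->tag == tlb_pfn) {
                *out = ln->data[off];              /* L1 hit */
                return true;
            }
        }
        return false;  /* L1 miss: the physical address goes to L2/L3/memory */
    }

    int main(void) {
        uint8_t b;
        printf("hit? %d\n", load_byte(0x12345678, &b));  /* 0: cache is empty */
        return 0;
    }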
I know I got at least one thing wrong, as I see a "blank". Specifically:
rwessel said in this quote: "Now if we go back to the LSU, and instead we’re running with DAT on. Now the LSU generates logical address 0x12345678, and since we’re running DAT on, it’s actually a virtual address. So the LSU passes the virtual address to the address translation unit, which translates it to a physical address. Let’s say it ends up 0x88888678."
Now, isn't the challenge (when I refer to the challenge, I mean "mov eax,25(ebx,4*edx)") supposed to be given directly to the AGU? Or does the LSU change the challenge into a virtual address while the AGU calculates it into a physical address (but with no calculations? I don't understand.)
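(In case it helps show what I mean: as far as I can tell, the address calculation for that instruction is just this arithmetic, which I've written out in C. The register values are made up by me so the sum lands on the example address:)

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Made-up register contents. */
        uint32_t ebx = 0x12340003, edx = 0x1597;

        /* What I understand gets computed for "mov eax,25(ebx,4*edx)":
           base + index*scale + displacement, giving the VIRTUAL address. */
        uint32_t vaddr = ebx + 4 * edx + 25;

        printf("%#x\n", vaddr);   /* prints 0x12345678 */
        return 0;
    }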
Also, here's one bit of information I'm missing that is probably dreadfully obvious, but... what is a "cache line"? Is it an entry or a part of an entry? I don't understand that.
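(My best guess so far, written as a C struct pieced together from the description above; the exact field sizes are just my own assumption:)

    #include <stdint.h>

    /* My guess at what one "cache line" entry holds: 64 data bytes
       plus the tag bookkeeping. Purely illustrative. */
    struct cache_line {
        uint32_t tag   : 24;   /* upper physical address bits            */
        uint32_t valid : 1;    /* does the line hold real data?          */
        uint32_t dirty : 1;    /* modified since it was filled?          */
        uint32_t lru   : 2;    /* replacement-policy bits                */
        uint8_t  data[64];     /* the 64 bytes of actual memory contents */
    };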
Lastly, just to verify, the page walker is the component that does the "searching" with the physical address in the event of a TLB miss, correct?
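(Here is how I picture the walk itself, as a toy C sketch. I used the classic two-level 32-bit x86 layout for simplicity, even though the example above assumed PAE, and fake_mem plus the hand-built entries are obviously made up:)

    #include <stdint.h>
    #include <stdio.h>

    /* Toy two-level walk, classic 32-bit x86 (non-PAE) layout:
       10-bit directory index, 10-bit table index, 12-bit page offset.
       fake_mem[] stands in for physical memory. */

    static uint32_t fake_mem[1 << 20];   /* 4 MB of pretend physical memory */

    static uint32_t phys_read(uint32_t paddr) { return fake_mem[paddr / 4]; }

    static uint32_t walk(uint32_t cr3, uint32_t vaddr) {
        uint32_t dir_idx = (vaddr >> 22) & 0x3FF;
        uint32_t tbl_idx = (vaddr >> 12) & 0x3FF;

        uint32_t pde = phys_read(cr3 + dir_idx * 4);             /* 1st level */
        uint32_t pte = phys_read((pde & ~0xFFFu) + tbl_idx * 4); /* 2nd level */

        return (pte & ~0xFFFu) | (vaddr & 0xFFF);  /* frame + page offset */
    }

    int main(void) {
        uint32_t cr3 = 0x1000;
        /* Hand-build one mapping: vaddr 0x12345678 -> paddr 0x88888678,
           matching rwessel's example. */
        uint32_t dir_idx = (0x12345678 >> 22) & 0x3FF;
        uint32_t tbl_idx = (0x12345678 >> 12) & 0x3FF;
        fake_mem[(cr3 + dir_idx * 4) / 4]    = 0x2000;     /* PDE -> table  */
        fake_mem[(0x2000 + tbl_idx * 4) / 4] = 0x88888000; /* PTE -> frame  */

        printf("%#x\n", walk(cr3, 0x12345678));  /* prints 0x88888678 */
        return 0;
    }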
Thank you very much for your extremely detailed information. Things are turning out to be extremely different from what I originally thought they were. Simpler in some ways and far more complex in others.