By: rwessel (robertwessel.delete@this.yahoo.com), August 4, 2013 10:58 pm
Room: Moderated Discussions
Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on August 4, 2013 5:42 pm wrote:
> Melody Speer (pseudo.delete@this.nym.net) on August 4, 2013 4:11 am wrote:
> > Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on August 3, 2013 8:23 am wrote:
> > > So an instruction asks the LSU to get ahold of "#12345678",
> > > and it asks the LSU to translate the virtual address
> > > into a physical address (here is a blank. How does it do
> > > this? What unit translates the virtual address into
> > > a physical address, and how?)
> >
> > Er... no. The LSU is part of the CPU core proper, and it has two jobs:
> >
> > 1. Compute the address #12345678
> > 2. Stall the processor until the result is ready.
> >
> > The address #12345678 is then sent to two places. The lower 12 bits (#678)
> > are sent to the L1 cache and used to look up several possible cache lines.
> >
> > At the same time, the upper 20 bits (#12345) are sent to the TLB. I'm going to assume a
> > processor with a 64 GB physical address space (like 32-bit x86 processors with PAE), so
> > if we get lucky and get a TLB hit, we discover that this corresponds to the physical page number
> > #abcdef. (If we get a TLB miss, we need to do a page table walk and try again.)
> >
> > Then we take those upper 24 physical address bits and compare them to the set of candidates found in the
> > L1 cache. If we are lucky, one of them matches (#abcdef678 is found in the L1 cache) and we're done.
> >
> > If the L1 cache misses, we have the full physical address #abcdef678 to
> > send to higher levels in the cache hierarchy: L2, L3, and main memory.
> >
> >
> > A typical L1 cache is made up of 64-byte cache lines. In addition to the data,
> > each line has a "cache tag": upper address bits (24 bits in my example), a validity
> > bit, a dirty bit, and possibly some LRU bits to control cache replacement.
> >
> > 64 of those lines (4096 bytes) make up a cache way. These are direct-mapped: line 0 starts
> > at address 0, line 1 starts at address 64, line 2 starts at address 128, and so on. Line
> > 63 starts at address 4032. Thus, for any given low address, there's only one possible cache
> > line that could contain that data. (Our example #678 is line 25, which covers addresses #640
> > through #67f.) You can fetch that line and check the high-order bits for a match.
> >
> > Then (again, I'm using typical numbers) 8 of those ways make up the full 32K L1 cache. When
> > given an address, the 8 ways find a candidate cache line in parallel. When they're done,
> > the TLB has the upper address bits, and those are checked against the 8 candidates to find
> > the correct one. Then the lowest-order bits choose the correct part of that line.
> >
> > L1 cache latency is critical and there are a huge number of tricks that have been tried
> > to speed it up. For example, the L1 cache may return some data to the LSU saying "I think
> > this is the data you want, but I may be wrong". The CPU can then start processing with
> > that data, but back up and retry if it turns out that the cache's first guess was wrong.
>
> Wow, those were definitely some... amazing reads. I may have bitten off a little more than I can chew.
> I read through the explanations a handful of times each and think I have a fair grip on it, but still
> some questions (as usual. I apologize for missing probably very, very obvious information.)
>
> So, if I understand correctly (this is with address translation ON as well as with a TLB, as
> it makes most sense to learn with these intricacies included as most modern CPUs have them);
>
> An instruction requests from the LSU a line of data (I suppose we'll use data for the example?), the
> request is handed along with a "challenge" made up in order to calculate the virtual address.
I'd not call it a challenge; rather, we compute an index into the cache from some of the address bits.
> At this point in time, the LSU does two things simultaneously; it will search both the TLB for an entry of when
> the translation has already been done in the past, and it will also search the L1 cache itself for the tag corresponding
> to the line of data needed; this can vary depending on whether the cache is direct mapped, fully associative, or set
> associative. (the explanations on those things were top-notch, might I add. Thank you!)
I'd say it was searching for a cache entry with a tag matching the required address.
The cache is ultimately a kind of memory. The index to the line in the cache is computed, then the entire line is read from the cache, and then the tag for each entry in that line is checked against the desired address. If one of them matches, you have a hit.
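To make that concrete, here's a rough C sketch of that lookup, with made-up sizes and names (64-byte entries, 64 lines, 8 ways); a real cache does all the tag compares in parallel in hardware:

    #include <stddef.h>

    #define OFFSET_BITS 6                 /* 64-byte cache entries */
    #define INDEX_BITS  6                 /* 64 lines per way      */
    #define NUM_WAYS    8                 /* 8-way set associative */

    struct cache_entry {
        unsigned tag;                     /* upper physical address bits */
        int valid;
        unsigned char data[64];
    };

    /* One cache line holds one entry per way, all sharing the same index. */
    static struct cache_entry cache[1 << INDEX_BITS][NUM_WAYS];

    /* Return the matching entry for paddr, or NULL on a miss. */
    struct cache_entry *lookup(unsigned paddr)
    {
        unsigned index = (paddr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1);
        unsigned tag = paddr >> (OFFSET_BITS + INDEX_BITS);
        int way;

        for (way = 0; way < NUM_WAYS; way++)   /* sequential here, parallel in HW */
            if (cache[index][way].valid && cache[index][way].tag == tag)
                return &cache[index][way];     /* hit */
        return NULL;                           /* miss */
    }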
> If there is a TLB hit, then the translation is brought in along with the cache line if the tag is found
> and they are compared. If they match, the data is found. If there is a cache miss, the translation found
> in the TLB is used to search the other caches until the data is found. If there is a TLB miss, it won't
> matter if there is an L1 hit. If there is a TLB miss, the challenge is given to the AGU which calculates
> the challenge into a physical address, and that physical address is used to find the data entry.
No, the AGU is usually the thing that figures out what logical address the instruction is requesting. In the example we're using, the AGU computes that "25(ebx,4*edx)" is pointing to logical address 0x12345678. On some processors, it may do some additional duties (on x86, for example, the AGU is what usually handles segmentation, but it's still producing the required logical address).
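In C terms, that part of the AGU's job is just this bit of arithmetic (a sketch; the function name is mine):

    /* What the AGU computes for "mov eax,25(ebx,4*edx)":
       displacement + base + index * scale.                 */
    unsigned int agu(unsigned int ebx, unsigned int edx)
    {
        return 25 + ebx + 4 * edx;   /* the logical address, e.g. 0x12345678 */
    }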
If we're running DAT off, the logical address can be used directly as a physical address.
If we're running DAT on, the logical address is a virtual address and must be translated, via the page translation mechanism, to a physical address. In DAT on mode only, the virtual address is given to the TLB for translation to a physical address, or, if that fails, is passed to the page table walker (in whatever form that exists) to generate the translation.
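As a rough sketch of that decision in C (all of these function names are hypothetical, not any real API):

    int tlb_lookup(unsigned long va, unsigned long *pa);   /* hypothetical */
    unsigned long page_table_walk(unsigned long va);       /* hypothetical */
    void tlb_insert(unsigned long va, unsigned long pa);   /* hypothetical */

    unsigned long translate(unsigned long logical, int dat_on)
    {
        unsigned long phys;

        if (!dat_on)
            return logical;               /* DAT off: logical == physical  */
        if (tlb_lookup(logical, &phys))   /* seen this translation lately? */
            return phys;                  /* TLB hit                       */
        phys = page_table_walk(logical);  /* TLB miss: do it the hard way  */
        tlb_insert(logical, phys);        /* remember it for next time     */
        return phys;
    }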
> I know I got at least one thing wrong as I see a "blank". Specifically;
>
> rwessel said in this quote: "Now if we go back to the LSU, and instead we’re running with DAT
> on. Now the LSU generates logical address 0x12345678, and since we’re running DAT on, it’s
> actually a virtual address. So the LSU passes the virtual address to the address translation
> unit, which translates it to a physical address. Let’s say it ends up 0x88888678."
>
> Now, isn't the challenge (when I refer to the challenge, I mean "mov eax,25(ebx,4*edx)") supposed
> to be directly given to the AGU? Or does the LSU change the challenge into a virtual address and the
> AGU calculates it into a physical address (with no calculations though? I don't understand.)
Again, the AGU (acknowledging that the terminology here is not quite standardized) does the "25(ebx,4*edx)" to 0x12345678 thing. Some later stage in the LSU (again, there's not a formal set of boundaries) is what translates that logical address to a physical address (if we're running DAT on). After that happens is when you start to hit the caches (with the exception of the overlap done in many L1 caches).
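Stringing the sketches above together, the order of operations for a load comes out roughly like this (reusing the hypothetical agu(), translate() and lookup() from earlier):

    /* AGU first, then translation, then the L1 lookup. */
    struct cache_entry *load(unsigned int ebx, unsigned int edx, int dat_on)
    {
        unsigned long la = agu(ebx, edx);          /* "25(ebx,4*edx)"   */
        unsigned long pa = translate(la, dat_on);  /* TLB or page walk  */
        return lookup(pa);                         /* NULL == L1 miss   */
    }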
> Also, here's one bit of information I'm missing that is probably dreadfully obvious but...
> What is a "cache line"? Is it an entry or a part of an entry? I don't understand that.
The definition makes the most sense for a set-associative cache. A cache line is what's referred to by the index used to do the first part of the lookup in the cache. A cache line contains several (again, ignoring direct mapped or fully associative caches) cache entries. Let's say we're using the scheme previously described, where the index is the seventh through twelfth lowest bits of the address. IOW, if we have a 32-bit (physical) address we can consider it split as follows (in bits):
ttttttttttttttttttttiiiiiioooooo
The oooooo is the offset within the 64-byte cache entry. The iiiiii is the index into the cache, and the ttt...ttt is the twenty bit tag comparator.
Thus any address with the same iiiiii bits will index to the same cache line. If we allow more than one cache entry in a cache line we have an N-way set associative cache. So if we have a four-way set associative cache, each cache line will contain four cache entries, each of which can hold a different block of 64 bytes (although all the blocks/entries will have the same index).
If we're talking about caches in generic terms (from a program visible perspective), we often use cache line as a synonym for cache entry.
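Or, in a few lines of C pulling those fields out of the running example address (function name mine):

    #include <stdio.h>

    /* Split a 32-bit physical address per the t/i/o diagram above:
       20 tag bits, 6 index bits, 6 offset bits.                    */
    void split(unsigned int addr)
    {
        unsigned int offset = addr & 0x3f;          /* bits 0-5   */
        unsigned int index  = (addr >> 6) & 0x3f;   /* bits 6-11  */
        unsigned int tag    = addr >> 12;           /* bits 12-31 */
        printf("tag=%05x index=%02x offset=%02x\n", tag, index, offset);
    }

split(0x12345678) prints tag=12345 index=19 offset=38 (all hex) - index 0x19 is line 25, matching the earlier example.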
> Lastly, just to verify, the page walker is the component that does the "searching"
> with the physical address in the event of a TLB miss, correct?
The page walker searches *for* the physical address in the page tables, given a virtual address as input. It should generate the same result as a TLB hit (which also translates a virtual address to a physical address), but does the process the hard way - instead of just looking at recent translations to see if we've done it already (which is what the TLB does), the page walker has to step through the actual page tables to get the result.
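For concreteness, here's roughly what that walk looks like for the old two-level 32-bit x86 format (4KB pages, no PAE) - heavily simplified, and pretending we can just dereference physical addresses:

    #define PAGE_PRESENT 0x1u
    #define WALK_FAULT (~0ul)   /* no valid translation: page fault */

    unsigned long walk(const unsigned long *page_dir, unsigned long va)
    {
        unsigned long pde = page_dir[(va >> 22) & 0x3ff];   /* top 10 bits  */
        if (!(pde & PAGE_PRESENT))
            return WALK_FAULT;
        const unsigned long *pt = (const unsigned long *)(pde & ~0xffful);
        unsigned long pte = pt[(va >> 12) & 0x3ff];         /* next 10 bits */
        if (!(pte & PAGE_PRESENT))
            return WALK_FAULT;
        return (pte & ~0xffful) | (va & 0xfff);             /* frame | page offset */
    }

The real tables have more flag bits (writable, user, accessed, dirty, and so on), and a miss here is what raises the page fault the OS has to handle.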