By: rwessel (robertwessel.delete@this.yahoo.com), August 8, 2013 1:03 pm
Room: Moderated Discussions
Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on August 7, 2013 3:33 pm wrote:
> rwessel (robertwessel.delete@this.yahoo.com) on August 4, 2013 10:58 pm wrote:
> > Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on August 4, 2013 5:42 pm wrote:
> > > Melody Speer (pseudo.delete@this.nym.net) on August 4, 2013 4:11 am wrote:
> > > > Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on August 3, 2013 8:23 am wrote:
> > > > > So an instruction asks the LSU to get ahold of "#12345678",
> > > > > and it asks the LSU to translate the virtual address
> > > > > into a physical address (here is a blank. How does it do
> > > > > this? What unit translates the virtual address into
> > > > > a physical address, and how?)
> > > >
> > > > Er... no. The LSU is part of the CPU core proper, and it has two jobs:
> > > >
> > > > 1. Compute the address #12345678
> > > > 2. Stall the processor until the result is ready.
> > > >
> > > > The address #12345678 is then sent to two places. The lower 12 bits (#678)
> > > > are sent to the L1 cache and used to look up several possible cache lines.
> > > >
> > > > At the same time, the upper 20 bits (#12345) are sent to the TLB. I'm going to assume a
> > > > processor with a 64 GB physical address space (like 32-bit x86 processors with PAE), so
> > > > if we get lucky and get a TLB hit, we discover that this corresponds to the physical address
> > > > #abcdef. (If we get a TLB miss, we need to do a page table walk and try again.)
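To make the TLB-hit case concrete, here's a minimal Python sketch (my own toy illustration, not any real design): the TLB modeled as a dictionary from virtual page numbers to physical page numbers, preloaded with the example translation above.

```python
# Toy TLB: a dict from virtual page number (VPN) to physical page
# number (PPN). Values are the example numbers from this thread.
PAGE_SHIFT = 12                      # 4 KB pages -> 12 offset bits
tlb = {0x12345: 0xabcdef}            # one cached translation

def translate(vaddr):
    """Physical address on a TLB hit, None on a miss (a miss would
    kick off a page table walk in real hardware)."""
    vpn = vaddr >> PAGE_SHIFT
    ppn = tlb.get(vpn)
    if ppn is None:
        return None
    return (ppn << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))

print(hex(translate(0x12345678)))    # hit: 0xabcdef678
print(translate(0xdeadb000))         # miss: None
```

Note that the offset bits pass through untranslated, which is exactly why they can be sent to the L1 cache in parallel with the TLB lookup.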
> > > >
> > > > Then we take those upper 24 physical address bits and compare them to the set of candidates found in the
> > > > L1 cache. If we are lucky, one of them matches (#abcdef678 is found in the L1 cache) and we're done.
> > > >
> > > > If the L1 cache misses, we have the full physical address #abcdef678 to
> > > > send to higher levels in the cache hierarchy: L2, L3, and main memory.
> > > >
> > > >
> > > > A typical L1 cache is made up of 64-byte cache lines. In addition to the data,
> > > > each line has a "cache tag": upper address bits (24 bits in my example), a validity
> > > > bit, a dirty bit, and possibly some LRU bits to control cache replacement.
> > > >
> > > > 64 of those lines (4096 bytes) make up a cache way. These are direct-mapped: line 0 starts
> > > > at address 0, line 1 starts at address 64, line 2 starts at address 128, and so on. Line
> > > > 63 starts at address 4032. Thus, for any given low address, there's only one possible cache
> > > > line that could contain that data. (Our example #678 is line 25, which covers addresses #640
> > > > through #67f.) You can fetch that line and check the high-order bits for a match.
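A quick sketch of that direct-mapped indexing in Python (sizes assumed from the description above: 64 lines of 64 bytes per way; the page offset of the example address #12345678 is #678):

```python
# Toy direct-mapped way: 64 lines of 64 bytes (4 KB total). Which
# single line can hold a given low address, and what range does that
# line cover?
LINE_SIZE = 64
NUM_LINES = 64

def line_for(addr):
    return (addr // LINE_SIZE) % NUM_LINES

def line_range(line):
    start = line * LINE_SIZE
    return (start, start + LINE_SIZE - 1)

print(line_for(0x678))                        # 25
print(tuple(hex(b) for b in line_range(25)))  # ('0x640', '0x67f')
```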
> > > >
> > > > Then (again, I'm using typical numbers) 8 of those ways make up the full 32K L1 cache. When
> > > > given an address, the 8 ways find a candidate cache line in parallel. When they're done,
> > > > the TLB has the upper address bits, and those are checked against the 8 candidates to find
> > > > the correct one. Then the lowest-order bits choose the correct part of that line.
> > > >
> > > > L1 cache latency is critical and there are a huge number of tricks that have been tried
> > > > to speed it up. For example, the L1 cache may return some data to the LSU saying "I think
> > > > this is the data you want, but I may be wrong". The CPU can then start processing with
> > > > that data, but back up and retry if it turns out that the cache's first guess was wrong.
> > >
> > > Wow those were definitely some... Amazing reads. I may have bitten off a little more than I can chew.
> > > I read through the explanations a handful of times each and think I have a fair grip on it, but still
> > > some questions (as usual. I apologize for missing probably very, very obvious information.)
> > >
> > > So, if I understand correctly (this is with address translation ON as well as with a TLB, as
> > > it makes most sense to learn with these intricacies included as most modern CPUs have them);
> > >
> > > An instruction requests from the LSU a line of data (I suppose we'll use data for the example?), the
> > > request is handed along with a "challenge" made up in order to calculate the virtual address.
> >
> >
> > I'd not call it a challenge, rather we compute an index into the cache from some of the address bits.
> >
> >
> > > At this point in time, the LSU does two things simultaneously;
> > > it will search both the TLB for an entry of when
> > > the translation has already been done in the past, and it will
> > > also search the L1 cache itself for the tag corresponding
> > > to the line of data needed; this can vary depending on if the
> > > cache is direct mapped, fully associative, or set-way
> > > associative. (the explanations on those things were top-notch, might I add. Thank you!)
> >
> >
> > I'd say it was searching for a cache entry with a tag matching the required address.
> >
> > The cache is ultimately a kind of memory. The index to the line in the cache is computed,
> > then the entire line is read from the cache, and then the tag for each entry in that line
> > is checked against the desired address. If one of them matches, you have a hit.
> >
> >
> > > If there is a TLB hit, then the translation is brought in along with the cache line if the tag is found
> > > and they are compared. If they match, the data is found. If there is a cache miss, the translation found
> > > in the TLB is used to search the other caches until the data is found. If there is a TLB miss, it won't
> > > matter if there is an L1 hit. If there is a TLB miss, the challenge is given to the AGU which calculates
> > > the challenge into a physical address, and that physical address is used to find the data entry.
> >
> >
> > No, the AGU is usually the thing that figures out what logical address the instruction is requesting.
> > In the example we're using, the AGU computes that "25(ebx,4*edx)" is pointing to logical address
> > 0x12345678. On some processors, it may do some additional duties (on x86, for example, the AGU is
> > what usually handles segmentation, but it's still producing the required logical address).
> >
> > If we're running DAT off, the logical address can be used directly as a physical address.
> >
> > If we're running DAT on, the logical address is a virtual address, and must be translated,
> > via the page translation mechanism, to a physical address. In DAT on mode only, the virtual
> > address is given to the TLB for translation to a physical address, or if that fails is passed
> > to the page table walker (in whatever form that exists) to generate the translation.
> >
> >
> > > I know I got at least one thing wrong as I see a "blank". Specifically;
> > >
> > > rwessel said in this quote: "Now if we go back to the LSU, and instead we’re running with DAT
> > > on. Now the LSU generates logical address 0x12345678, and since we’re running DAT on, it’s
> > > actually a virtual address. So the LSU it passes the virtual address to the address translation
> > > unit, which translates it to a physical address. Let’s say it ends up 0x88888678."
> > >
> > > Now, isnt the challenge (when I refer to the challenge, I mean ""mov eax,25(ebx,4*edx)"") supposed
> > > to be directly given to the AGU? Or does the LSU change the challenge into a virtual address and the
> > > AGU calculates it into a physical address (with no calculations though? I don't understand.)
> >
> >
> > Again, the AGU (acknowledging that the terminology here is not quite standardized) does the "25(ebx,4*edx)"
> > to 0x12345678 thing. Some later stage in the LSU (again, there's not a formal set of boundaries) is
> > what translates that logical address to a physical address (if we're running DAT on). After that happens
> > is when you start to hit caches (with the exception of the overlap done in many L1 caches).
> >
> >
> > > Also, here's one bit of information I'm missing that is probably dreadfully obvious but...
> > > What is a "cache line"? Is it an entry or a part of an entry? I don't understand that.
> >
> >
> > The definition makes the most sense for a set-associative cache. A cache line is what's referred to
> > by the index used to do the first part of the lookup in the cache. A cache line contains several (again,
> > ignoring direct mapped or fully associative caches) cache entries. Let's say we're using the scheme
> > previously described, where the index is the seventh through twelfth lowest bits of the address. IOW,
> > if we have a 32 bit (physical) address we can consider it split as follows (in bits):
> >
> > ttttttttttttttttttttiiiiiioooooo
> >
> > The oooooo is the offset within the 64-byte cache entry. The iiiiii is the
> > index into the cache, and the ttt...ttt is the twenty-bit tag comparator.
> >
> > Thus any address with the same iiiiii bits will index to the same cache line. If we allow more
> > than one cache entry in a cache line we have a N-way set associative cache. So if we have a four
> > way set associative cache, each cache line will contain four cache entries, each of which can hold
> > a different block of 64 bytes (although all the blocks/entries will have the same index).
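As a sketch, those three fields can be pulled out of a 32-bit address like this (assuming 64-byte entries and a 6-bit index, with everything left over as the tag - my own toy code, not anything from a real design):

```python
# Split a 32-bit address into tag / index / offset, assuming 64-byte
# entries (6 offset bits) and 64 lines (6 index bits); whatever is
# left over (20 bits here) is the tag.
OFFSET_BITS = 6
INDEX_BITS = 6

def split(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# Two addresses with the same index bits map to the same cache line
# even though their tags differ:
print(split(0x12345678))   # tag 0x12345, index 25, offset 0x38
print(split(0x99999678))   # different tag, same index 25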
> >
> > If we're talking about caches in generic terms (from a program visible
> > perspective), we often use cache line as a synonym for cache entry.
> >
> >
> > > Lastly, just to verify, the page walker is the component that does the "searching"
> > > with the physical address in the event of a TLB miss, correct?
> >
> > The page walker searches *for* the physical address in the page tables, given a virtual
> > address as input. It should generate the same result as a TLB hit (which also translates
> > a virtual address to a physical address), but does the process the hard way - instead of
> > just looking at recent translations to see if we've done it already (which is what the TLB
> > does), the page walker has to step through the actual page tables to get the result.
> >
>
> Sorry for the late reply, I've been awfully busy lately. I really do appreciate you
> taking the time to educate me though, this is really, really interesting stuff!
>
> So when the LSU searches both the TLB and the L1 cache, it is looking at ALL the tags in the L1 cache, and
> once found, it compares the ENTIRE entry that was found in the L1 cache to the translation in the TLB? How
> often do these not match? It'd seem like they would 100% of the time... Maybe I'm not seeing something.
If you had a fully associative cache, the lookup would search all the tags (one for each cache entry) for one matching the desired address. If you had a direct-mapped cache, the index developed from the address would be used to select and read a single cache entry, and then that one entry's tag would be compared. For a set-associative cache, one entire line of cache entries would be selected (and read), and then the tag for each of those cache entries would be compared.
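To make the set-associative case concrete, here's a small Python sketch (a toy model with assumed sizes: 4 ways, 64 lines, 64-byte entries): the index picks one line, the whole line is read, and only that line's four tags are compared.

```python
# Toy 4-way set-associative lookup.
WAYS = 4
OFFSET_BITS = 6
INDEX_BITS = 6
NUM_LINES = 1 << INDEX_BITS

# each entry: (valid, tag, data); start with everything invalid
cache = [[(False, 0, None)] * WAYS for _ in range(NUM_LINES)]

def lookup(addr):
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    for valid, entry_tag, data in cache[index]:  # one line, read at once
        if valid and entry_tag == tag:
            return data                          # hit
    return None                                  # miss

# install the example block in way 0 of its line, then look it up
idx = (0x12345678 >> OFFSET_BITS) & (NUM_LINES - 1)
cache[idx][0] = (True, 0x12345678 >> 12, "64-byte block")
print(lookup(0x12345678))   # hit
print(lookup(0x99999678))   # same line, different tag: miss (None)
```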
> So basically, the AGU converts the calculations for the proposed virtual
> (or physical with DAT off?) and hands the tag off to the LSU?
No, the tag is in the cache entry - it says what part of memory is held by that entry. It is compared to part of the address being looked up. For example, with 64 byte cache entries, you don't need to compare the low six bits (since those would just select a particular byte within the cache entry). With a direct-mapped or set-associative cache, the index does not need to be stored in the tag or compared either, since that part of the comparison happens as a side effect of the indexing process - IOW, all addresses with 010111 in the second six bits (from the low end) are going to go into cache line 23, so if you're looking for an address with 010111 in those bits, you could only ever find it in cache line 23 - so you don't have to store those bits in the tag.
The AGU generates a logical address, which is then either treated as a physical address, and used as is, or as a virtual address, and then translated to a physical address.
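For example, the effective-address arithmetic for "mov eax,25(ebx,4*edx)" is just base + scale*index + displacement. In this sketch the register values are made up so the result lands on the thread's example address:

```python
# AGU arithmetic for "mov eax,25(ebx,4*edx)": effective address =
# base + scale*index + displacement, wrapped to 32 bits.
def agu(base, index, scale, disp):
    return (base + scale * index + disp) & 0xFFFFFFFF

ebx = 0x1234165F   # hypothetical register contents
edx = 0x1000       # hypothetical register contents
print(hex(agu(ebx, edx, 4, 25)))   # 0x12345678
```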
> So basically, cache lines are a list of cache entries (the tags at least) consisting of one
> of each entry from however large the associativeness is? That would assume that a directly
> mapped cache has a cache line with only one entry in it at a time and a fully associative
> cache would have the tag of all the entries of the cache in it at one time, correct?
A cache line is (almost always) several complete cache entries. If you had a four-way set-associative cache with 64-byte entries and 40 bits (five bytes) of additional info (tag, validity flags, protection bits, and whatever else the implementation decided to toss in there), a read of one cache line would actually read some 276* bytes out of the cache, all at once. It's reasonable to consider that four-way set-associative cache to be a 2208-bit-wide memory.
It is technically correct to consider a fully associative cache as having one line (although you would use a different physical implementation - there is never a reason to "read" the whole "line" in a fully associative cache, it's effectively already available), and a direct-mapped cache as having cache lines with a single cache entry in them, but no one ever uses the terms like that. Similarly, no one ever calls sailplanes "zero engine airplanes" (although that's certainly true).
*276 = 4*(64+5)
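Spelled out (same assumed numbers: 4 ways, 64 data bytes plus 5 overhead bytes per entry):

```python
# The footnote's arithmetic: 4 ways x (64 data bytes + 5 overhead
# bytes per entry) per line, all read at once.
WAYS = 4
DATA_BYTES = 64
OVERHEAD_BYTES = 5       # 40 bits of tag/flags, as assumed above

line_bytes = WAYS * (DATA_BYTES + OVERHEAD_BYTES)
print(line_bytes)        # 276 bytes
print(line_bytes * 8)    # 2208 bits wide
```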
> So the page walker uses the virtual address to locate the physical address of the needed cache entry?
The page table walker finds the physical address associated with the virtual address. It's used only when the TLB does not already know that. In either case the resulting physical* address is used to search the cache.
*Again, ignoring the existence/possibility of virtual addressing in caches.
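As a rough sketch of what the walker does on a TLB miss - here a two-level, x86-style walk with 4 KB pages (10-bit directory index, 10-bit table index, 12-bit offset); the table contents are invented so the example translation comes out:

```python
# Toy two-level page walk. A real walker reads PDEs/PTEs from memory;
# dicts stand in for those tables here.
PAGE_SHIFT = 12
page_table = {0x345: 0xabcdef}        # PTE slot -> physical page number
page_directory = {0x48: page_table}   # PDE slot -> second-level table

def walk(vaddr):
    table = page_directory.get(vaddr >> 22)
    if table is None:
        return None                   # no PDE: page fault
    ppn = table.get((vaddr >> PAGE_SHIFT) & 0x3FF)
    if ppn is None:
        return None                   # no PTE: page fault
    return (ppn << PAGE_SHIFT) | (vaddr & 0xFFF)

print(hex(walk(0x12345678)))          # 0xabcdef678, same as a TLB hit
print(walk(0))                        # unmapped: None
```

The whole point is that walk() and a TLB hit must produce identical answers; the TLB is just a cache of the walker's recent results.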