By: Melody Speer (pseudo.delete@this.nym.net), August 4, 2013 3:11 am
Room: Moderated Discussions
Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on August 3, 2013 8:23 am wrote:
> So an instruction asks the LSU to get ahold of "#12345678", and it asks the LSU to translate the virtual address
> into a physical address (here is a blank. How does it do this? What unit translates the virtual address into
> a physical address, and how?)
Er... no. The LSU is part of the CPU core proper, and it has two jobs:
1. Compute the address #12345678
2. Stall the processor until the result is ready.
The address #12345678 is then sent to two places. The lower 12 bits (#678) are sent to the L1 cache and used to look up several possible cache lines.
At the same time, the upper 20 bits (#12345) are sent to the TLB. I'm going to assume a processor with a 64 GB physical address space (like 32-bit x86 processors with PAE), so physical addresses are 36 bits: a 24-bit page number plus the same 12-bit offset. If we get lucky and get a TLB hit, we discover that this corresponds to the physical page #abcdef. (If we get a TLB miss, we need to do a page table walk and try again.)
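To make the split concrete, here is a minimal Python sketch of the two halves of the address and a TLB hit. The `tlb` dictionary and the function names are made up for illustration (a real TLB is a small associative hardware structure, not a hash table), and hex values are written as 0x... rather than #...:

```python
PAGE_OFFSET_BITS = 12  # 4 KB pages: the low 12 bits are not translated

def split_virtual_address(vaddr):
    """Return (virtual page number, page offset) for a 32-bit virtual address."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    return vpn, offset

# Hypothetical TLB contents: virtual page number -> 24-bit physical page number.
tlb = {0x12345: 0xABCDEF}

def translate(vaddr):
    """Return the 36-bit physical address, or None on a TLB miss."""
    vpn, offset = split_virtual_address(vaddr)
    ppn = tlb.get(vpn)
    if ppn is None:
        return None  # a real core would do a page table walk and retry
    return (ppn << PAGE_OFFSET_BITS) | offset

print(hex(translate(0x12345678)))  # 0xabcdef678
```

Note that the offset never goes through the TLB at all, which is why the L1 lookup can start on it right away.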
Then we take those upper 24 physical address bits and compare them to the set of candidates found in the L1 cache. If we are lucky, one of them matches (#abcdef678 is found in the L1 cache) and we're done.
If the L1 cache misses, we have the full physical address #abcdef678 to send to higher levels in the cache hierarchy: L2, L3, and main memory.
A typical L1 cache is made up of 64-byte cache lines. In addition to the data, each line has a "cache tag": upper address bits (24 bits in my example), a validity bit, a dirty bit, and possibly some LRU bits to control cache replacement.
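As a rough picture of what one line holds, here is a Python sketch. The field names and the single integer for LRU state are my own simplifications; real hardware packs these bits in implementation-specific ways:

```python
from dataclasses import dataclass, field

LINE_SIZE = 64  # bytes of data per cache line

@dataclass
class CacheLine:
    tag: int = 0          # upper physical address bits (24 bits in the example)
    valid: bool = False   # does this line hold real data?
    dirty: bool = False   # has it been written since it was filled?
    lru: int = 0          # replacement state; actual encodings vary
    data: bytearray = field(default_factory=lambda: bytearray(LINE_SIZE))

# A line holding data from physical page 0xABCDEF.
line = CacheLine(tag=0xABCDEF, valid=True)
print(len(line.data), hex(line.tag))  # 64 0xabcdef
```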
64 of those lines (4096 bytes) make up a cache way. These are direct-mapped: line 0 starts at address 0, line 1 starts at address 64, line 2 starts at address 128, and so on. Line 63 starts at address 4032. Thus, for any given low address, there's only one possible cache line that could contain that data. (Our example #678 is line 25, which covers addresses #640 through #67f.) You can fetch that line and check the high-order bits for a match.
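The line-number arithmetic is just a divide by 64. A small Python sketch (the function names are mine), using the low 12 bits of #12345678 and one other offset for comparison:

```python
LINE_SIZE = 64       # bytes per cache line
LINES_PER_WAY = 64   # 64 lines * 64 bytes = 4096 bytes per way

def line_index(low12):
    """Which of the 64 lines in a way covers this 12-bit offset."""
    return (low12 // LINE_SIZE) % LINES_PER_WAY

def line_range(index):
    """First and last offset covered by the given line."""
    start = index * LINE_SIZE
    return start, start + LINE_SIZE - 1

def byte_in_line(low12):
    """Position of the requested byte inside its 64-byte line."""
    return low12 % LINE_SIZE

for low in (0x678, 0x789):
    idx = line_index(low)
    first, last = line_range(idx)
    print(hex(low), "-> line", idx, "covering", hex(first), "to", hex(last))
# 0x678 -> line 25 covering 0x640 to 0x67f
# 0x789 -> line 30 covering 0x780 to 0x7bf
```

In hardware there's no division, of course: bits 6 through 11 of the address are the line number and bits 0 through 5 are the byte within the line.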
Then (again, I'm using typical numbers) 8 of those ways make up the full 32K L1 cache. When given an address, the 8 ways find a candidate cache line in parallel. When they're done, the TLB has the upper address bits, and those are checked against the 8 candidates to find the correct one. Then the lowest-order bits choose the correct part of that line.
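Putting the pieces together, here is a toy Python model of the whole lookup. It is only a sketch: the `ways` list, `fill`, and `lookup` are invented for illustration, the loop stands in for comparisons that hardware does in parallel, and replacement policy is ignored entirely:

```python
LINE_SIZE = 64
LINES_PER_WAY = 64
NUM_WAYS = 8   # 8 ways * 64 lines * 64 bytes = 32768 bytes

# Each way is a list of 64 slots; a slot is None (invalid) or (tag, data).
ways = [[None] * LINES_PER_WAY for _ in range(NUM_WAYS)]

def fill(way, paddr, data):
    """Place a 64-byte line for physical address paddr into the given way."""
    index = (paddr & 0xFFF) // LINE_SIZE
    ways[way][index] = (paddr >> 12, bytes(data))

def lookup(low12, physical_tag):
    """Return the requested byte on a hit, or None on an L1 miss."""
    index = low12 // LINE_SIZE
    # Step 1: every way offers its one candidate line (needs only the low bits).
    candidates = [way[index] for way in ways]
    # Step 2: once the TLB has delivered the tag, compare it to each candidate.
    for slot in candidates:
        if slot is not None and slot[0] == physical_tag:
            # Step 3: the lowest-order bits pick the byte within the line.
            return slot[1][low12 % LINE_SIZE]
    return None

fill(3, 0xABCDEF678, range(64))     # line contents are just 0..63 here
print(lookup(0x678, 0xABCDEF))      # 56 (byte 0x38 of the line)
print(lookup(0x678, 0x123456))      # None: same line index, wrong tag
```

The point of the structure is that step 1 needs nothing from the TLB, so the cache read and the translation overlap in time.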
L1 cache latency is critical and there are a huge number of tricks that have been tried to speed it up. For example, the L1 cache may return some data to the LSU saying "I think this is the data you want, but I may be wrong". The CPU can then start processing with that data, but back up and retry if it turns out that the cache's first guess was wrong.