By: Patrick Chase (patrickjchase.delete@this.gmail.com), July 5, 2013 10:37 am
Room: Moderated Discussions
Etienne (etienne_lorrain.delete@this.yahoo.fr) on July 5, 2013 3:14 am wrote:
> Sorry to switch subjects, I have a question and I don't know where to ask it:
> (please note I am not a hardware specialist)
>
> A processor with a two-level cache writes a byte to zero, and none of the memory is in cache.
> So the processor looks for the address in the level-1 cache, doesn't find it, looks
> in the level-2 cache, doesn't find it, and instructs the level-2 cache to fetch the memory.
This depends on the allocation policies of the L1 and L2 caches. Many modern processors default to "allocate on read miss" (or simply "read-allocate") for the L1 or for both levels, which means that a cache line is only allocated when a *load* misses the cache. You've specified a store above, so in such a core there would be no change to the cache contents. The reasoning behind the read-allocate policy is that many workloads involve streaming write-only data (no temporal locality, and the entire cache line will be overwritten). Loading the old version of such data from memory and evicting other data from the cache are both counterproductive, so ideally you want such stores to bypass the cache.
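To make the bypass concrete: on x86 you can request the same behavior explicitly with non-temporal stores. A minimal C sketch (my own illustration, assuming an SSE2 target and a 16-byte-aligned buffer whose size is a multiple of 16; the function name is made up):

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Zero a write-only buffer with non-temporal (streaming) stores,
 * which go around the cache hierarchy instead of allocating lines. */
void clear_stream(uint8_t *buf, size_t n) {
    __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 16)
        _mm_stream_si128((__m128i *)(buf + i), zero);
    _mm_sfence(); /* order the streaming stores before later writes */
}

With a write-allocate cache, the equivalent plain-store loop would fetch every line from memory first and evict a buffer's worth of other data along the way.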
Microarchitectures with "allocate on read miss" policies invariably feature write-combining buffers, which are a small number of cache-line-sized buffers that accumulate data from stores that miss cache. This allows the core to exploit spatial locality by coalescing adjacent writes into (up to) cache-line-sized transactions. In most microarchitectures loads can snoop the write-combining buffers. In my experience the combination of read-allocate caches and snooped write-combining buffers works very well for a wide range of workloads.
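Here's a toy model of a single write-combining buffer entry, just to show the mechanics (all names and sizes are mine, not any shipping core's; real designs differ in buffer count and flush conditions):

#include <stdint.h>

#define LINE_SIZE 64 /* assumed line size */

typedef struct {
    uint64_t line_addr;      /* line-aligned address being combined */
    uint8_t  data[LINE_SIZE];
    uint64_t valid_mask;     /* one bit per byte written so far */
} wc_buffer;

/* A store that missed cache lands in the buffer; once every byte
 * is valid, the whole line goes out as one burst transaction. */
static void wc_store_byte(wc_buffer *b, uint64_t addr, uint8_t v) {
    unsigned off = addr & (LINE_SIZE - 1);
    b->data[off] = v;
    b->valid_mask |= (uint64_t)1 << off;
    if (b->valid_mask == ~(uint64_t)0) {
        /* fully combined: issue one line-sized write, then retire */
        /* memory_write(b->line_addr, b->data, LINE_SIZE); */
        b->valid_mask = 0;
    }
}

/* Loads snoop the buffer: report a hit if the byte is pending. */
static int wc_snoop_byte(const wc_buffer *b, uint64_t addr, uint8_t *out) {
    unsigned off = addr & (LINE_SIZE - 1);
    if ((addr & ~(uint64_t)(LINE_SIZE - 1)) == b->line_addr &&
        ((b->valid_mask >> off) & 1)) {
        *out = b->data[off];
        return 1;
    }
    return 0;
}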
> When the level-1 cache line is filled, the instruction to clear the byte executes.
> If at that point the line in both cache levels is no longer needed and is evicted,
No. The cache has no notion of data being "no longer needed". If the cache uses an allocate-on-write-miss policy, the line stays in the cache until it is evicted by the replacement policy (LRU, random, adaptive, etc.).
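For reference, this is roughly what true LRU replacement within one set looks like (an illustrative C sketch of my own; real cores usually implement a cheaper pseudo-LRU approximation):

#include <stdint.h>

#define WAYS 4 /* assumed associativity */

typedef struct {
    uint64_t tag[WAYS];
    uint8_t  valid[WAYS];
    uint8_t  age[WAYS];  /* 0 = most recently used; valid ways hold distinct ages */
} cache_set;

/* Pick a victim: an empty way if one exists, else the oldest line. */
static int lru_victim(const cache_set *s) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w]) return w;
        if (s->age[w] > s->age[victim]) victim = w;
    }
    return victim;
}

/* On a hit (or fill), make this way the most recently used. */
static void lru_touch(cache_set *s, int way) {
    for (int w = 0; w < WAYS; w++)
        if (s->valid[w] && s->age[w] < s->age[way]) s->age[w]++;
    s->age[way] = 0;
}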
> what size is written
> back to memory: a level-1 cache line or a level-2 cache line (which could be a lot bigger)?
> In other words, are there "dirty bits" for every level-1 cache line inside the level-2 cache?
The line sizes at all levels are typically the same. If they weren't, then coherency and OS cache management would become more complicated. With that said, a cache that keeps valid/dirty bits for partial lines is called a sectored cache (a very old technique, first used in the IBM 360/85). The only recent microarchitecture I know of that uses a sectored cache is NVIDIA's Fermi GPU family, which appears to use a sectored L2 with 128-byte lines and 32-byte sectors [*].
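The point of sectoring is that per-sector valid/dirty bits let a fill or writeback move only the 32-byte sectors that were actually touched. A rough C sketch of one such line using the Fermi-like parameters above (struct and function names are mine):

#include <stdint.h>

#define SECTOR_SIZE 32
#define SECTORS     4
#define LINE_SIZE   (SECTOR_SIZE * SECTORS) /* 128-byte line */

typedef struct {
    uint64_t tag;
    uint8_t  data[LINE_SIZE];
    uint8_t  valid;  /* one bit per sector */
    uint8_t  dirty;  /* one bit per sector */
} sectored_line;

/* On eviction, write back only the sectors that were modified. */
static void writeback(sectored_line *l, uint64_t line_addr) {
    for (int s = 0; s < SECTORS; s++) {
        if (l->dirty & (1u << s)) {
            /* memory_write(line_addr + s * SECTOR_SIZE,
             *              &l->data[s * SECTOR_SIZE], SECTOR_SIZE); */
        }
    }
    l->dirty = 0;
}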
-- Patrick
[*] I determined that empirically by looking at the relative performance of 32-byte texture-cache line fills and 128-byte L1 line fills with various address patterns; as far as I know, Nvidia has never publicly described the Fermi L2 configuration.