By: Michael S (already5chosen.delete@this.yahoo.com), July 6, 2013 9:57 am
Room: Moderated Discussions
Patrick Chase (patrickjchase.delete@this.gmail.com) on July 5, 2013 11:37 am wrote:
> Etienne (etienne_lorrain.delete@this.yahoo.fr) on July 5, 2013 3:14 am wrote:
> > Sorry to switch subject, I have a question I do not know where to ask:
> > (please note I am not a hardware specialist)
> >
> > A processor with a two-layer cache writes a byte to zero, and none of the memory is in cache.
> > So the processor looks for the address in the layer 1 cache, does not find it, looks
> > in the layer 2 cache, does not find it, and instructs the layer 2 cache to fetch the memory.
>
> This depends on the allocation policies of the L1 and L2 caches. Many modern processors default
> to "allocate on read miss" (or simply "read-allocate") for either L1 or both, which means that a
> cache line will only be allocated if a *load* misses the cache. You've specified a store above,
> so in such a core there would be no changes to the cache contents. The reasoning behind the read-allocate
> policy is that many workloads involve streaming write-only data (no temporal locality, entire cache
> line will be over-written). Loading the old version of such data from memory or evicting other data
> from cache are both counterproductive, so you ideally want it to bypass cache.
>
Huh?
Show me not "many", but just one modern general-purpose processor with write-back cache that does not write-allocate by default. AFAIK, there are none.
Streaming stores are another matter.
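To make the distinction concrete: on x86 "streaming" means explicit non-temporal store instructions, not a different default allocation policy. Here is a minimal sketch, assuming SSE2 intrinsics; the function names are mine, purely for illustration:

#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_setzero_si128, _mm_sfence */
#include <stddef.h>

/* Ordinary stores to write-back memory: on a write-allocate cache each
 * missed line is first fetched into L1/L2 and then modified there. */
void fill_regular(char *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = 0;
}

/* Non-temporal ("streaming") stores: the lines are not allocated in cache;
 * the data drains to memory through write-combining buffers instead.
 * dst is assumed 16-byte aligned here. */
void fill_streaming(char *dst, size_t n) {
    __m128i zero = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        _mm_stream_si128((__m128i *)(dst + i), zero);
    _mm_sfence();               /* order the streamed data before later stores */
    for (; i < n; i++)          /* scalar tail for the last few bytes */
        dst[i] = 0;
}

With fill_streaming the written lines never displace anything in L1/L2, which is the behavior described above, but you have to ask for it explicitly.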
> Microarchitectures with "allocate on read miss" policies invariably feature write-combining buffers, which are
> a small number of cache-line-sized buffers that accumulate data from stores that miss cache. This allows the
> core to exploit spatial locality by coalescing adjacent writes into (up to) cache-line-sized transactions. In
> most microarchitectures loads can snoop the write-combining buffers. In my experience the combination of read-allocate
> caches and snooped write-combining buffers works very well for a wide range of workloads.
>
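To picture what those buffers do, here is a toy software model (my own sketch, not any particular implementation): one buffer tracks a single line-sized region, merges in the bytes of each store to that region, and can be drained as one full-line write once every byte is covered. Loads can consult byte_valid to snoop pending data.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64

typedef struct {
    uint64_t line_addr;          /* line-aligned address, meaningful when in_use */
    uint8_t  data[LINE_SIZE];    /* coalesced store data */
    uint64_t byte_valid;         /* one bit per byte of the line */
    bool     in_use;
} wc_buffer;

/* Merge a store into the buffer; returns false if it targets a different line
 * (a real design would then allocate another buffer or flush this one).
 * Assumes the store does not cross a line boundary. */
static bool wc_merge(wc_buffer *b, uint64_t addr, const uint8_t *src, size_t len) {
    uint64_t line = addr & ~(uint64_t)(LINE_SIZE - 1);
    size_t off = (size_t)(addr - line);
    if (off + len > LINE_SIZE)
        return false;
    if (b->in_use && b->line_addr != line)
        return false;
    b->in_use = true;
    b->line_addr = line;
    memcpy(&b->data[off], src, len);
    for (size_t i = 0; i < len; i++)
        b->byte_valid |= 1ull << (off + i);
    return true;
}

/* Once every byte has been produced by some store, the buffer can be
 * written out as a single full-line transaction. */
static bool wc_full(const wc_buffer *b) {
    return b->in_use && b->byte_valid == ~0ull;
}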
> > When the layer 1 cache line is filled, the instruction to clear the byte executes.
> > If at that point both cache lines are no longer needed and are evicted,
>
> No. If the cache uses an allocate-on-write-miss allocation policy then the line will stay
> in cache until evicted based on its replacement policy (LRU, random, adaptive, etc).
>
> > what size is written
> > back to memory: a layer 1 cache line or a layer 2 cache line (which would be a lot bigger).
> > In other words, are there "dirty bits" for every layer 1 cache line inside the layer 2 cache?
>
> The line sizes for all levels are typically the same. If they weren't then coherency and OS cache
> management would become more complicated. With that said, a cache that has valid/dirty bits for
> partial lines is called a sectored cache (a very old technique first used in the 360/85). The
> only recent microarchitecture that I know of that uses a sectored cache is the NVIDIA Fermi GPU
> family, which appear to use a sectored L2 with 128-byte lines and 32-byte sectors [*].
>
> -- Patrick
Crystalwell's L4 should do something similar. It's not yet documented in the Intel optimization reference manual, but anything else simply does not make technical sense.
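For concreteness, here is a toy model of the bookkeeping for one line in such a sectored cache, using the 128-byte line / 32-byte sector split inferred for Fermi above. The names and layout are made up for illustration; neither NVIDIA nor Intel have documented the real structures:

#include <stdint.h>

#define SECTORS_PER_LINE 4
#define SECTOR_SIZE      32

typedef struct {
    uint64_t tag;                       /* address tag for the whole 128-byte line */
    uint8_t  valid;                     /* one valid bit per 32-byte sector */
    uint8_t  dirty;                     /* one dirty bit per 32-byte sector */
    uint8_t  data[SECTORS_PER_LINE * SECTOR_SIZE];
} sectored_line;

/* A store to byte offset 'off' within the line only needs its own sector
 * to be valid, and marks just that sector dirty. */
static void store_byte(sectored_line *l, unsigned off, uint8_t v) {
    unsigned s = off / SECTOR_SIZE;
    l->data[off] = v;
    l->valid |= (uint8_t)(1u << s);
    l->dirty |= (uint8_t)(1u << s);
}

/* On eviction, only the dirty sectors need to be written back. */
static unsigned writeback_bytes(const sectored_line *l) {
    unsigned bytes = 0;
    for (unsigned s = 0; s < SECTORS_PER_LINE; s++)
        if (l->dirty & (1u << s))
            bytes += SECTOR_SIZE;
    return bytes;
}

The point is that a store only dirties its own 32-byte sector, so a fill or an eviction can move 32 or 64 bytes instead of the full 128.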
>
> [*] I determined that empirically by looking at the relative performance of 32-byte
> texture cache line fills and 128-byte L1 line fills with various address patterns -
> Nvidia have never publicly described the Fermi L2 configuration that I know of.
>