By: Doug S (foo.delete@this.bar.bar), October 5, 2021 11:21 am
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on October 5, 2021 11:19 am wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on October 3, 2021 11:09 am wrote:
> > Doug S (foo.delete@this.bar.bar) on October 3, 2021 10:09 am wrote:
> >>
> >> Zeroing has room for optimization, both since you will often zero more than one page at a time and
> >> because zeroes are rarely read before they are overwritten - so you want that activity to occur outside
> >> of the cache.
> >
> > No you don't, actually.
>
> Yes, one does want to avoid evicting a cache line to fill it
> with zeroes that will typically be overwritten before read.
>
> One does want to avoid write misses, but this does not require writing the data. A straightforward method
> to defer such actual zeroing would be to use cache compression that supports cache line granular deduplication.
> Tracking zero pages and having hardware fill on demand seems better (less storage overhead) — compressing
> at page granularity — presumably with decomposition to cache line granularity.
>
> This seems to be merely having hardware perform the same COW optimization
> that OSes typically use at virtual memory page granularity.
I'm not sure CPU architects would appreciate how casually you tossed out "merely" in that context, given how important latency is to overall cache performance. How many cycles could you add for COWing cache lines before you cancel out the performance gain from not actually writing a full set of zeroed lines? Anywhere from "less than one" to "not very many", I imagine.
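(To put rough numbers on that, with the caveat that these are round assumed figures, not measurements: a 4 KiB page is 64 lines of 64 bytes, so if the store pipeline can retire 32 bytes of zeroes per cycle, brute-force zeroing is on the order of 128 cycles per page. Add even one extra cycle to accesses that have to check or resolve a COW'd line and you've given those 128 cycles back after the first hundred or so accesses to the page, which happens almost immediately for a page that was allocated because something was about to use it.)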
The problem isn't so much writing twice to cache in quick succession, it is writing twice to DRAM in quick succession - that's what you must avoid. You want the rewrite of those zeroed lines to follow quickly after you fill them with zeroes. Creating a page's worth of zeroes in cache requires two things: allocating the lines and writing the zeroes. If writing the zeroes dominated that time, it could easily be optimized away with per-line (or per-subline) "is zero" bits.
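To make the "is zero" bits idea concrete, here's a toy C sketch of what I mean (the names, sizes, and structure are all made up for illustration, not how any real cache is organized): zeroing a page becomes one flag write per line, and the actual zero bytes only get materialized if a line is read or partially overwritten before being fully rewritten.

/* Toy model of per-line "is zero" bits: instead of writing 64 bytes of
 * zeroes into each line, zeroing a page just sets a flag per line.  The
 * zero data is only materialized if the line is actually read or partly
 * overwritten first.  All names and sizes are illustrative assumptions. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define LINE_SIZE      64
#define PAGE_SIZE      4096
#define LINES_PER_PAGE (PAGE_SIZE / LINE_SIZE)

struct cache_line {
    uint8_t data[LINE_SIZE];
    int     is_zero;   /* set => contents are logically all zeroes */
    int     dirty;
};

/* "Zero" a page's worth of lines: no data movement, just flag updates. */
static void zero_page(struct cache_line lines[LINES_PER_PAGE])
{
    for (int i = 0; i < LINES_PER_PAGE; i++) {
        lines[i].is_zero = 1;
        lines[i].dirty   = 1;   /* logically differs from memory now */
    }
}

/* Store: a partial write to a logically-zero line first materializes the
 * whole line as zeroes (the deferred work), then applies the new bytes. */
static void store(struct cache_line *l, int off, const void *src, int len)
{
    if (l->is_zero) {
        memset(l->data, 0, LINE_SIZE);
        l->is_zero = 0;
    }
    memcpy(l->data + off, src, len);
    l->dirty = 1;
}

/* Load: a read of a still-zero line is satisfied without the zeroes ever
 * having been written into the data array. */
static uint8_t load(struct cache_line *l, int off)
{
    return l->is_zero ? 0 : l->data[off];
}

int main(void)
{
    struct cache_line page[LINES_PER_PAGE] = {0};
    uint32_t v = 0xdeadbeef;

    zero_page(page);                  /* cheap: 64 flag writes, no data  */
    store(&page[0], 8, &v, sizeof v); /* first real write pays the memset */
    printf("line0[8]=%02x line1[0]=%02x\n",
           (unsigned)load(&page[0], 8), (unsigned)load(&page[1], 0));
    return 0;
}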
Writing the zeroes only dominates if all the lines you allocate are clean. Once you start allocating dirty lines that need to be flushed before they can be reused, you are forced to wait on main memory before the page becomes available. And when you need one new page you usually need several, so you will be doing dirty allocation a lot of the time.
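A trivial cost model, again with made-up round numbers (zeroing bandwidth and an exposed writeback latency are assumptions, and real memory systems overlap much of this), shows why the dirty victims are what hurt rather than the zero writes themselves:

/* Back-of-envelope cost of zeroing one 4 KiB page in cache, as a
 * function of how many victim lines are dirty.  Parameters are assumed
 * round numbers, not measurements of any real machine. */
#include <stdio.h>

int main(void)
{
    const int lines_per_page   = 4096 / 64; /* 64 lines of 64 bytes        */
    const int zero_cycles      = 2;         /* assumed: 32 B of zeroes/cycle */
    const int writeback_cycles = 200;       /* assumed: fully exposed DRAM write */

    for (int dirty_pct = 0; dirty_pct <= 100; dirty_pct += 25) {
        int dirty = lines_per_page * dirty_pct / 100;
        int cost  = lines_per_page * zero_cycles + dirty * writeback_cycles;
        printf("%3d%% dirty victims -> ~%5d cycles per page\n",
               dirty_pct, cost);
    }
    return 0;
}

Even at 25% dirty victims the writebacks swamp the ~128 cycles of zero writes in this model; the only way it stays cheap is if those writebacks can be hidden, which is exactly what you can't count on when you're allocating several pages back to back.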