By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), October 5, 2021 11:19 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on October 3, 2021 11:09 am wrote:
> Doug S (foo.delete@this.bar.bar) on October 3, 2021 10:09 am wrote:
>>
>> Zeroing has room for optimization, both since you will often zero more than one page at a time and
>> because zeroes are rarely read before they are overwritten - so you want that activity to occur outside
>> of the cache.
>
> No you don't, actually.
Yes, one does want to avoid evicting a cache line only to fill it with zeroes that will typically be overwritten before being read.
One does want to avoid write misses, but this does not require actually writing the data. A straightforward method to defer such zeroing would be cache compression that supports cache-line-granular deduplication. Tracking zero pages and having hardware fill on demand seems better (less storage overhead): it effectively compresses at page granularity, presumably with decomposition to cache line granularity.
This seems to be merely having hardware perform the same COW optimization that OSes typically apply at virtual memory page granularity.
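(As a very rough illustration, and not a description of any real hardware, such per-line zero tracking could be modeled with one "known zero" bit per cache line of a page; the metadata layout and all names below are invented for the sketch. Declaring a page zero then costs one bitmap store rather than 4 KiB of writes, and a demand read of a tracked line materializes zeroes without touching DRAM.)

    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64     /* assumed line size */
    #define PAGE_LINES 64     /* 4 KiB page / 64 B lines */

    /* Hypothetical per-page metadata: bit i set => line i reads as zero. */
    struct zero_page_meta {
        uint64_t line_is_zero;
    };

    /* Software (or the OS) declares the whole page zero up front. */
    static void declare_page_zero(struct zero_page_meta *m)
    {
        m->line_is_zero = ~UINT64_C(0);
    }

    /* On a demand read of line i, 'hardware' fills with zeroes instead
       of fetching stale data from DRAM; no memory traffic is needed. */
    static void fill_line_on_read(const struct zero_page_meta *m, unsigned i,
                                  uint8_t line[LINE_BYTES])
    {
        if (m->line_is_zero & (UINT64_C(1) << i))
            memset(line, 0, LINE_BYTES);
        /* else: normal fill from DRAM (omitted) */
    }

    /* A write miss to a known-zero line just clears its bit; the store
       data overwrites the zero fill, so the zeroes themselves are never
       actually written to memory. */
    static void write_line(struct zero_page_meta *m, unsigned i)
    {
        m->line_is_zero &= ~(UINT64_C(1) << i);
    }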
(The Mill proposal even uses on-demand dynamic physical page mapping [a small benefit of virtual caches]. For small datasets in short-lived programs, writeback to memory might never happen; while an OS could attempt such by preferentially using recently freed pages, the OS is not generally aware of [or able to discover cheaply] what is cached.)
In theory, application-level code could free/zero memory at cache line granularity (though the overhead of even a single instruction might not be worth the benefit in cache capacity).
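(AArch64 offers exactly this at block granularity: DC ZVA zeroes a block, typically a cache line, without necessarily fetching the old data first. A minimal sketch, assuming Linux with DC ZVA enabled for user space (SCTLR_EL1.DZE set, DCZID_EL0.DZP clear) and a block-aligned buffer:)

    #include <stdint.h>
    #include <stddef.h>

    /* Query the zeroing block size advertised by DCZID_EL0.
       Bits [3:0] give log2(block size in 4-byte words). */
    static size_t dczva_block_size(void)
    {
        uint64_t dczid;
        __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
        return (size_t)4 << (dczid & 0xf);
    }

    /* Zero [p, p + len) one block at a time; the lines need not be
       read from memory first. Assumes p is block-aligned and len is
       a multiple of the block size. */
    static void zero_blocks(void *p, size_t len)
    {
        size_t bs = dczva_block_size();
        uintptr_t addr = (uintptr_t)p;
        uintptr_t end = addr + len;
        for (; addr < end; addr += bs)
            __asm__ volatile("dc zva, %0" : : "r"(addr) : "memory");
    }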
Even when very large pages are desired for reducing address translation overhead, one could, in theory, track clearing at finer granularity. (Here I think the Mill as proposed has issues. As I understand it, zeroing is tied to page size, and page coalescing is 'lazy' rather than 'eager' as a consequence of zero regions being defined such that a materialized page cannot itself be marked to-be-zeroed. Eager allocation of larger pages seems impractical in this design. This is not especially difficult to change, but I perceive it as less than ideal.)
[snip]
> Pre-zeroing and doing it at a DRAM or memory controller level is always going to be the wrong answer.
> It's going to mean that when you access it, you're now going to take that very expensive cache miss.
While I could agree that zeroing at DRAM will usually not be appropriate, I rather suspect there are cases where it would be useful. The cost of DRAM row zeroing could be very low (though the size difference between a virtual memory page and a DRAM row/page may constrain the utility of such).
[snip]
> So what you want to do is to zero your memory basically as late as possible, just before it gets used.
Physically that may be nearly true (zeroing a write buffer on a cache line write miss to a zero cache line), but having software declare earlier that a large block is zero may be useful in reducing software activity, reducing software complexity, and facilitating some hardware optimizations.
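(To make the 'declare early' point concrete with an existing page-granularity interface: on Linux, madvise(MADV_DONTNEED) on a private anonymous mapping lets software declare a block logically zero without writing a single zero; the kernel drops the pages and zero-fills on the next touch. A minimal sketch:)

    #define _DEFAULT_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 1 << 20;   /* 1 MiB */
        /* Private anonymous mapping: zero-filled on demand. */
        unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        buf[0] = 42;            /* dirty a page */

        /* Declare the block logically zero: no zeroes are written now;
           the pages are simply dropped. */
        if (madvise(buf, len, MADV_DONTNEED) != 0)
            return 1;

        printf("%d\n", buf[0]); /* prints 0: rematerialized as a zero page */
        munmap(buf, len);
        return 0;
    }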
If one is coding 'close to the metal', precisely optimized software may be appropriate (like software prefetch or even branch hints), with the recognition that more precise optimizations tend to be more fragile.
The 'do what I tell you to do when I tell you to do it' model of software-hardware interaction may not be ideal in all circumstances. Even with the 'as if' rule, such a model hinders optimization (and the 'as if' rule itself may have side-channel security or performance reliability issues, i.e., hardware might be caught not doing exactly as software instructed).
(I do not have a proposal for a better method of communicating between software and hardware — I do think it should be bi-directional. I suspect experts in information theory and economics could significantly help in the engineering of superior interfaces.)