By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), October 3, 2021 11:09 am
Room: Moderated Discussions
Doug S (foo.delete@this.bar.bar) on October 3, 2021 10:09 am wrote:
>
> Zeroing has room for optimization, both since you will often zero more than one page at a time and
> because zeroes are rarely read before they are overwritten - so you want that activity to occur outside
> of the cache.
No, you don't, actually.
People have tried various pre-zeroing schemes over and over and over again, and it's always been a loss in the end.
Why? Caches work, and they grow over time. And basically every single time you zero something, you are doing so because you're going to access much of the end result - even if it's just to overwrite it with final data - in the not too distant future.
Pre-zeroing and doing it at a DRAM or memory controller level is always going to be the wrong answer. It's going to mean that when you access it, you're now going to take that very expensive cache miss.
Yes, you can always find benchmarks where pre-zeroing is great, because you can pick the benchmark where you have just the right working set size, and you can time the memory operations to when they are most effective for that benchmark.
And then on real loads it won't work at all. In fact, even on the benchmark it will be a loss on other microarchitectures with bigger caches - so you're basically pessimising for the future.
So what you want to do is to zero your memory basically as late as possible, just before it gets used. That way the data will be close when it is accessed. Even if it's accessed just to write the actual new data on top - a lot of zeroing is for initialization and security reasons, and to make for consistent behavior - it will at least already be dirty and exclusive in your caches, which is exactly what you want for a write.
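(To make that concrete, here is a minimal C sketch - the struct and function names are made up purely for illustration - of zeroing right before use, so the stores that follow land on lines the memset has already made dirty and exclusive:)

#include <stdlib.h>
#include <string.h>

struct record { long id; char payload[56]; };

/* Late zeroing: the memset pulls the lines into the cache in
   modified/exclusive state, so the stores that immediately follow
   hit warm, already-owned lines instead of missing to DRAM. */
struct record *make_record(long id)
{
    struct record *r = malloc(sizeof *r);
    if (!r)
        return NULL;
    memset(r, 0, sizeof *r);  /* zero just before use */
    r->id = id;               /* real data goes on top while it's hot */
    return r;
}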
So for big sparse arrays (or huge initial allocations), you may actually be much better off allocating them with something like a "mmap()" interface for anonymous memory (pick whatever non-unix equivalent), and just telling the system that you will need this much memory, but then depend on demand-paging to zero the pages for you before use.
Yes, you'll then take the page faults dynamically, but it might well end up much better than pre-zeroing big buffers that you won't use for a while.
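(A minimal sketch of that mmap() approach on Linux - the allocation size and touch pattern are just illustrative:)

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1UL << 30;  /* ask for 1 GiB up front */

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Only the pages you actually touch get faulted in and zeroed
       by the kernel, right before they are used. */
    for (size_t i = 0; i < len; i += 1UL << 20)
        buf[i] = 1;

    munmap(buf, len);
    return 0;
}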
As a rule of thumb, you never ever want to move memory accesses closer to DRAM, unless you have been explicitly told "I don't want this data any more" (or you have some really good detection of "this working set won't fit in any caches").
DRAM is just too far away, and caches are too effective - and you very seldom know how much cache you have on a software level.
Side note: that detection of "this working set won't fit in any caches" may well be about the CPU knowing the size of a memory copy or memory clear operation ahead of time, and taking those kinds of very explicit hints into account. Which is just another reason you should have memory copy support in hardware, and not do it in software.
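(For contrast, here is roughly what the software version of that size hint looks like - a hypothetical clear routine, not anything from a real library, that uses the known length to decide whether to bypass the caches with non-temporal stores. The threshold is a pure guess, which is exactly the problem: software rarely knows the real cache sizes, while the hardware does.)

#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

#define NT_THRESHOLD (8u << 20)  /* made-up guess at "bigger than the LLC" */

static void clear_mem(void *dst, size_t n)
{
    if (n < NT_THRESHOLD || ((uintptr_t)dst & 15)) {
        memset(dst, 0, n);                     /* stay in the caches */
        return;
    }
    __m128i zero = _mm_setzero_si128();
    char *p = dst, *end = p + (n & ~(size_t)15);
    for (; p < end; p += 16)
        _mm_stream_si128((__m128i *)p, zero);  /* bypass the caches */
    _mm_sfence();
    memset(end, 0, n & 15);                    /* unaligned tail */
}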
Linus