By: Mark Roulo (nothanks.delete@this.xxx.com), October 3, 2021 12:41 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on October 3, 2021 11:09 am wrote:
> Doug S (foo.delete@this.bar.bar) on October 3, 2021 10:09 am wrote:
> >
> > Zeroing has room for optimization, both since you will often zero more than one page at a time and
> > because zeroes are rarely read before they are overwritten - so you want that activity to occur outside
> > of the cache.
>
> No you don't, actually.
>
> People have tried various pre-zeroing schemes over and over
> and over again, and it's always been a loss in the end.
>
> Why? Caches work, and they grow over time. And basically every single time you zero
> something, you are doing so because you're going to access much of the end result -
> even if it's just to overwrite it with final data - in the not too distant future.
>
> Pre-zeroing and doing it at a DRAM or memory controller level is always going to be the wrong answer.
> It's going to mean that when you access it, you're now going to take that very expensive cache miss.
>
> Yes, you can always find benchmarks where pre-zeroing is great, because you can
> pick the benchmark where you have just the right working set size, and you can time
> the memory operations to when they are most effective for that benchmark.
>
> And then on real loads it won't work at all. In fact, even on the benchmark it will be a loss on
> other microarchitectures with bigger caches - so you're basically pessimising for the future.
>
> So what you want to do is to zero your memory basically as late as possible, just before it gets used. That
> way the data will be close when it is accessed. Even if it's accessed just for writing the actual new data on
> top - a lot of zeroing is for initialization and security reasons, and to make for consistent behavior - it
> will be at least already dirty and exclusive in your caches, which is exactly what you want for a write.
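
(To make the ordering concrete, here is a minimal C sketch of eager vs. late zeroing; other_work() and fill() are hypothetical placeholders for unrelated work and for writing the real data:)

    #include <string.h>

    void other_work(void);            /* placeholder: unrelated work */
    void fill(char *buf, size_t n);   /* placeholder: writes the real data */

    /* Eager: zero long before use.  By the time fill() runs, the zeroed
     * cache lines may have been evicted, so the real writes miss again. */
    void zero_eagerly(char *buf, size_t n)
    {
        memset(buf, 0, n);
        other_work();
        fill(buf, n);
    }

    /* Late: zero immediately before use.  The lines are still dirty and
     * exclusive in cache when the real data lands on top of the zeroes. */
    void zero_late(char *buf, size_t n)
    {
        other_work();
        memset(buf, 0, n);
        fill(buf, n);
    }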
>
> So for big sparse arrays (or huge initial allocations), you may actually be much better
> off allocating them with something like a "mmap()" interface for anonymous memory (pick
> whatever non-unix equivalent), and just telling the system that you will need this much
> memory, but then depend on demand-paging to zero the pages for you before use.
>
> Yes, you'll then take the page faults dynamically, but it might well end up
> much better than pre-zeroing big buffers that you won't use for a while.
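
(For concreteness, a minimal sketch of the demand-paged allocation described above, assuming Linux: the kernel guarantees anonymous pages read as zero and only materializes them on first touch.)

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1UL << 32;   /* reserve 4 GiB up front */

        /* Anonymous mapping: no physical pages are allocated yet, and
         * every page is guaranteed to read as zero.  Pages are faulted
         * in (and zeroed by the kernel) on first touch. */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        buf[0] = 1;   /* first touch: page fault, kernel supplies a zeroed page */

        munmap(buf, len);
        return 0;
    }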
>
> As a rule of thumb, you never ever want to move memory accesses closer to DRAM,
> unless you have been explicitly told "I don't want this data any more" (or you have
> some really good detection of "this working set won't fit in any caches").
>
> DRAM is just too far away, and caches are too effective - and you
> very seldom know how much cache you have on a software level.
>
> Side note: that detection of "this working set won't fit in any caches" may well be
> about the CPU knowing the size of a memory copy or memory clear operation ahead of time,
> and taking those kinds of very explicit hints into account. Which is just another reason
> you should have memory copy support in hardware, and not do it in software.
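
(As an illustration of such a size hint: x86's rep stosb hands the CPU the whole length in RCX before the first byte is stored, so the microcode can pick its strategy based on size. A minimal sketch, assuming GCC/Clang inline asm on x86-64:)

    #include <stddef.h>

    /* Zero n bytes with REP STOSB.  The CPU sees the full count in RCX
     * up front, so implementations with fast string support can choose
     * how to execute the clear based on the size. */
    static void rep_clear(void *dst, size_t n)
    {
        asm volatile("rep stosb"
                     : "+D"(dst), "+c"(n)   /* destination in RDI, count in RCX */
                     : "a"(0)               /* store value: AL = 0 */
                     : "memory");
    }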
>
> Linus
How confident are you about this for languages such as Java, especially Java running in a server context with a sophisticated multi-threaded GC?
This seems like the sort of environment where zeroing the data to be allocated later might be a win, especially if the data could be zeroed when nothing else needed the DRAM bandwidth.
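
(The kind of background zeroing being asked about might look something like this - a hypothetical sketch in C using SSE2 non-temporal stores so the zeroes stream past the cache instead of displacing the mutators' working sets; background_zero and its alignment requirements are illustrative, not taken from any real collector:)

    #include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
    #include <stddef.h>

    /* Hypothetical GC helper: clear a reclaimed region with non-temporal
     * stores when DRAM bandwidth is otherwise idle.  'p' must be 16-byte
     * aligned and 'bytes' a multiple of 16. */
    static void background_zero(void *p, size_t bytes)
    {
        __m128i zero = _mm_setzero_si128();
        __m128i *dst = (__m128i *)p;

        for (size_t i = 0; i < bytes / sizeof(__m128i); i++)
            _mm_stream_si128(&dst[i], zero);   /* bypasses the cache */

        _mm_sfence();   /* order the streamed stores before the pages are reused */
    }

Whether the bandwidth saved this way outweighs the cache misses the allocating threads then take on first touch is exactly the trade-off in question.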