By: Mark Roulo (nothanks.delete@this.xxx.com), October 3, 2021 1:22 pm
Room: Moderated Discussions
rwessel (rwessel.delete@this.yahoo.com) on October 3, 2021 12:49 pm wrote:
> Mark Roulo (nothanks.delete@this.xxx.com) on October 3, 2021 12:41 pm wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on October 3, 2021 11:09 am wrote:
> > > Doug S (foo.delete@this.bar.bar) on October 3, 2021 10:09 am wrote:
> > > >
> > > > Zeroing has room for optimization, both since you will often zero more than one page at a time and
> > > > because zeroes are rarely read before they are overwritten - so you want that activity to occur outside
> > > > of the cache.
> > >
> > > No you don't, actually.
> > >
> > > People have tried various pre-zeroing schemes over and over
> > > and over again, and it's always been a loss in the end.
> > >
> > > Why? Caches work, and they grow over time. And basically every single time you zero
> > > something, you are doing so because you're going to access much of the end result -
> > > even if it's just to overwrite it with final data - in the not too distant future.
> > >
> > > Pre-zeroing and doing it at a DRAM or memory controller level is always going to be the wrong answer.
> > > It's going to mean that when you access it, you're now going to take that very expensive cache miss.
> > >
> > > Yes, you can always find benchmarks where pre-zeroing is great, because you can
> > > pick the benchmark where you have just the right working set size, and you can time
> > > the memory operations to when they are most effective for that benchmark.
> > >
> > > And then on real loads it won't work at all. In fact, even on the benchmark it will be a loss on
> > > other microarchitectures with bigger caches - so you're basically pessimising for the future.
> > >
> > > So what you want to do is to zero your memory basically as late as possible, just before it gets used. That
> > > way the data will be close when it is accessed. Even if
> > > it's accessed just for writing the actual new data on
> > > top - a lot of zeroing is for initialization and security reasons, and to make for consistent behavior - it
> > > will be at least already dirty and exclusive in your caches, which is exactly what you want for a write.
> > >
> > > So for big sparse arrays (or huge initial allocations), you may actually be much better
> > > off allocating them with something like a "mmap()" interface for anonymous memory (pick
> > > whatever non-unix equivalent), and just telling the system that you will need this much
> > > memory, but then depend on demand-paging to zero the pages for you before use.
> > >
> > > Yes, you'll then take the page faults dynamically, but it might well end up
> > > much better than pre-zeroing big buffers that you won't use for a while.
> > >
> > > As a rule of thumb, you never ever want to move memory accesses closer to DRAM,
> > > unless you have been explicitly told "I don't want this data any more" (or you have
> > > some really good detection of "this working set won't fit in any caches").
> > >
> > > DRAM is just too far away, and caches are too effective - and you
> > > very seldom know how much cache you have on a software level.
> > >
> > > Side note: that detection of "this working set won't fit in any caches" may well be
> > > about the CPU knowing the size of a memory copy or memory clear operation ahead of time,
> > > and taking those kinds of very explicit hints into account. Which is just another reason
> > > you should have memory copy support in hardware, and not do it in software.
> > >
> > > Linus
> >
> > How confident are you about this for languages such as Java? Especially
> > Java running in a server context with sophisticated multi-threaded GC?
> >
> > This seems like the sort of environment where zero-ing the data to be allocated later might be
> > a win, especially if the data could be zero-d when nothing else needed the DRAM bandwidth.
>
>
> Punting that off to a different core would work too. Using one side of a multi-threaded
> CPU in something so thoroughly memory bound would probably have only modest impact on the
> other side. And that extra core would be useful at times when you're not zeroing memory.
This still puts pressure on DRAM bandwidth in a way that (a) zeroing when the bus is otherwise idle and (b) zeroing internally in the DRAM do not.
Maybe your typical Java server load won't see any improvement from (a) or (b), though. I don't know. Maybe it doesn't matter, or maybe current JVMs already do this.
[I do know that I have a load where being able to assume that memory, 2MB at a time, starts off zeroed would be great! But I also know that this is niche (and in my case not even a Java load). And I am bandwidth limited, so it would matter, too.]
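
For what it's worth, here is a minimal sketch of the mmap() route described above, with an assumed, Linux-specific MADV_HUGEPAGE hint thrown in to match my 2MB case. This is just my reading of the suggestion (the kernel hands back demand-zeroed anonymous pages on first touch), not what any particular JVM or allocator actually does:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64 * (2UL << 20);   /* 128 MB, thought of as 2 MB chunks */

    /* Anonymous private mapping: the kernel guarantees these pages read as
     * zero, but no physical page is allocated until it is first touched. */
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

#ifdef MADV_HUGEPAGE
    /* Linux-specific hint that 2 MB huge pages are welcome here. */
    madvise(buf, len, MADV_HUGEPAGE);
#endif

    /* First touch: the page fault and the actual zeroing happen here, just
     * before the data is used, so the freshly zeroed line is hot in cache. */
    buf[0] = 42;
    printf("buf[1] = %d (never written, demand-zeroed)\n", buf[1]);

    munmap(buf, len);
    return 0;
}

In the 2MB-chunk case the appeal would be that nothing spends bus cycles writing zeroes ahead of time; the cost is taking the page faults as you go, which is exactly the trade-off described in the quoted post.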