By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), February 1, 2017 10:17 am
Room: Moderated Discussions
rwessel (robertwessel.delete@this.yahoo.com) on January 31, 2017 5:54 pm wrote:
>
> But I'm curious, when is it handy to know which PTE dirtied (or referenced) a page? All
> I'm coming up with is pretty contrived. My best idea is to generate data to decide to
> migrate a page to a better NUMA node (or conversely, manage thread-to-node affinity).
>
> > In fact, a per-virtual-mapping dirty bit is so useful that we end up not just tracking the
> > usual dirty state (that most CPU's give us in hardware), we end up having a second sw-only
> > dirty state that we call the "soft dirty" state for tracking things like "has this page been
> > changed since we last looked at it", which is useful for things like checkpointing.
>
> MVS does the same at the physical page level, for similar reasons. For the reference bit as well.
So I suspect MVS does a lot less physical page sharing.
Unix, with mmap() and fork(), often ends up sharing the same physical page across multiple different virtual mappings, and the mappings can also have entirely different protections (ie the same page might be mapped by both root and a normal user).
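As a minimal userspace sketch of that sharing (not kernel code): the same file page is mapped twice with different protections, and a write through one mapping is visible through the other. The function name `two_mappings_share_page` is made up for illustration, and the temporary file is just a convenient backing object.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one page of a temporary file twice: once read-write, once read-only.
 * Returns 1 if the two mappings are distinct virtual addresses that see the
 * same data, 0 if not, and -1 on setup failure.
 */
static int two_mappings_share_page(void)
{
    FILE *f = tmpfile();
    if (!f)
        return -1;

    int fd = fileno(f);
    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0 || ftruncate(fd, pagesz) != 0) {
        fclose(f);
        return -1;
    }

    /* Two virtual mappings of the same page, with different protections. */
    char *rw = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *ro = mmap(NULL, (size_t)pagesz, PROT_READ, MAP_SHARED, fd, 0);
    int result = -1;

    if (rw != MAP_FAILED && ro != MAP_FAILED) {
        strcpy(rw, "hello");            /* write through the writable mapping */
        result = (rw != ro) && strcmp(ro, "hello") == 0;
    }

    if (rw != MAP_FAILED)
        munmap(rw, (size_t)pagesz);
    if (ro != MAP_FAILED)
        munmap(ro, (size_t)pagesz);
    fclose(f);
    return result;
}
```

Only the read-write mapping can ever dirty the page here, which is exactly the kind of per-mapping fact that a per-physical-page dirty bit cannot express.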
So one example actually comes from the fairly recent "Dirty COW" security issue: we actually want to check whether this particular mapping already dirtied the page or not. The code did something else exactly because of the s390 difference (ie the dirty bit wasn't actually per-mapping on s390, so the obvious test didn't work there), and that was one of the causes of the security bug.
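A toy sketch of the "did this particular mapping already dirty the page" test, loosely in the spirit of the check described above. Everything here is hypothetical: `cow_pte`, the `TOY_FOLL_*` flags and `toy_can_follow_write` are illustration names, not the kernel's actual definitions.

```c
#include <stdbool.h>

/* Hypothetical flags for a forced write that has gone through COW. */
#define TOY_FOLL_FORCE 0x1u
#define TOY_FOLL_COW   0x2u

/* Hypothetical, simplified page table entry for one virtual mapping. */
struct cow_pte {
    bool writable;   /* mapping allows writes */
    bool dirty;      /* THIS mapping has written to the page */
};

/*
 * A write may proceed through this mapping if the mapping is writable, or
 * if it is a forced write whose COW break already happened, which we can
 * only tell from the per-mapping dirty bit.
 */
static bool toy_can_follow_write(const struct cow_pte *pte, unsigned int flags)
{
    if (pte->writable)
        return true;
    return (flags & TOY_FOLL_FORCE) && (flags & TOY_FOLL_COW) && pte->dirty;
}
```

With a dirty bit that only exists per physical page, the last test cannot distinguish "this mapping wrote to its private copy" from "somebody else wrote to the shared page", which is why the obvious test did not work everywhere.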
Other cases tend to be about various TLB shootdown things: if the page table entry is dirty for the mapping that is being torn down, that is different from the page itself being dirty. A dirty page is meaningless for TLB shootdown, but a dirty mapping of a page means that there may still be writing activity going on in that thread, and the TLB needs to be invalidated synchronously (to make sure that the dirty state of the mapping is in sync with the dirty state of the page) rather than being batched up for later. And batching up TLB invalidates is actually kind of a big deal, so you don't want to do the synchronous invalidate in general. Ergo: check the dirty bit.
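The teardown decision can be sketched as a toy model, assuming a simplified PTE with its own dirty bit. The types `toy_pte` and `toy_page`, the `flush_mode` enum and `toy_unmap` are all hypothetical and only illustrate the logic, not the real kernel paths.

```c
#include <stdbool.h>

/* Hypothetical per-mapping state (one page table entry). */
struct toy_pte {
    bool present;
    bool dirty;      /* this mapping has written to the page */
};

/* Hypothetical per-physical-page state. */
struct toy_page {
    bool dirty;
};

enum flush_mode { FLUSH_BATCHED, FLUSH_SYNC };

/*
 * Tear down one mapping and say how its TLB entry must be invalidated.
 * A dirty mapping may still have writes in flight, so its dirty state is
 * transferred to the page and the flush must be synchronous. A clean
 * mapping can have its flush batched, whatever the page's own dirty state.
 */
static enum flush_mode toy_unmap(struct toy_pte *pte, struct toy_page *page)
{
    enum flush_mode mode = FLUSH_BATCHED;

    if (pte->present && pte->dirty) {
        page->dirty = true;
        mode = FLUSH_SYNC;
    }
    pte->present = false;
    pte->dirty = false;
    return mode;
}
```

Note that the decision never reads `page->dirty`: only the mapping's own dirty bit matters for the flush, so the common clean case keeps the cheap batched path.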
So there are these kinds of details where the per-mapping dirty bit is useful. They're not huge, but when there is one silly unusual architecture that does things differently from all the rest, that can be very annoying when you share code and a lot of logic (and trust me, you absolutely want to share details like page table teardown and TLB shootdown across architectures - it's really really easy to get these things wrong).
So I doubt MVS people or other s390-only users ever cared about the fact that s390 is different. The s390-specific projects could always work around the idiosyncrasies (or even take advantage of them). But a fairly portable project like Linux that supports 30 different architectures ends up having these kinds of issues.
> > But yes, the biggest annoyance was just that we share the VM(code)
>
> No disagreement.
>
> Of course, the number of Linux images on zArch in use is unlikely
> to drive major internal changes to the kernel.
It turns out that the s390 people made their page tables have a SW-managed dirty bit, and that solved these annoyances.
> BTW, I'm sure your post went wonky after the (code) (where you used angle brackets), probably up to some point
> where he found something else that looked like a tag.
Yeah, very possible. I detest editing html, it's not meant for humans. This site would be better off with markdown or something. Whatever.
> I'm surprised that you can't make use of some of that under the hood for zArch. I assume that somewhere
> there's a "copy space-A/address-B to space-C/address-D for length E" function.
Yes, we might be able to. It's not heavily used, though, because it's a special Linux-only system call that doesn't exist in POSIX, so "not very commonly used" + "s390 isn't that common" = "nobody has ever even looked at it".
Linus