By: rwessel (robertwessel.delete@this.yahoo.com), January 29, 2017 8:56 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on January 28, 2017 11:49 am wrote:
> David Kanter (dkanter.delete@this.realworldtech.com) on January 27, 2017 6:46 am wrote:
> >
> > The TLB is a good example. It is larger in servers to deal with
> > larger data footprints. But TLBs are power-hungry and large.
>
> I'm not actually sure that the TLB is all that good an example.
>
> There are various trade-offs wrt TLB's, and "larger data footprints" is only one issue.
>
> Another (big) issue for TLB's is interface and usage, and latency of the
> inevitable misses (and lots of misses are compulsory, not capacity).
>
> To be specific, for Intel the traditional TLB lookup model was (and that traditional
> model still affects things today, even if it's slightly modified):
>
> (a) no address space ID's
>
> (b) very frequent TLB flushes
>
> where (a) is simply from historical interface reasons and (b) is from how the traditional win32 GDI subsystem
> worked (and other workarounds for things like PAE) which just caused the TLB to be flushed all the time.
>
> Now, the naive assumption is that "if the TLB gets flushed very often, then a
> small TLB makes sense". But that's not actually really all that happened.
>
> What happened is that because of the constant TLB flushes, the TLB re-population simply got a lot more
> critical for Intel (and AMD) CPU's than it tends to be for the "server only" CPU's that you mention.
>
> If you are a POWER architect, the easy way to make the TLB more effective is to
> just make it much bigger. You are running only big jobs that almost never flush the
> TLB, so making the TLB bigger is simply a no-brainer. Same is true on zArch.
>
> But I really want to stress that "no brainer" part. It's not a "clever" approach. It's also not something
> that should be seen as a good thing and be seen as "big box serious hardware for real men and women".
> It's the stupid approach that just happens to work pretty well under the kinds of boring and big
> - but otherwise pretty well-behaved - applications that those CPU's were mostly running.
>
> The thing is, the POWER page tables are traditionally horrible nasty shit, and TLB misses take a long time and
> need software support to fill in the hashed page tables. The thing is just bad. It's a really bad design,
> and the big TLB's are pure brawn with no brains. I'm not as familiar with s390, so I'm not going to say horrible
> things about that, but I do want to stress that "big TLB" is not necessarily a good thing in itself.
>
> Because the other side of the coin really is how well you can fill that TLB.
>
> And if you are better at filling the TLB, you may simply need a smaller TLB in practice.
>
> Put another way: I'd rather take a smaller, smarter, low-latency cache that is coupled
> with a smart memory subsystem that can have multiple outstanding misses and does a good
> job of prefetching and not stalling the pipeline in inconvenient situations.
>
> And I think you'd say "Duh! Of course" if you thought about it from the standpoint of the regular L1D$.
>
> I'm saying that the exact same thing is true of the TLB. Size is absolutely not
> everything. "Big" does not automatically equal "good". If everything else is equal,
> big is obviously better, but everything else is very much not equal.
>
> Server loads are in many ways simpler than desktop and mobile loads. A lot of traditional server
> loads can be handled by just putting "more" of everything. More cores. Bigger caches and TLB's.
> More memory. More, more, more. But seldom "clever". Brute force over nimble and smart. I don't
> think people call some of those things dinosaurs just because they are old.
That's at least partially untrue for zArch. Context switches are very common in MVS*, as many things happen via calls to other address spaces (most often mediated by the zArch equivalent of a call gate**). zArch's TLBs (and specifically the two-level TLBs) are there to avoid needing TLB reloads precisely because address-space context switches have been increasing significantly in frequency for the last three decades.
Much of GDI in Windows was moved back into the kernel in the NT4 days (although some has since moved back out), because the resulting number of micro-kernel-ish address-space context switches was too painful. But in general, efficient cross-address-space calls have been a major issue for MVS for decades.
As to how the page tables are organized, it's very little like POWER, and quite similar to x86 in gross terms (every detail is different, of course). It's a fairly conventional hierarchical table of 2-5 levels (allowing 2GB, 4TB, 8PB or 16EB address spaces - the upper four levels each map 11 bits, the lowest 8 bits, mainly for historical reasons***). There is some variability in what each entry maps: a higher-level table can specify that the lower-level table it references only covers a subset of its nominal range - for example, a second-level table can specify that the subordinate third-level table contains only the second and third of the four pages it would nominally contain, thus providing translations only for the middle two quarters of the region. There are some other details, not least a specific tag for common (shared) table entries ("segments").

The TLB is defined so that the origin of the page table is essentially the ASID, so translations from multiple address spaces can coexist in the TLB, and TLB entries (at least for the lower levels) that come from the same physical table page are largely assumed to be common. That obviously impacts code that needs to invalidate TLB entries, since the TLB can retain entries for tables not currently attached; a variety of flush instructions are provided. There is, of course, a fair bit of other cruft in there, as well as ties to the address space authorization and call gate mechanisms.
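To make that concrete, here's a rough C sketch (my own illustration, not anything from an IBM header - the field names are made up, and the 11/11/11/11/8-bit widths plus a 12-bit byte offset are inferred from the 2GB/4TB/8PB/16EB sizes above) of how a 64-bit virtual address splits into the five translation indices:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: splitting a 64-bit virtual address into the
   translation indices described above (11/11/11/11/8 bits plus a
   12-bit byte offset). */
struct zva {
    unsigned rfx;  /* region-first index,  11 bits, 8PB per entry  */
    unsigned rsx;  /* region-second index, 11 bits, 4TB per entry  */
    unsigned rtx;  /* region-third index,  11 bits, 2GB per entry  */
    unsigned sx;   /* segment index,       11 bits, 1MB per entry  */
    unsigned px;   /* page index,           8 bits, 4KB per entry  */
    unsigned bx;   /* byte offset within the 4KB page, 12 bits     */
};

static struct zva split_va(uint64_t va)
{
    struct zva v;
    v.rfx = (va >> 53) & 0x7FF;
    v.rsx = (va >> 42) & 0x7FF;
    v.rtx = (va >> 31) & 0x7FF;
    v.sx  = (va >> 20) & 0x7FF;
    v.px  = (va >> 12) & 0xFF;
    v.bx  =  va        & 0xFFF;
    return v;
}

int main(void)
{
    /* An address just past the 2GB mark exercises the region-third level. */
    struct zva v = split_va(0x80042ABCULL);
    printf("rfx=%u rsx=%u rtx=%u sx=%u px=%u bx=0x%X\n",
           v.rfx, v.rsx, v.rtx, v.sx, v.px, v.bx);
    return 0;
}

A 2-level translation only uses sx/px/bx (the 2GB case); each additional region level tacks another 11 bits on top.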
While fast TLB reloads are certainly a good thing, they are still an expense; avoiding unnecessary discards of TLB entries, especially on short switches to different address spaces, complements that. Just like not flushing an L1 for a short call to a different address space is a good thing, no matter how fast the reloads are.
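To illustrate the mechanism (a purely conceptual sketch, nothing like the real hardware structures): since entries are tagged with the translation-table origin, an address-space switch only changes which tag subsequent lookups match against, and nothing has to be thrown away:

#include <stdint.h>
#include <stdbool.h>

/* Conceptual sketch only: a tiny fully-associative TLB whose entries are
   tagged by the top-level table origin, so translations from several
   address spaces can coexist and a switch discards nothing. */
#define TLB_ENTRIES 8

struct tlb_entry {
    bool     valid;
    uint64_t table_origin;  /* acts as the ASID tag               */
    uint64_t vpn;           /* virtual page number (vaddr >> 12)  */
    uint64_t pfn;           /* translated physical frame number   */
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint64_t current_origin;  /* reloaded on an address-space switch */
static unsigned next_victim;

/* Address-space switch: no flush, just change the tag we match. */
static void switch_address_space(uint64_t table_origin)
{
    current_origin = table_origin;
}

static bool tlb_lookup(uint64_t va, uint64_t *pfn_out)
{
    uint64_t vpn = va >> 12;
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid &&
            tlb[i].table_origin == current_origin &&
            tlb[i].vpn == vpn) {
            *pfn_out = tlb[i].pfn;  /* hit, possibly from several switches ago */
            return true;
        }
    }
    return false;  /* miss: walk the tables, then refill */
}

static void tlb_fill(uint64_t va, uint64_t pfn)
{
    struct tlb_entry *e = &tlb[next_victim];
    next_victim = (next_victim + 1) % TLB_ENTRIES;
    e->valid = true;
    e->table_origin = current_origin;
    e->vpn = va >> 12;
    e->pfn = pfn;
}

The flip side, as noted above, is that invalidation gets harder: an entry can outlive the attachment of the address space it came from, which is why a whole family of flush instructions is provided rather than relying on a switch-time flush.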
Virtualization also benefits from the large TLBs, again, avoiding discarding a lot of TLB data on VM switches.
*Yeah, yeah, zOS. Just be glad I've stopped calling it Ozzy.
**Unlike most attempts to provide a call-gate-like function, zArch call-gates are actually quite heavily used in MVS system and application code (in application code usually hidden in a library). OTOH, the S/370 "program call" mechanism (and its evolutions under XA, ESA and zArch) was specifically designed to support MVS.
***The notion of a 1MB "segment" is baked into a lot of MVS system and application code. Just like the notion of a 4KB page.