By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), January 28, 2017 12:49 pm
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on January 27, 2017 6:46 am wrote:
>
> The TLB is a good example. It is larger in servers to deal with
> larger data footprints. But TLBs are power-hungry and large.
I'm not actually sure that the TLB is all that good an example.
There are various trade-offs wrt TLB's, and "larger data footprints" is only one issue.
Another (big) issue for TLB's is interface and usage, and latency of the inevitable misses (and lots of misses are compulsory, not capacity).
To be specific, for Intel the traditional TLB lookup model was (and that traditional model still affects things today, even if it's slightly modified):
(a) no address space ID's
(b) very frequent TLB flushes
where (a) is simply for historical interface reasons and (b) comes from how the traditional win32 GDI subsystem worked (plus other workarounds for things like PAE), which caused the TLB to be flushed all the time.
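To make the interface difference concrete, here's a toy C model (purely illustrative, nothing like how real hardware is organized; the table size and linear lookup are made up): without an address-space tag on each entry, switching address spaces has no choice but to dump the whole TLB, so the next process starts cold with nothing but compulsory misses.

    #include <stdio.h>
    #include <string.h>

    #define TLB_ENTRIES 64

    struct tlb_entry {
        unsigned long vpn;  /* virtual page number */
        unsigned long pfn;  /* physical frame number */
        unsigned asid;      /* address-space ID tag */
        int valid;
    };

    static struct tlb_entry tlb[TLB_ENTRIES];
    static unsigned current_asid;

    /* No-ASID model (classic x86): entries carry no owner tag, so an
     * address-space switch must throw everything away. */
    static void switch_mm_no_asid(void)
    {
        memset(tlb, 0, sizeof(tlb));
    }

    /* ASID model: just retag. Resident entries that belong to other
     * address spaces stay put and simply don't match until their
     * owner runs again. */
    static void switch_mm_asid(unsigned next_asid)
    {
        current_asid = next_asid;
    }

    static int tlb_lookup(unsigned long vpn)
    {
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].asid == current_asid && tlb[i].vpn == vpn)
                return 1;
        return 0;
    }

    int main(void)
    {
        tlb[0] = (struct tlb_entry){ .vpn = 42, .pfn = 7, .asid = 0, .valid = 1 };

        switch_mm_asid(1);      /* switch away and back, with ASIDs... */
        switch_mm_asid(0);
        printf("with ASIDs:    %s\n", tlb_lookup(42) ? "hit" : "miss");

        switch_mm_no_asid();    /* ...and the same round trip without them */
        printf("without ASIDs: %s\n", tlb_lookup(42) ? "hit" : "miss");
        return 0;
    }

Modern x86 did eventually grow PCIDs to soften exactly this, but the legacy flush-happy model is what shaped the miss-handling hardware.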
Now, the naive assumption is that "if the TLB gets flushed very often, then a small TLB makes sense". But that's not actually what happened.
What happened is that because of the constant TLB flushes, the TLB re-population simply got a lot more critical for Intel (and AMD) CPU's than it tends to be for the "server only" CPU's that you mention.
If you are a POWER architect, the easy way to make the TLB more effective is to just make it much bigger. You are running only big jobs that almost never flush the TLB, so making the TLB bigger is simply a no-brainer. Same is true on zArch.
But I really want to stress that "no-brainer" part. It's not a "clever" approach. It's also not something that should be seen as a good thing, as "big box serious hardware for real men and women". It's the stupid approach that just happens to work pretty well with the kinds of boring and big - but otherwise pretty well-behaved - applications that those CPU's were mostly running.
The thing is, the POWER page tables are traditionally horrible nasty shit: TLB misses take a long time and need software support to fill in the hashed page tables. It's a really bad design, and the big TLB's are pure brawn with no brains. I'm not as familiar with s390, so I'm not going to say horrible things about that, but I do want to stress that "big TLB" is not necessarily a good thing in itself.
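For a feel of why that fill path hurts, here's a heavily simplified C sketch of a hashed-page-table style lookup. The hash functions, table size, and group size are invented stand-ins (the real architecture is more involved), but the shape of the work is the point: two hashed probes and linear scans of PTE groups before you even know a page isn't mapped, versus a radix walk where each level is a single indexed load.

    #include <stdio.h>

    #define PTEG_SIZE 8       /* PTEs per group; simplified assumption */
    #define NUM_PTEGS 1024    /* hypothetical table size */

    struct pte {
        unsigned long vpn;    /* tag: which virtual page this maps */
        unsigned long pfn;
        int valid;
    };

    static struct pte htab[NUM_PTEGS][PTEG_SIZE];

    /* A TLB miss hashes into the table and linearly scans a PTE
     * group; on failure, a secondary hash selects a second group to
     * scan. Up to 16 tag compares across two dependent memory
     * accesses, and a final miss still traps to software. */
    static struct pte *htab_lookup(unsigned long vpn)
    {
        unsigned long hash1 = vpn % NUM_PTEGS;     /* simplified primary hash   */
        unsigned long hash2 = (~vpn) % NUM_PTEGS;  /* simplified secondary hash */

        for (int i = 0; i < PTEG_SIZE; i++)
            if (htab[hash1][i].valid && htab[hash1][i].vpn == vpn)
                return &htab[hash1][i];
        for (int i = 0; i < PTEG_SIZE; i++)
            if (htab[hash2][i].valid && htab[hash2][i].vpn == vpn)
                return &htab[hash2][i];
        return NULL;  /* not mapped: software has to (re)insert a PTE */
    }

    int main(void)
    {
        htab[42 % NUM_PTEGS][0] = (struct pte){ .vpn = 42, .pfn = 7, .valid = 1 };
        printf("vpn 42 -> %s\n", htab_lookup(42) ? "found" : "miss");
        return 0;
    }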
Because the other side of the coin really is how well you can fill that TLB.
And if you are better at filling the TLB, you may simply need a smaller TLB in practice.
Put another way: I'd rather take a smaller, smarter, low-latency cache that is coupled with a smart memory subsystem that can have multiple outstanding misses and does a good job of prefetching and not stalling the pipeline in inconvenient situations.
And I think you'd say "Duh! Of course" if you thought about it from the standpoint of the regular L1D$.
I'm saying that the exact same thing is true of the TLB. Size is absolutely not everything. "Big" does not automatically equal "good". If everything else is equal, big is obviously better, but everything else is very much not equal.
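A back-of-envelope comparison shows how that plays out. The numbers in this sketch are invented for illustration, but the arithmetic (average cost = hit cost + miss rate * miss penalty) is the real argument:

    #include <stdio.h>

    int main(void)
    {
        /* All numbers below are made up for illustration, not measured. */
        double hit = 1.0;                    /* cycles for a TLB hit */

        /* Big TLB, slow fill: misses are rare, but each one goes
         * through a slow, software-assisted hashed-table fill. */
        double big = hit + 0.005 * 400.0;    /* 0.5% misses, 400 cycles each */

        /* Smaller TLB, fast fill: twice the miss rate, but a
         * hardware walker resolves each miss quickly. */
        double small = hit + 0.010 * 30.0;   /* 1% misses, 30 cycles each */

        printf("big TLB, slow fill:   %.2f cycles/translation\n", big);
        printf("small TLB, fast fill: %.2f cycles/translation\n", small);
        return 0;
    }

The bigger TLB misses half as often and still loses, because each of its misses costs more than ten times as much.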
Server loads are in many ways simpler than desktop and mobile loads. A lot of traditional server loads can be handled by just adding "more" of everything. More cores. Bigger caches and TLB's. More memory. More, more, more. But seldom "clever". Brute force over nimble and smart. I don't think people call some of those things dinosaurs just because they are old.
Linus