By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), April 28, 2017 10:12 am
Room: Moderated Discussions
wumpus (lost.delete@this.in-a.cave.net) on April 27, 2017 8:40 pm wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on April 27, 2017 4:54 pm wrote:
> >
> > So in that model the basic (traditional) run-ahead model is still based on an in-order
> > core, but with checkpointing and speculative execution to get the memory parallelism.
> >
> > And I don't believe in that model, because it's not useful for the cache hit case that
> > I already talked about.
>
> I can't see why the granularity matters for your "cache hit problem". Runahead certainly has
> granularity issues, but they come into play with branch mispredictions and exceptions.
The granularity matters, because you cannot afford to use run-ahead for each memory access that hits in the L1 cache - so run-ahead will not let you pipeline the L1 accesses for the two A/B loads in "load-A-use-A--load-B" (unless the basic engine you do run-ahead on is OoO).
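(To make the pattern concrete, here's a minimal made-up C sketch - not anyone's real code - of that sequence:)

/* Hypothetical illustration of "load-A-use-A--load-B". Both loads hit
 * in L1, and load B does not depend on load A, but an in-order core
 * stalls at the use of x until load A's 3-4 cycle L1 hit completes,
 * so it never issues load B underneath that latency. An OoO core
 * issues load B while the use of A waits, overlapping the two hits. */
int example(const int *a, const int *b)
{
    int x = *a;      /* load A: L1 hit, 3-4 cycles */
    int y = x * 5;   /* use A: in-order execution stalls here */
    int z = *b;      /* load B: independent, could have been issued early */
    return y + z;
}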
So run-ahead doesn't work at that scale - the cost of checkpoint/restart is high enough that you can only reasonably do it at the scale of "big" events (ie a cache miss, not a hit).
Of course, in theory anything is possible. You could posit a really small L0 cache with single-cycle latency, turn my "L1 hit" example into an L0 miss that triggers runahead to do the pipelined L1->L0 fills, and then restart with the single-cycle A/B accesses - and thus you actually did end up pipelining the 3-4 cycles it takes to do the L1 lookup.
But I can pretty much guarantee that that will be way more expensive than any small-scale OoO that simply does the sane thing would ever be.
And cache hit pipelining matters.
Yes, cache misses are really really expensive, and you need to handle them specially, and OoO alone isn't even remotely sufficient for best performance (you also need prefetching, and truly speculative accesses to fill caches early). But most memory accesses - by far - are still going to hit in the cache, and how well you handle L1 and L2 hits matters a lot.
The people who think that it's only about the memory are wrong. It's very much also about the cache hit. Pipelining L1 accesses - and even more so L2 accesses - is a primary performance issue on many loads. You cannot afford to wait for one cache access to finish before starting the next one. Anybody who tells you "caches are cheap" isn't really aware of reality.
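(As a made-up illustration of why that matters, compare these two loops - the first is bound by serialized L1 hit latency no matter how wide the machine is, while the second lets a core that pipelines cache accesses keep several hits in flight:)

#include <stddef.h>

struct node { struct node *next; int val; };

/* Dependent loads: the address of each L1 access comes out of the
 * previous one, so the hits serialize - roughly one element per
 * L1 latency (3-4 cycles), no matter how wide the core is. */
int sum_list(const struct node *p)
{
    int s = 0;
    while (p) {
        s += p->val;
        p = p->next;
    }
    return s;
}

/* Independent loads: all addresses are computable up front, so a core
 * that pipelines L1 accesses keeps several hits in flight and can
 * approach one element per cycle. */
int sum_array(const int *a, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}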
> You can worry about granularity when the [non-running ahead] program counter sweeps
> past a granularity boundary (and you have to retire a bunch of instructions).
Ok, you and I are talking about different levels of runahead. You use the "two full threads" model, which I consider to be unacceptable for other reasons (doing everything twice). I don't believe that can ever compete realistically because it's basically wasting enormous amounts of resources that could be better utilized by running actual threads.
My runahead model is the simpler "one single thread that just turns into the runahead thread when it stalls in the non-speculative mode".
I think my single-thread model is the one that people consider the basic run-ahead model. The "waste thread resources" model is the one that the Sparc people tried, but that thing was so terminally broken that it's not even interesting, imho.
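(Roughly, that single-thread model is the following state machine - a deliberately toy C sketch with entirely hypothetical names, not any real design:)

#include <stdbool.h>

enum mode { NORMAL, RUNAHEAD };

struct core {
    enum mode mode;
    unsigned long pc;
    unsigned long checkpoint_pc;   /* stand-in for full architectural state */
};

/* Hypothetical hooks a simulator would flesh out. */
static bool issue_misses_in_cache(struct core *c) { (void)c; return false; }
static bool miss_data_returned(struct core *c)    { (void)c; return true; }
static void execute_one_insn(struct core *c, bool speculative)
{
    (void)speculative;
    c->pc += 4;
}

static void core_cycle(struct core *c)
{
    if (c->mode == NORMAL) {
        if (issue_misses_in_cache(c)) {   /* only a "big" event - a miss,
                                             not a hit - pays for this   */
            c->checkpoint_pc = c->pc;     /* checkpoint arch state       */
            c->mode = RUNAHEAD;
        } else {
            execute_one_insn(c, false);
        }
    } else {
        /* Speculative results are discarded; the only point is that the
         * loads issued here warm the caches / act as prefetches. */
        execute_one_insn(c, true);
        if (miss_data_returned(c)) {
            c->pc = c->checkpoint_pc;     /* restart: this cost is why it
                                             only pays off for misses    */
            c->mode = NORMAL;
        }
    }
}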
Anyway, I don't think run-ahead is seriously considered viable at that level.
Linus