By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), September 29, 2009 6:44 am
Room: Moderated Discussions
Anders Jensen on 9/29/09 wrote:
>
>Putting instructions in parallel however is probably
>cheaper in software and extremely expensive in hardware,
>sure you will miss out on the last 10% of optimization
>or so, but that is the healthy sign you should look for
>in complex optimization.
You're overcomplicating the thing.
You don't even need any complex parallel decoding (which
really is quite complex for something like x86): OoO can
be done (and has been done) on a much simpler scale.
For example, you might be much better off with a single
instruction per cycle decoder tied into an OoO core, than
with a two-instruction pairable "UV pipe" like Intel has
in the Pentium (and Atom?) that has some fairly strict
pairability rules.
Of course, I wouldn't actually expect Intel to ever do
anything like that. A more likely situation is the kind
of traditional Intel decoder, which does one complex
instruction per cycle, but can do several if they
are simple.
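To make the single-decoder-plus-OoO versus paired-in-order comparison a bit more concrete, here is a toy cycle-count model in Python. It is a sketch under made-up assumptions, not a model of the Pentium, Atom or any real core: the pairing rule (the second slot only takes ops flagged `simple` that don't read the first op's result), the latencies and the trace are all invented, and the trace is assumed to be already renamed so every register is written exactly once. The names `Insn`, `in_order_dual` and `out_of_order` are hypothetical.

```python
from dataclasses import dataclass

INF = float("inf")


@dataclass(frozen=True)
class Insn:
    dest: str            # destination register (trace is pre-renamed: each written once)
    srcs: tuple          # source registers; names never written are ready at cycle 0
    latency: int         # cycles until the result can be consumed
    simple: bool = True  # hypothetical pairing rule: only simple ops may take the second slot


def _ready_at(ready, produced, reg):
    """Cycle at which `reg` becomes readable (INF if its producer has not issued yet)."""
    return ready.get(reg, INF if reg in produced else 0)


def in_order_dual(trace):
    """In-order, two-wide issue with a toy pairing rule; returns the cycle the last result is ready."""
    produced = {i.dest for i in trace}
    ready, pc, cycle, finish = {}, 0, 0, 0
    while pc < len(trace):
        first = trace[pc]
        if all(_ready_at(ready, produced, s) <= cycle for s in first.srcs):
            ready[first.dest] = cycle + first.latency
            finish = max(finish, ready[first.dest])
            pc += 1
            if pc < len(trace):
                second = trace[pc]
                # Pair only if the second op is simple, does not read the first op's result, and is ready.
                if (second.simple and first.dest not in second.srcs
                        and all(_ready_at(ready, produced, s) <= cycle for s in second.srcs)):
                    ready[second.dest] = cycle + second.latency
                    finish = max(finish, ready[second.dest])
                    pc += 1
        cycle += 1  # a stalled first instruction blocks everything behind it
    return finish


def out_of_order(trace, window, width=1):
    """Issue up to `width` ready ops per cycle, chosen oldest-first from the oldest `window` unissued ops."""
    produced = {i.dest for i in trace}
    ready, pending, cycle, finish = {}, list(range(len(trace))), 0, 0
    while pending:
        issued = []
        for idx in pending[:window]:
            if len(issued) == width:
                break
            if all(_ready_at(ready, produced, s) <= cycle for s in trace[idx].srcs):
                issued.append(idx)
        for idx in issued:  # results become visible only after this cycle's selection
            ready[trace[idx].dest] = cycle + trace[idx].latency
            finish = max(finish, ready[trace[idx].dest])
        pending = [i for i in pending if i not in issued]
        cycle += 1
    return finish


# Made-up trace: a 4-cycle load, one consumer of it, then an unrelated dependency chain.
trace = [
    Insn("r1", ("a",), 4, simple=False),
    Insn("r2", ("r1",), 1),
    Insn("r3", ("b",), 1),
    Insn("r4", ("r3",), 1),
    Insn("r5", ("r4",), 1),
    Insn("r6", ("r5",), 1),
]

print(out_of_order(trace, window=1))  # in-order scalar: 9
print(in_order_dual(trace))           # in-order, two-wide with pairing rules: 8
print(out_of_order(trace, window=8))  # scalar issue, out of order: 6
```

On this deliberately chosen trace the scalar out-of-order model hides most of the load latency behind the unrelated chain, while the two-wide in-order model can't pair dependent ops and stalls behind the load; a different trace can easily change the ranking. Note that `window=1` degenerates into a plain in-order scalar pipeline, and growing `window` walks along the OoO spectrum toward bigger scheduling windows.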
And whatever you do at the front-end, you certainly don't
need to do any complex and "extremely expensive"
parallelism anywhere else. Tomasulo is neither very complex
nor extremely expensive.
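As a rough illustration of how little machinery the one-reservation-station-per-unit flavor of Tomasulo needs, here is a toy Python model. It assumes a made-up machine with one functional unit per opcode (`add`, `sub`, `mul`), invented latencies, in-order issue of one instruction per cycle, and a common data bus that carries one result per cycle; `tomasulo`, `Station` and `UNITS` are hypothetical names. The tags are simply the station names, and that tagging is what does the register renaming.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Hypothetical machine: one functional unit per opcode, each with ONE reservation station.
# opcode -> (latency in cycles, operation)
UNITS = {"add": (1, operator.add), "sub": (1, operator.sub), "mul": (4, operator.mul)}


@dataclass
class Station:
    busy: bool = False
    dest: str = ""
    vj: Optional[int] = None   # operand values, once known
    vk: Optional[int] = None
    qj: Optional[str] = None   # tag of the station still producing that operand
    qk: Optional[str] = None
    left: int = 0              # execution cycles remaining
    result: Optional[int] = None


def tomasulo(program, regs):
    """Run (op, dest, src1, src2) tuples; return final registers and a write-back log."""
    regs = dict(regs)
    stations = {name: Station() for name in UNITS}
    status = {}  # register -> tag (station name) of its newest pending writer
    pc, cycle, log = 0, 0, []

    def operand(src):
        # Either the value is available now, or we record which tag to wait for.
        return (None, status[src]) if src in status else (regs[src], None)

    while pc < len(program) or any(s.busy for s in stations.values()):
        cycle += 1
        # 1. Write-back: at most one finished result goes on the common data bus per cycle.
        done = [name for name, s in stations.items() if s.busy and s.result is not None]
        if done:
            tag = done[0]
            st = stations[tag]
            for other in stations.values():  # every waiting station snoops the bus
                if other.qj == tag:
                    other.vj, other.qj = st.result, None
                if other.qk == tag:
                    other.vk, other.qk = st.result, None
            if status.get(st.dest) == tag:  # only the newest writer updates the register
                regs[st.dest] = st.result
                del status[st.dest]
            log.append((cycle, st.dest, st.result))
            stations[tag] = Station()  # free the station
        # 2. Execute: a station with both operands present counts down its latency.
        for name, st in stations.items():
            if st.busy and st.result is None and st.qj is None and st.qk is None:
                st.left -= 1
                if st.left == 0:
                    st.result = UNITS[name][1](st.vj, st.vk)
        # 3. Issue: strictly in program order; stall if the unit's only station is busy.
        if pc < len(program):
            op, dest, src1, src2 = program[pc]
            if not stations[op].busy:
                st = Station(busy=True, dest=dest, left=UNITS[op][0])
                st.vj, st.qj = operand(src1)  # read sources before renaming dest
                st.vk, st.qk = operand(src2)
                status[dest] = op  # later readers of dest wait on this tag
                stations[op] = st
                pc += 1
    return regs, log


program = [
    ("mul", "r1", "r2", "r3"),  # long latency
    ("add", "r4", "r1", "r2"),  # waits for the mul's tag
    ("sub", "r5", "r3", "r2"),  # independent, finishes first
    ("sub", "r2", "r5", "r3"),  # overwrites r2, which the two ops above already captured
]
final, log = tomasulo(program, {"r1": 0, "r2": 3, "r3": 5, "r4": 0, "r5": 0})
print(final)  # {'r1': 15, 'r2': -3, 'r3': 5, 'r4': 18, 'r5': 2}
print(log)    # [(5, 'r5', 2), (6, 'r1', 15), (7, 'r4', 18), (8, 'r2', -3)]
```

In this run the `sub` writing `r5` completes before the earlier `mul`, and the last `sub` can issue and claim `r2` while the `add` that reads `r2` is still waiting on the `mul`, because the `add` captured the old `r2` value at issue. The cost of having only one station per unit also shows up: the second `sub` has to wait at issue until the first one frees its station.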
>Still I'm not sure it is worth doing Tomasulo just to get
>this last part. Doing multipass pipelining will probably
>get you some gains that runahead leaves behind in this
>respect, but it will definitely cost you some
>throughput/watt.
I doubt that a superscalar highly pipelined approach is
all that much simpler than Tomasulo with just a
reservation station per unit. I also think your "corner
cases" are oddly chosen:
>Complex problems never have simple solutions and I have yet
>to find a complex question where the best solution is a
>corner case. For CPU uarch typical corner cases would be
>in-order and Out-of-order.
That's just total bullsh*t.
The corner cases aren't "in-order" vs "out-of-order" at all.
There are a lot of details you are skipping, and the
complexity of in-order easily overlaps with the complexity
of out-of-order when you start looking at those details.
What you trivially just call in-order is a whole spectrum
of possible complexities, with ranges of pipeline depth
(and the inevitable forwarding) and super-scalar. Add to
that blocking vs nonblocking cache accesses etc etc.
Similarly, what you just dismiss as the "corner case" of
being OoO is not a corner case at all, but another whole
spectrum of implementations, ranging from some fairly
simple Tomasulo with just one reservation station per unit to having
some rather extreme instruction window depths of tens (or
hundreds) of instructions etc.
And then you have the whole "speculative execution" which
you can do in both cases, and the question is just how
far you push it (in OoO you can push it much further - but
you don't have to).
So while I agree with you that the solution is never the
extremes ("corner cases") I fundamentally think you then
totally went off the deep end by saying that "in-order" vs
"out-of-order" are some kind of corner cases. You can't
make that kind of insane simplification.
And the thing is, in-order hits a huge complexity and
performance wall. I can't recall anybody ever having done
more than two-way superscalar, even on things that are
much easier to decode than x86. There's just not enough of
an upside to the complexity (you'd do VLIW to avoid the
complexity, but that has its own downsides, both in future
designs and in I$ costs).
I also suspect that things like SMT (which Atom supports)
are a whole lot more natural in an OoO environment. The
pipeline just isn't as rigid. So you can't just compare
some unnamed in-order implementation with some other OoO
one and say that the in-order one is simpler - you have to
state what the performance requirements are.
Is a single-scalar in-order CPU without HT much simpler than
even the simplest OoO core? Oh, sure. Nobody will claim
otherwise. But try to make it perform better, and you'll
start seeing huge complexities - to the point where you
simply have to either say "no more performance", or you'd
create a monster that is much more complex than the
equivalently performing out-of-order implementation.
See? Nobody makes those insane in-order ones, because at
some point it just becomes much simpler to do out-of-order,
and get better performance much more naturally than by
trying to push the in-order thing.
Sure, you can push in-order. You can try to push it with
run-ahead threads, you can do SMT, you can do a lot of
those things. But in the end, at some point you'll either
find that your chip is more complex than a simple OoO
implementation would have been at equivalent performance,
or you'll just say "I'll stop here, and suck".
Calling out-of-order some "corner case" is ludicrous.
Linus