Article: PhysX87: Software Deficiency
By: David Kanter (dkanter.delete@this.realworldtech.com), July 19, 2010 11:49 am
Room: Moderated Discussions
hobold (hobold@vectorizer.org) on 7/19/10 wrote:
---------------------------
>anon (anon@anon.com) on 7/19/10 wrote:
>---------------------------
>[...]
>>I don't see how an OOOE engine can possibly "trip over itself" simply due to being
>>OOOE.
>
>In that strictness, your statement is probably correct. I guess the more fundamental
>issue behind the sentiment - that complex OoOE engines tend to make optimizing for
>theoretical peak throughput harder - lies in the more general idea that those machines
>are optimized for the "average case".
>
>Unfortunately, that average case is never even remotely close to maximum theoretical
>throughput. As a result, some of those complex OoOE machines cannot actually achieve
>peak theoretical throughput in practice, because a queue here or a buffer there
>does not have the capacity to keep up with the amount of in-flight instructions
>in that most extreme case. (An example of this are the loop epicycles in PPro and
>Netburst, where successive iterations of simple loops go through a deterministic
>but weird sequence of varying timing, depending on the current watermarks in the various queues and buffers.)
>
>I personally think this is one of the very few failures of the quantitative approach
>to processor design. On the face of it, optimizing these queues and those buffers
>for smaller size will in fact save a bit of the silicon and power budgets. But a
>consequence is that "no one achieves peak throughput anyway" becomes a self fulfilling prophecy.
>
>In my not so humble opinion, the fact that such highly tuned code makes up only
>a tiny percentage of all existing software does not imply an only tiny importance of such codes.
Obviously not. But I think the issue is that pretty much everyone in the mobile and PC space (including phones) has opted to spend transistors, power and design effort on improving the average case. For very specialized applications (e.g. game consoles), that trade-off may not make sense.
But philosophically speaking, I think it makes a ton of sense to try and improve the average case, as long as you are not making things horrible for skilled users.
While only 1% of code may be tuned, it is probably widely replicated across systems; GPU drivers, for example, are quite tuned and very common. The tricky part is that, unfortunately, most PCs are bought to run non-optimal code, so striking the balance between the two is hard.
>As Ian Ollmann mentioned, the real reason why "perfectly" tuned code is so rare
>is that it tends to lose its optimality with every small change of the hardware.
Which is probably why this only happens for consoles and deeply embedded applications.
>This would not be the case if the hardware could more reliably be persuaded to run
>at peak throughput. If tuned code didn't break so easily, there would be a larger
>gamut of it at any given time, and there would be more business sense in taking the time to tune and maintain it.
>
>As it stands now, the self-fulfilling prophecy takes care that only code poets
>and desperados ever try to reach "light speed". From personal experience, I can
>tell you that it is very frustrating to see your code slow down on new hardware,
>considering the fact that you lovingly tuned it over the course of quite some time.
I agree this is an issue, and it's one of the areas that I think CPUs need to step up a bit. In an ideal world, the CPU would only require optimizations that are forward compatible (e.g. use AVX or SSE, not x87), rather than one-time hacks. One of the nice things about the architecture of TMTA's chips (or a GPU) is that there is enough flexibility in the interface that the driver/JIT can be the one responsible for microarchitectural tweaks. This leaves the programmer/compiler free to focus on more generally useful optimizations.
David