Article: PhysX87: Software Deficiency
By: Vincent Diepeveen (diep.delete@this.xs4all.nl), July 19, 2010 3:36 pm
Room: Moderated Discussions
David Kanter (dkanter@realworldtech.com) on 7/7/10 wrote:
---------------------------
>John Mann (xman52373@aol.com) on 7/7/10 wrote:
>---------------------------
>>While your article has merit, there are a few things you have overlooked. There is a water
>>simulation out using a 980X i7 CPU where all 12 cores (6 real and 6 virtual) are pegged
>>at 100% and it still gets trounced by a 9600GT running the same simulation. A performance
>>difference of around 10x if I recall correctly. And the code run on the i7 was changed
>>to run SSE code paths so the boost could be seen, but the GPU is and will always
>>be magnitudes faster than the CPU regardless of the code paths.
>
>Magnitudes faster is an extreme exaggeration. The most powerful consumer GPU on
>the market (from NV) is ~1.3TFLOP/s and 177GB/s memory bandwidth. You can get a
>Westmere derivative that's 160GFLOP/s and 32GB/s memory bandwidth.
As for the GPUs, you're a bit off. What Nvidia and AMD claim is not so relevant; let's look at serious applications. In a public post, some Chinese researchers who do massive GPU calculations (so we speak of SERIOUS software, optimized for both Nvidia and AMD GPUs) report the following. Realize this is for SINGLE precision calculations, ideal for GPUs:
AMD: 50% efficiency (of claimed peak)
Nvidia: 25% efficiency (of claimed peak)
This is in line with other researchers at N*SA-type organisations, who when doing their own programming report about 25-30% efficiency; and that's efficiency in raw calculations, not counting multiply-add (as most of that software isn't doing multiply-add, otherwise it would run even slower of course).
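The arithmetic behind these efficiency figures is simple enough to write out. A minimal sketch, assuming the claimed single precision peaks discussed later in this post (roughly 2 Tflop for the Fermi Tesla, 4.64 Tflop for the Radeon HD 5970) together with the efficiency fractions above:

```python
def effective_gflops(claimed_peak_gflops, efficiency):
    """What you actually sustain: the claimed peak scaled by efficiency."""
    return claimed_peak_gflops * efficiency

# Nvidia Fermi Tesla: ~2061 GFLOP/s claimed SP peak at ~25% efficiency
print(effective_gflops(2060.8, 0.25))  # ~515 GFLOP/s sustained

# AMD Radeon HD 5970: 4640 GFLOP/s claimed SP peak at ~50% efficiency
print(effective_gflops(4640.0, 0.50))  # 2320 GFLOP/s sustained
```

Note how the Nvidia figure at 25% lands right around the 500+ Gflop effective number mentioned below.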
Now let's look at theoretical performance.
For the future FERMI we of course hope the cache and the double precision claims are true, making it a fast chip; in single precision, of course, it's not so fast. I would argue that if it is fast in double precision, then forget about single precision.
For single precision calculations AMD is of course the superior platform. Cheap, and 1600 streamcores.
1600 streamcores @ 950 MHz is of course always going to beat 448 streamcores @ 1.15 GHz in terms of raw single precision performance.
So I do not really know where your claim comes from that Nvidia on paper has the bigger single precision floating point capacity in the chip.
Nvidia (realize this GPU is not yet released, AFAIK):
448 cores * 4 single precision flops per core per cycle * 1.15 GHz
== ~2 Tflop single precision.
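That peak figure follows from the standard formula: cores times flops per core per cycle times clock. A quick sketch; the 4 SP flops per core per cycle is the assumption used in this post, not a datasheet figure:

```python
def peak_sp_gflops(cores, sp_flops_per_core_per_cycle, clock_ghz):
    # GHz means 1e9 cycles/s, so the product comes out directly in GFLOP/s
    return cores * sp_flops_per_core_per_cycle * clock_ghz

print(peak_sp_gflops(448, 4, 1.15))  # ~2061 GFLOP/s, i.e. roughly 2 Tflop
```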
Note that according to Nvidia's own claims the RAM bandwidth is 144 GB/s, not 177 GB/s.
Now the claim is 515 Gflops double precision. I don't really understand how they plan to achieve that, but we'll soon see test results when the thing is there, as universities usually buy one of the latest Nvidia GPUs.
Of course that thing will be too expensive for a guy like me to buy.
Yet realize that 25% to 30% efficiency effectively means a single such Tesla from Nvidia can throw 500+ Gflop of single precision into battle for me. That's a lot.
Now if we switch to a cheap card from AMD with 3200 streamcores, that's for example the Radeon HD 5970.
It has a price of nearly 1000 euro. Too much for my wallet, but well.
Yet it's a 'tad' faster than a similarly priced Nvidia.
3200 streamcores * 0.725 GHz: that's a lot more than whatever Nvidia currently has on the shelves.
Note their own claim is 4.64 Tflop, though I don't really see how they get to that number.
http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5970/Pages/ati-radeon-hd-5970-specifications.aspx
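For what it's worth, the 4.64 Tflop figure does fall out if one assumes the spec-sheet engine clock of 725 MHz and counts a multiply-add as 2 flops per streamcore per cycle. A sketch under those two assumptions:

```python
# Radeon HD 5970 peak, assuming 725 MHz and multiply-add counted as 2 flops
streamcores = 3200
flops_per_cycle = 2      # one multiply-add per streamcore per cycle = 2 flops
clock_ghz = 0.725
peak_gflops = streamcores * flops_per_cycle * clock_ghz
print(peak_gflops)       # 4640.0 GFLOP/s, i.e. the claimed 4.64 Tflop
```

Without counting multiply-add (3200 * 0.725) you get only half that, which may be where the confusion comes from.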
Also we can effectively get roughly 50% out of it, which is of course the thing that really kicks butt.
Obviously we are speaking about the Tflops you get out of it in single precision; there AMD beats Nvidia by at least a factor of 2 in price and at least a factor of 2 in performance.
For single precision, again.
>
>From a theoretical stand point, that gives a GPU 8X more compute cycles and 5.5X more bandwidth.
Forget the bandwidth; whatever needs that bandwidth also needs huge RAM and/or streaming to hard drives. That's reality.
Your 8x compute cycles is not far off from a practical viewpoint, though you used the wrong numbers.
>If a reasonable individual saw a 9600GT trouncing a 6-core Westmere, they would
>probably have some rather pointed questions about the nature of the workloads, whether
>it was utilizing SIMD or multiple cores appropriately...or generally what the differences were.
These are two totally different sciences. If Fermi delivers what Nvidia promises, that would revolutionize things.
Right now only a few individuals and organisations use GPUs for single precision workloads. Double precision is far away, yet so close by.
AMD is totally toasting Nvidia in the single precision domain.
Nvidia has a chance to take over half of HPC if their GPU really delivers 515 Gflops double precision for matrix calculations.
If I remember well, this Cell2 chippie delivers roughly 150 Gflops (double precision) per node at $2k or so, and that's for 2 chips.
If a GPU now effectively delivers nearly as much as a POWER6, that would kick butt of course, as pricewise nothing can compete with that.
Now I hope that Fermi is priced cheaper than $2k.
Actually it would be wise of Nvidia not to price it too high, as they can then take over a large part of HPC.
This has the potential to revolutionize HPC in the double precision area.
Again, in single precision forget it; Nvidia systematically loses to AMD-ATI there, because Nvidia's GPUs have just 1/4th the core count of an AMD. This is in fact a core-count battle.
725 MHz versus 1.15 GHz is not really a big deal on a terrain where it's all about core counts.
In HPC the huge difference is ECC, of course.
AMD-ATI does not qualify there with its GPUs.
Yet all this hardware, whether Magny-Cours or Westmere with complicated SIMD instructions and a bunch of cores, or the complete vectorisation needed for GPUs, also requires much better software engineering and mathematical models that are bug-free. That's going to be the hard part, as no one really wants to pay for software, it seems.
Vincent