By: Vincent Diepeveen (diep.delete@this.xs4all.nl), April 20, 2011 1:12 pm
Room: Moderated Discussions
Heikki Kultala (hkultala@iki.NOSPAM.fi) on 4/20/11 wrote:
---------------------------
>Vincent Diepeveen (diep@xs4all.nl) on 4/20/11 wrote:
>---------------------------
>>Heikki Kultala (hkultala@iki.NOSPAM.fi) on 4/20/11 wrote:
>>---------------------------
>>>>It's simple. It's 3072 cores @ 0.88Ghz versus nvidia 448 cores @ 1.2Ghz. 3072 cores
>>>>always win by factor 3-4 then or so.
>>>>
>>>>No discussions there.
>>>
>>>Wrong.
>>>
>>>It's 24 ATI cores per chip versus 28-32 nvidia cores per chip.
>>>
>>>And, it's 384 ATI SPMD lanes per chip versus 448-512 nvidia SPMD lanes per chip.
>>>
>>>VLIW ALU != core
>>
>>I wrote it in popular language, but that doesn't stop idiots like you.
>
>Says how things technically really are makes me an idiot?
Because, first of all, you state it wrong, and deliberately so.
It has 2 GPUs and therefore 48 COMPUTE UNITS, not 24.
And not "cores": remember that AMD renamed everything yet again over the winter, so SIMDs no longer exist either ;)
Secondly, you want to refer to some cut-down, overclocked gamer card from Nvidia, but let's look at their top-end GPGPU card, as we're discussing GPGPU here, not gaming.
For gamers there are dozens of other sites that do a better job and already benchmark every game. The interesting material here is the technical description David Kanter gave of Cayman, which is 100x more than AMD provides themselves (the 6900 architecture manual is basically a joke of a manual: the Evergreen manual cut-and-pasted with some words like 'the' replaced by 'a', not much more), and the raw processing power you can get out of it. They designed it so that 2 GPUs fit on a single card, which Fermi cannot do.
So that 520-euro Radeon 6990 card with 3072 PEs, as they are officially called nowadays, we compare against the 448 PEs of the 448-PE Quadro 6000 or Tesla, as you like. Not against the 512-PE overclocked gamer cards, which in the first place are cut down and in the second place are not intended for GPGPU.
>>It's 3072 PE's versus 448 PE's.
>
>Those are unit counts coming from marketing.
>Those are FP ALU counts. How to feed data to those matters.
>
>There are only 384 SPMD lanes per Cayman chip,
>so one chip can execute only 384 VLIW instructions per cycle (if the program counters
>are correctly aligned, if not, the worst case is 24 / chip)
There are 2 chips, and you know this very well. You have to compare 48 compute units and 3072 PEs with Nvidia's 448.
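To make the unit accounting explicit, here is a small sketch (Python, with the counts as given in this thread) showing that Heikki's 384 lanes per chip and my 3072 PEs per card are the same silicon counted at different granularity:

```python
# Sketch of where the disputed unit counts come from (numbers per the
# article and this thread; "PE" = processing element, AMD's current name).
# Cayman: each compute unit (SIMD) has 16 SPMD lanes, each lane a 4-wide VLIW ALU.
cayman_compute_units = 24      # per chip
lanes_per_compute_unit = 16
vliw_width = 4
chips_on_6990 = 2

spmd_lanes_per_chip = cayman_compute_units * lanes_per_compute_unit   # 384
pes_per_chip = spmd_lanes_per_chip * vliw_width                       # 1536
pes_on_6990 = pes_per_chip * chips_on_6990                            # 3072

# Fermi-based Tesla: 14 SMs x 32 scalar lanes = 448 PEs, one chip per card.
tesla_pes = 14 * 32

print(spmd_lanes_per_chip, pes_on_6990, tesla_pes)  # 384 3072 448
```

So whether one chip looks like 24 units, 384 lanes, or 1536 PEs is purely a question of which level you count at; the card as a whole has 3072 PEs either way.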
>But, those VLIW operations can actually include 5, not 4 operations total. (but
>the one has to be branch, only 4 FP operations).
>What makes this more complicated is how memory operations are handled.
>If memory operation is being handled, no ALU operations can execute at same time
>on ATI. I'm not sure how this goes on nvidia, might be similar.
Actually, the description David gave of how the RAM functions is wrong, but as I'm not a memory expert I won't comment on it too much. I heard someone call it a "piece of crap description"; of course it can already load data while the ALUs work.
Yet if you want to get a lot of flops out of these GPUs you don't want to touch device RAM much at all, so the prime-number code I write here doesn't use device RAM at all. Now, there are a few bugs still left in OpenCL, and the release cycle is a tad slow considering the huge number of bugs they have to fix each time (on the 5970, for example, you can only address 1 of the 2 GPUs with OpenCL, which is a tad weird for your ex-flagship GPU, isn't it?).
This isn't about which manufacturer's support is worse; in Nvidia's case you don't even know which instructions the chip has, as there is no public manual describing them. For the 4000/5000 series at least, you have that manual from ATI (when AMD took over, the middle ages started).
>>What's going to be FASTER in well designed gpgpu codes?
>>
>>0.83Ghz * 3072 PE's is *always* going to annihilate in well designed codes a meager 448 PE's @ 1.2Ghz Tesla.
>
>Did I say they would not? No.
But you try to shave a factor of 2 off the performance Cayman delivers, as they come 2 GPUs to a card for 500-and-a-bit euros.
Nvidia's GPGPU card I don't even see how to order here, so that means importing from the USA and paying some 20% import tax on top of $1200+.
So price-wise the 6990 delivers a factor of 5-10x the performance of the Nvidia.
>I just told you are counting your cores incorrectly. And telling you are wrong seems to make me an idiot.
You're comparing in the wrong manner.
You must compare the PE counts with each other, as that is the effective speed both GPUs will deliver you if you write great code for them.
If your argument is "but my code isn't optimal", that's not relevant in this discussion; see what I wrote.
You start by shaving off a factor of 2 and inflating Nvidia's core count, whereas those additional Nvidia cores simply aren't there in Tesla: it has 448 PEs and not a single processing element more.
>But if the code does not have any ILP, only parallelism between work items, then
>3/4 of those ATI ALU's are idling. This practically means badly optimized code.
We know from well-optimized codes that the better GPU coders get a higher percentage out of AMD than out of pre-Tesla Nvidia cards. Tesla is a big, big improvement there, yet even if you manage to get a slightly higher percentage out of it thanks to CUDA being a better low-level language than OpenCL (think of adding carries fast in CUDA; try that in OpenCL), that is still peanuts. If you manage to get 700 GFLOPS out of that Tesla, counted theoretically (a multiply-add as 2 flops), then you're a hero.
Whereas getting a practical 2 TFLOPS out of the 6990 is not even close to hero status.
In that case, price-wise the 6990 is over 5x cheaper per GFLOPS, even with the additional power consumption counted in.
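A rough price-per-GFLOPS sketch with the figures above (the prices are assumptions from this thread, not list prices: 520 euro for the 6990, roughly 1000 euro for the Tesla after import tax; practical throughputs of 2 TFLOPS and 700 GFLOPS respectively):

```python
# Price/performance sketch using the practical figures quoted in this post.
# Prices and sustained-throughput numbers are this thread's assumptions.
radeon_6990 = {"price_eur": 520.0, "practical_gflops": 2000.0}
tesla = {"price_eur": 1000.0, "practical_gflops": 700.0}

def eur_per_gflops(card):
    """Euros paid per sustained GFLOPS."""
    return card["price_eur"] / card["practical_gflops"]

ratio = eur_per_gflops(tesla) / eur_per_gflops(radeon_6990)
print(f"6990:  {eur_per_gflops(radeon_6990):.2f} EUR/GFLOPS")
print(f"Tesla: {eur_per_gflops(tesla):.2f} EUR/GFLOPS")
print(f"Tesla costs {ratio:.1f}x more per practical GFLOPS")  # ~5.5x
```

Under these assumptions the Tesla comes out roughly 5.5x more expensive per sustained GFLOPS, which is where the "over 5x cheaper" claim comes from.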
>We cannot ignore the actual SPMD lane count. Good way is to calculate the SPMD
>lanes, and THEN in addition calculate that ATI can do 4-way FP VLIW on one SPMD
>lane, nvidia only single floating point operation.
Most people here don't realize how many GFLOPS the professionals get out of GPGPU.
These GPUs CAN be very fast if you know how to program them, and they certainly CANNOT provide a solution for everything.
Things have shifted towards having a good coder, yet I'm not seeing companies/organisations pay for that yet; a manager, at least here, *always* has to earn more than a world-class GPU coder.
Regards,
Vincent