Article: Predicting AMD and Nvidia GPU Performance
By: Vincent Diepeveen, April 20, 2011 2:02 am
Room: Moderated Discussions
EduardoS on 4/17/11 wrote:
>Vincent Diepeveen on 4/16/11 wrote:
>>Besides that the new 6000 series has 2.5x more multiplication resources than the
>>5000 series for the very crucial 32 x 32 bits.
>No... Barts still only does 32 x 32 bits multiplication on the T unit, like Cypress,
>Cayman takes its four units for a single 32 x 32 bits, no 2.5x improvement here.

Person A says x, person B says y.
Let's rely more on NSA-type guys than on a dude like you
until benchmarking clearly proves what's the case.

I'll have to benchmark a lot anyway to see which type of multiplication is fastest.

>>Which for gpgpu matters really a lot.
>No it doesn't, 32 bits multiplication is pretty rare and hardware designers don't give much importance to it.

It was rare because the hardware was too slow at it.

The gpu's were slow at 32 bits multiplication; until recently you could basically only work with 24 bits precision, which is too little.

Yet OpenCL requires 32 bits, so count on it that it will get used a lot starting with Fermi and the 6000 series of AMD.
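To show why 24 bits hardware made 32 bits multiplication slow, here is a minimal sketch of how the low 32 bits of a 32 x 32 product can be assembled from narrower multiplies. The name mul24 is a hypothetical stand-in for a 24 bits multiplier, not a real hardware or library call, and splitting into 16 bits halves is just one possible scheme.

```python
MASK32 = 0xFFFFFFFF
MASK24 = 0xFFFFFF

def mul24(a, b):
    """Hypothetical stand-in for a 24-bit multiplier: operands must fit
    in 24 bits, and the low 32 bits of the product are returned."""
    assert 0 <= a <= MASK24 and 0 <= b <= MASK24
    return (a * b) & MASK32

def mul32_lo(a, b):
    """Low 32 bits of a 32 x 32 product, using only mul24 on 16-bit halves."""
    al, ah = a & 0xFFFF, (a >> 16) & 0xFFFF
    bl, bh = b & 0xFFFF, (b >> 16) & 0xFFFF
    low = mul24(al, bl)                             # contributes bits 0..31
    cross = (mul24(ah, bl) + mul24(al, bh)) & 0xFFFF  # only 16 bits survive the shift
    # ah * bh lands entirely above bit 31, so it is not needed for the low word
    return (low + (cross << 16)) & MASK32
```

One 32 bits multiply costs three narrow multiplies plus shifts, masks and adds this way, which is the overhead a native 32 x 32 unit removes.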

It's like telling someone: "we have a formula 1 engine for you, but it's tough to put it in a car". Give it some time and everyone will use it.

>>Let me quote a few gpu programmers here.

>Let's skip this part, just bullshit.

So far if i scroll through what you post over here, it's pathetic nonsense.

>>Nvidia will do great in games and will remain doing great in games.
>Let's see... GTX590 vs HD6990... At the same TDP, same performance and manufacturing
>tech, nVidia needs 35% more die size... Not exactly "doing great". And Cayman doesn't have that good performance/area...

It's simple. It's 3072 cores @ 0.88GHz versus nvidia's 448 cores @ 1.2GHz. The 3072 cores then always win by a factor of 3-4 or so.

No discussions there.

These aren't graphics benchmarks where higher clocked cores and big bandwidth to RAM matter; this is software customized for that hardware to do gpgpu. The speed of your execution units determines things then, and there is *no discussion there*.

For customized codes the 3072 cores will always be faster than Nvidia's Teslas by a factor of 3+, and of course it will stay like that.
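As a back-of-envelope check on the figures quoted above, the sketch below multiplies core count by clock. It assumes one simple operation per core per cycle on both chips, which is an assumption and not a measured number; issue restrictions and memory effects are ignored.

```python
def peak_gops(cores, clock_ghz, ops_per_core_per_cycle=1):
    """Idealized peak in giga-operations per second.
    ops_per_core_per_cycle=1 is an assumption, not a vendor figure."""
    return cores * clock_ghz * ops_per_core_per_cycle

amd_peak = peak_gops(3072, 0.88)     # core count and clock as quoted in this post
nvidia_peak = peak_gops(448, 1.2)    # core count and clock as quoted in this post
ratio = amd_peak / nvidia_peak

print(f"AMD {amd_peak:.1f} Gops, Nvidia {nvidia_peak:.1f} Gops, ratio {ratio:.2f}")
```

The raw ratio is an upper bound; how much of it survives in practice depends on how well the code keeps all units busy.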

If you make the effort you can achieve quite good performance. It's just that the majority doing gpgpu programming in public are complete idiot programmers, as they're students who still have to learn programming. So far they only got the easy stuff to work.

Add to that, that the gpu's have no good manuals. Take the AMD 6900 series manual. It's just a cut'n paste of the Evergreen manual, with 2 letters changed here and there.

The diagram still shows it having 5 units instead of 4; only at 1 spot in the text did they change '5 units' to '4 units'. Leaving the rest is nonsense, as there are no 5-unit groups in the 6900 series at all.

It doesn't describe the 6900 accurately at all. It's an Indian guy who by our standards only has grammar school education, besides speaking English ok, who modified things a bit. A 'the' got removed here and there, but technically 0% has been changed. Only some diagrams he apparently didn't understand were removed. These guys make $150 a month in India doing work like this. One of them (not working for AMD though) complained loudly to me about his salary.

Nowhere in the entire manual does it mention crucial hardware issues you have to take into account as a programmer. For example, that a + b = c takes 4 cycles to complete, and therefore also 4 cycles before c is available to the next instruction. You'll find this nowhere at AMD nor Nvidia.

Only Volkov mentions it.

This complaint holds for both Nvidia and AMD GPU's.

Now Nvidia has CUDA and AMD has OpenCL (soon Nvidia might have a working opencl as well - who knows); we know cuda is very efficient for the nvidia chip, opencl still has to prove itself for amd there.

If you google for Volkov and 'gpgpu' you learn more than from that entire manual.

This is the current bottleneck for students at gpu's. Instead of spamming the internet and asking around, they rely solely on that manual.

>>He's comparing Fermi, the first generation Nvidia that has 32 x 32 bits multiplication
>>(it was 24 bits multiplication before) versus old generation AMD that did have 32
>>bits multiplication, but only what was it 1 unit out of 5.
>If I remember correctly Fermi does 32 bits multiplication at half speed... I told you, nobody cares.

It's indeed a throughput of one per 2 cycles at Nvidia.

It's very important for gpgpu. You're underestimating 32 bits multiplications.

Though it's true that too many scientists care too much about double precision calculations; more on that later.

Most scientists might not be too stupid in their own domain, yet they are very naive when it comes to figuring out how to calculate their results efficiently.

Only if there happens to be a library that does things very efficiently can they usually use it. The problem is that each domain requires a different transform, method or algorithm to get it done fast.

So you have to wait for that single guy in that area who manages to write efficient software that works. The entire community then uses this code or adapts it for its own purposes (regrettably can't use 'her' much yet).

As gpgpu is such a specialized form of hardware, it takes really a lot of programming qualities to get things done at it. The number of years it takes for gpgpu to become a success was guessed a bit optimistically by most people.

This is why i'm programming some stuff myself at gpu's now. It just isn't there.

>>Todays generation, just like nvidia, has been more optimized for that.
>In nVidia case, yes, in AMD case, no, in nVidia case it's more like a side effect.

With great gpu's soon integrated in cpu's, i doubt that gpgpu will be a side effect. The NCSA doesn't think so either. They thought it would already dominate HPC by 2010, so much so that their own box would be a manycore processor based solution.

Yet their own rules stopped them there; NCSA adapts a tad slowly there if you ask me.

The focus of most HPC organisations has been on building 1 giant box; i'd argue it's better to build a few machines, so that scientists can profit from whatever suits them best.

>>How long does AMD need there?

>12 cycles... And nVidia takes well more than 4 cycles too,
>I don't know exactly, something between 20 and 30 cycles.

You have really no clue about gpu's otherwise you would've understood exactly what i meant.

For simple instructions it's 4 cycles at both Nvidia and AMD GPU's before your result is available.

Let me explain the difference.

An instruction such as a trivial 'add' can take 4 cycles, so you don't have your result until the 5th cycle. That doesn't matter if you schedule enough of those instructions, as you then get the full 2.6 Tflops out of the 6990, or the full 750 Gflops that the Teslas can deliver.

What really influences things are 2 important factors:
a) the gpu cannot issue such an instruction every cycle
b) only 1 out of every X units can execute such an instruction

It's about A and B here.

Obviously to get the full amount of 'gflops' (integer ops in my case of course) out of the gpu, you want to know how many instructions need to be scheduled independently; this can of course be done in 2 ways, but you can google for Volkov there for a very good explanation.

As for multiplications, this is really relevant in gpgpu.

In the end most number crunching has to do with matrices of some sort. Very big numbers you can also see as a 1 dimensional matrix, and images are matrices by definition.

So multiplications play an important part there.
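A minimal sketch of the 'big number as a 1 dimensional matrix' view: the number is stored as an array of limbs and multiplied schoolbook style. The 16 bits limb size and the helper names are choices made for this illustration only. Python integers do not overflow, so the accumulators here are wider than what a gpu register would hold.

```python
LIMB_BITS = 16
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(n):
    """Split a non-negative integer into 16-bit limbs, least significant first."""
    limbs = []
    while n:
        limbs.append(n & LIMB_MASK)
        n >>= LIMB_BITS
    return limbs or [0]

def from_limbs(limbs):
    """Rebuild the integer from its limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def mul_limbs(a, b):
    """Schoolbook product of two limb arrays: a convolution of the limbs,
    followed by one carry propagation pass."""
    acc = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            acc[i + j] += x * y           # each term is a 16 x 16 -> 32 bit product
    carry = 0
    for k in range(len(acc)):
        total = acc[k] + carry
        acc[k] = total & LIMB_MASK
        carry = total >> LIMB_BITS
    return acc
```

Every inner step is a small integer multiplication, which is why the speed of the integer multiplier matters so much for this kind of code.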

>>I don't know, but for what i'm programming in opencl right now it's pretty crucial to know.
>If all you care is a 3-way add you should stay away from OpenCL...

I'd be the last on the planet calling OpenCL efficient; it has to prove itself there. Compiler quality plays a major factor there; maybe a chance for intel to catch up a bit there?

Also i am not so sure how to do add with carry efficiently. Not sure AMD has a carry at all there that you can address yourself.
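If no carry flag can be addressed, the usual fallback is to recover the carry with a comparison: after a wrapping add, the sum is smaller than an operand exactly when it overflowed. A minimal sketch with 32 bits limbs follows; the function name is made up for illustration, and whether this is the fastest way on a given gpu is something only benchmarking can tell.

```python
MASK32 = 0xFFFFFFFF

def add_limbs(a, b):
    """Add two equal-length arrays of 32-bit limbs (least significant first)
    without a carry flag. Returns (sum_limbs, carry_out)."""
    out = []
    carry = 0
    for x, y in zip(a, b):
        s = (x + y) & MASK32          # wrapping 32-bit add
        c1 = 1 if s < x else 0        # wrapped around: carry out of x + y
        t = (s + carry) & MASK32      # add the incoming carry
        c2 = 1 if t < s else 0        # wrapped around again
        out.append(t)
        carry = c1 | c2               # both can never be set at the same time
    return out, carry
```

That is two compares and an extra add per limb, compared to a single add-with-carry instruction where the hardware exposes one.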

>Few who write OpenCL code know exactly how long each operation takes, it doesn't
>matter, it matters less than on CPUs and still, many CPU programmers can write good
>code while having no idea how long each operation takes.
>>Here is another thing we don't know about gpgpu. How *reliable* are results calculated by gpgpu?
>If your code relies on 1+1=2 then it's very reliable.

You're missing the point. Nvidia clocks their gpgpu gpu's at 1.2GHz with just 448 units; the 512 core gpu's they sell as gamer gpu's are the exact same chip, only it suddenly has 512 cores and is clocked at 1.5GHz.

See the problem?

>GPUs are deterministic, all integer operations will give the same results a CPU
>would, fp operations may differ from CPU results, but the same input will always
>return the same result up to (I hope for a doc quoting the value) a given precision,
>if you don't rely on the bits beyond this point it's reliable.
>>As all calculations are parallel, in gpgpu you will not have deterministic behaviour.
>The GPU will, but buggy code relying on undefined behaviour won't.

Oh man you really have no clue at all about hardware.

If you have for example 24 compute units at AMD that each have their own independent instruction cache, it's obvious they will execute codes at different speeds.

Then after that the next wavefronts get scheduled.

This is completely non-deterministic.

Any claim of determinism shows your ignorance.
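Where the two views meet can be shown on a cpu. If partial results get combined in whatever order the scheduling happens to produce, integer sums stay identical, but floating point sums can change, because floating point addition is not associative. A minimal sketch; the helper name is made up, and an explicit loop is used because Python's built-in sum may apply compensated summation to floats.

```python
def reduce_in_order(values, order):
    """Add up values in the given index order, as a stand-in for the order
    in which parallel partial results happen to arrive."""
    total = values[order[0]]
    for i in order[1:]:
        total = total + values[i]
    return total

floats = [1e16, 1.0, -1e16]
ints = [10**16, 1, -10**16]

# Same inputs, two different arrival orders
float_a = reduce_in_order(floats, (0, 1, 2))
float_b = reduce_in_order(floats, (0, 2, 1))
int_a = reduce_in_order(ints, (0, 1, 2))
int_b = reduce_in_order(ints, (0, 2, 1))

print(float_a, float_b)   # the two float results differ
print(int_a, int_b)       # the two integer results are equal
```

So whether a gpgpu result is reproducible depends on whether the code lets the execution order leak into the arithmetic.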

>>How many errors do the gpu's make there however in calculations?
>Not more than CPUs in the same conditions...

CPU's are not in the same condition of course.

>>So not RAM errors, but errors of the execution units because of the huge heat and
>This makes Power7 unreliable?

Does power7 only have a single tiny aircooled fan that has to cool 400+ watt, of which at least 100 watt is over the pci-e spec?

>>without any software out there that checks for correctness.
>It's your option...
>>The cheapest nvidia to do serious gpgpu with is the Quadro 6000. The gamers cards all are lobotomized too much there.
>Now the important is DP and not 32 multiplication anymore?

Both are very important.

Once specialized transforms have been created, every field that manages to define such a 32 bits transform will of course move to it.

Note that this is not so easy; with prime numbers they already struggle a lot to get something working on a gpu that in the end would match the optimized SSE2+ DWT that George Woltman programmed.

Right now George's very efficient DWT on a dual core is totally on par with the latest Tesla.
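The post does not spell out the prime number workload. Assuming it is the Lucas-Lehmer test for Mersenne numbers, which is what Woltman's software is known for, a minimal sketch shows where the transform comes in: nearly all the time goes into one huge squaring per iteration. Python's built-in big integers do the squaring here; an optimized implementation would replace that step with a transform based multiplication.

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test: for a prime exponent p, returns True exactly when
    the Mersenne number 2**p - 1 is prime."""
    if p == 2:
        return True
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m    # the big squaring is the expensive part
    return s == 0

print([p for p in (3, 5, 7, 11, 13, 17, 19) if lucas_lehmer(p)])
```

For exponents in the millions the numbers being squared have millions of bits, so the speed of the multiplication transform decides the whole run time.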

The opensource community moves slowly there. Each timeframe the gpu software implementation gains 1% or so in efficiency.

Honestly i'm a bit amazed that the big math guys didn't get involved in improving it.