gpgpu and core counts

Article: Predicting AMD and Nvidia GPU Performance
By: Vincent Diepeveen, April 22, 2011 2:11 am
Room: Moderated Discussions
Heikki Kultala on 4/21/11 wrote:
>The main point in my original message was to comment about using term "core" incorrectly.

First of all, see the AMD manuals: it is called a 'compute unit'.

So the 6990 has 2 x 24 = 48 compute units, each consisting of 64 PEs, for 3072 PEs in total.

The word 'core' simply isn't there. In AMD's case it is 48 compute units, each forming its own little world of 64 PEs. As you can see from David's description, AMD put effort into the design to fit the RAM in such a manner that they can put 2 GPUs on 1 card.
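The unit arithmetic above can be written out explicitly (a small sketch; the numbers are the ones from this post, for the dual-Cayman 6990):

```python
# Reproducing the PE count arithmetic from the post:
# Radeon HD 6990 = 2 Cayman chips, 24 compute units per chip,
# 64 PEs per compute unit.
chips = 2
compute_units_per_chip = 24
pes_per_compute_unit = 64

compute_units = chips * compute_units_per_chip    # 2 * 24 = 48
total_pes = compute_units * pes_per_compute_unit  # 48 * 64 = 3072

print(compute_units, total_pes)  # 48 3072
```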

Nvidia's design makes this nearly impossible; in exchange, they could use a wider memory path.

Also please note these are inherently 32-bit chips. Nvidia doesn't execute those vectors at once; that's why you get so few GFLOPS out of it.

If it did, you would be able to get quite a few GFLOPS out of it:

448 * 1.215 GHz * 4 (vector size) * 2 (multiply-add) = ~4.35 TFLOPS.

In practice it's not even 20% of that. The reason is of course that it's not executing those as vectors. Even Nvidia's marketing department, which is supposed to calculate things optimistically, gives the thing 1030 GFLOPS single precision. That is exactly the factor 4 of the vector size off, and you can bet they double-count the multiply-add.

Note I'm not claiming the next generation won't be a lot better; a 22 nm process is of course a big opportunity for everyone to improve their chips dramatically.

So right now, comparing one Nvidia core with 4 AMD PEs is inherently wrong, as in practice the performance is about on par. Put differently: 4 AMD PEs at the same clock are up to 4x as fast as 1 Nvidia PE on a throughput basis.

So if we discuss GPGPU, that is the only correct way to compare. In games (and please realize I have also produced games, so I know rather more about them than my simple English might suggest), it's about bandwidth to RAM and the clock speed of the GPU, so latency plays a role there as well.

Nvidia has superior latency compared to AMD GPUs, and the reason that shows up in those games is obviously that they weren't programmed perfectly for those GPUs.

If you look at most game graphics, these guys simply never learned basics such as what an algorithm is. In fact several of their 3D engines don't even sort their data, which with today's huge scenes would already speed up their graphics in a major way. A simple data structure that keeps things sorted, instead of having to traverse everything, is about the most basic method of speeding things up, yet it didn't get done in several of those 3D engines.
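The kind of "keep it sorted" structure meant here can be as simple as ordering draw items by a state key (say, a material id) so items sharing state sit adjacent and inserts cost a binary search instead of a full scan. A minimal sketch; the class name, keys, and scene items are made up for illustration:

```python
import bisect

# Hypothetical draw list kept sorted by material id, so batches with the
# same render state end up adjacent instead of scattered.
class SortedDrawList:
    def __init__(self):
        self._keys = []   # material ids, kept sorted at all times
        self._items = []  # draw items, kept parallel to _keys

    def add(self, material_id, item):
        # O(log n) search for the insertion point; no full traversal needed.
        i = bisect.bisect_left(self._keys, material_id)
        self._keys.insert(i, material_id)
        self._items.insert(i, item)

    def batches(self):
        # One linear pass yields the items already grouped by material.
        return list(zip(self._keys, self._items))

scene = SortedDrawList()
scene.add(3, "rock")
scene.add(1, "tree")
scene.add(3, "cliff")
scene.add(1, "bush")
print(scene.batches())  # [(1, 'bush'), (1, 'tree'), (3, 'cliff'), (3, 'rock')]
```

A real engine would sort on a packed key (shader, texture, depth) rather than a single id, but the principle is the same: the renderer walks one ordered list instead of rediscovering state groupings every frame.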

As a result they need massive GPU brute force power.

Not a single game fully exploits the possibilities of those graphics cards. The biggest bottleneck isn't in fact the 3D engine; those get improved whenever they need to be. It's the artwork: the designers need to draw everything they design by hand, so the bottleneck is simply the number of lines they draw.

All of that is totally different from GPGPU.

In GPGPU we care about how fast the PEs can deliver work for us.

To get optimum throughput performance out of both Nvidia and AMD, you need the same programming techniques. We can safely assume that within a few years the HPC manufacturers will add to this as well.

That said, the OpenCL compiler plays a large role, and in the long run you never know which manufacturer will deliver the best compiler there.

From the CPU world we know a compiler can really make or break a design.

>You seem to have read lots text between my lines that is not there

You are making a head-on comparison, and where you don't know the latest terminology as used by AMD, because you didn't even bother reading their documentation, you accuse me of using the wrong terminology.

>Vincent Diepeveen on 4/20/11 wrote:
>>Heikki Kultala on 4/20/11 wrote:
>>>Vincent Diepeveen on 4/20/11 wrote:
>>>>Heikki Kultala on 4/20/11 wrote:
>>>>>>It's simple. It's 3072 cores @ 0.88Ghz versus nvidia 448 cores @ 1.2Ghz. 3072 cores
>>>>>>always win by factor 3-4 then or so.
>>>>>>No discussions there.
>>>>>It's 24 ATI cores per chip versus 28-32 nvidia cores per chip.
>>>>>And, it's 384 ATI SPMD lanes per chip versus 448-512 nvidia SPMD lanes per chip.
>>>>>VLIW ALU != core
>>>>I wrote it in popular language, but that doesn't stop idiots like you.
>>>Says how things technically really are makes me an idiot?
>>Because first of all you say it wrong and also deliberately.
>>It has 2 gpu's and therefore 48 COMPUTE UNITS. Not 24.
>I Assume everyone can do the multiplication by 2 themselves.
>I just posted what one chip has.
>>Secondly you want to refer to some lobotomized overclocked gamers card of nvidia,
>>but let's just look at their topend gpgpu card, as we're discussing gpgpu here, not gamers.
>I did not refer to anything noone else referred before me.
>I used numbers "28-32" and "448-512" so that you can pick the model YOU want to
>compare to. (and I just said what the chip HAS, so I used also the full numbers
>even thought on most models some are disabled)
>>For gamers you can find dozens of others sites that are better and benchmarking
>>every game already.
>Now you are talking to wrong address, game performance is not very interesting for me.
>>>>It's 3072 PE's versus 448 PE's.
>>>Those are unit counts coming from marketting.
>>>Those are FP ALU counts. How to feed data to those matters.
>>>There are only 384 SPMD lanes per Cayman chip,
>>>so one chip can execute only 384 VLIW instructions per cycle (if the program counters
>>>are correctly aligned, if not, the worst case is 24 / chip)
>>there's 2 chips and you know this very well. You have to compare 48 compute units
>>and 3072 PE's with the 448 ones of nvidia.
>yes, I know that.
>>>But, those VLIW operations can actually include 5, not 4 operations total. (but
>>>the one has to be branch, only 4 FP operations).
>>>What makes this more complicated is how memory operations are handled.
>>>If memory operation is being handled, no ALU operations can execute at same time
>>>on ATI. I'm not sure how this goes on nvidia, might be similar.
>>Actually, description David gave of how the RAM functions is wrong, but as i'm
>>not a memory expert i'll not comment too much on it. I heard someone call it: "piece
>>of crap description. Sure it can already load it while the alu's work.
>ok, I was wrong in this one, I thought only one clause can start executing at same time.
>>>>What's going to be FASTER in well designed gpgpu codes?
>>>>0.83Ghz * 3072 PE's is *always* going to annihilate in well designed codes a meager 448 PE's @ 1.2Ghz Tesla.
>>>Did I say they would not? No.
>>But you try to scave off factor 2 of the performance the cayman delivers, as they
>>come in 2 gpu's at 1 card for 500 and a little euro's.
>No. I clearly said these are the numbers for single chip and I assumed you can
>do the multiplication by two and get the numbers if you are comparing performance
>of two-chip card. But it seems your brains cannot do multiplication by 2 without
>raising "someone is doing some incorrect comparison"-interrupt.
>>>I just told you are counting your cores incorrectly. And telling you are wrong seems to make me an idiot.
>>You're comparing in the wrong manner.
>>You must compare the number of PE's with each other as that's what the effective
>>speed is of both those gpu's that they'll deliver you if you write great code for it.
>>If your argumentation is: "but my code isn't optimal" that's not relevant in this discussion, see what i wrote.
>My code is quite good, but code of average coder is not.
>>>But if the code does not have any ILP, only parallelism between work items, then
>>>3/4 or those ATI ALU's are idling. This practically means badly optimized code.
>>We know from the well optimized codes that the better gpu coders get a higher %
>>out of AMD than out of Nvidia cards pre-tesla. Tesla is a big big improvement there,
>>yet even if you manage to get a slighly higher % out of it thanks to CUDA being
>>a better low level language than opencl (think of adding carries fast in cuda -
>>try that in opencl), that still is peanuts. If you manage to get 700 gflops out
>>of that tesla, counted in a theoretic manner (counting multiply-add as 2 flops), then you're a hero.
>>This whereas getting a practical 2 Tflop out of the 6990 is not even remote to hero status.
>Not a remote hero status, but might still require things like putting many "logical
>work items" into one actual work item.