By: Vincent Diepeveen (diep.delete@this.xs4all.nl), April 20, 2011 12:37 pm
Room: Moderated Discussions
EduardoS (no@spam.com) on 4/20/11 wrote:
---------------------------
>Vincent Diepeveen (diep@xs4all.nl) on 4/20/11 wrote:
>---------------------------
>>I'll have to benchmark a lot anyway to find out which type of multiplication is fastest.
>
>Also, you can look at the generated code to see if there is more than one mul for each VLIW instruction...
There's a mulhi and a mullo in OpenCL; those are two different instructions.
>tip: I have looked, even knowing the answer...
>
>>Only Volkov mentions it.
>
>I didn't hear that from Volkov... BTW I think you are giving too much credit to
>him; there are many more people involved, some more involved and for longer...
Give a link to additional info in that case.
>>It's 4 cycles at both Nvidia as well as AMD GPU's before for simple instructions your result is available.
>
>It's 4 cycles for the instruction to be issued to all lanes; in ATI's case it's
>4 more cycles for another wavefront to issue instructions, and maybe for the first
>wavefront to execute, then finally 4 more cycles to issue the next instruction. Maybe
>a more correct number would be 16, because only after 16 cycles would the result
>be available; anyway, this doesn't matter.
It's a matter of benchmarking until you get what you want.
>BTW, in nVidia's case, starting with Fermi it's just 2 cycles to issue the warp; the number of cycles to execute varies.
>>An instruction can take 4 cycles, such as a trivial 'add', but you don't have your
>>results until the 5th cycle.
>
>Until the 9th cycle on AMD hardware.
That would be easy to test with a small test program. I'll do it, as this is pretty crucial.
>>Any claim of determinism shows your ignorance.
>
>Same conditions, same results = deterministic
Very wrong; you still have a lot to learn about parallelism.
Parallelism is by definition not deterministic.
Wavefronts on different compute units don't even start at the exact same moment, so for certain codes it's pretty crucial which one starts where.
It's very simple to execute the same program 3072 times; it's much harder to parallelize something that is inherently sequential and still get a massive speedup from it on a GPU, with just a little extra overhead compared to a CPU (say a factor 2 overhead over what a CPU needs).
>>Does POWER7 only have a single air-cooled micro-tiny fan that has to cool 400+ watts,
>>of which at least 100+ are over the PCIe spec?
>
>It's 250W if I remember correctly; for z196 it's even more, draw your own conclusions.
>
Google for FurMark and the Radeon 6990: it's eating 450+ watts.
The CUDA equivalent of this prime number code over here, on a Tesla, eats far over 405 watts as well; in short, the same number FurMark gives.
Vincent