Article: Parallelism at HotPar 2010
By: Richard (no.delete@this.email.com), August 25, 2010 12:03 pm
Room: Moderated Discussions
>Look it's rather trivial, if you want to program some problem onto the GPU that
>you previously did do at the cpu's, then also in my calculations the speedup is
>roughly 5x to 10x, pure theoretic spoken; not practical yet, over a quadcore.
How can you make such a claim? It totally depends on the algorithm at hand. When porting algorithms to the GPU you have to take into account issues such as SIMD width, the number of concurrent kernels that can be run, register pressure (use too many registers and you cannot run enough 'threads' to hide the latency to main memory), finding an algorithm which can make effective use of the local store, having enough ILP to feed ATI's VLIW, and much more. Ah wait, it is theoretical, so are we counting ALUs?
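To make the register-pressure point concrete, here is a back-of-the-envelope sketch. The register-file size and latency figures are hypothetical round numbers I picked for illustration, not specs of any actual part; the mechanism, not the numbers, is what matters:

```python
def threads_resident(regfile_entries, regs_per_thread):
    """How many threads fit on one SIMD given its register budget."""
    return regfile_entries // regs_per_thread

def latency_hidden(threads, indep_ops_per_thread, mem_latency_cycles):
    """Rough check: is there enough independent work in flight to
    cover one main-memory round trip?"""
    return threads * indep_ops_per_thread >= mem_latency_cycles

# A lean kernel (20 registers/thread) keeps far more threads resident
# than a register-hungry one (60 registers/thread):
lean = threads_resident(16384, 20)   # 819 threads in flight
fat  = threads_resident(16384, 60)   # 273 threads in flight

# With ~400 cycles of memory latency to cover, only the lean kernel
# has enough concurrency to hide it:
print(latency_hidden(lean, 1, 400))  # True
print(latency_hidden(fat, 1, 400))   # False
```

The same kernel logic can therefore run at wildly different fractions of peak depending on how many registers the compiler spends per thread, which is exactly why blanket "5x to 10x" speedup claims are meaningless.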
>Yet that advantage
>
>a) is only for AMD, not for nvidia
>
>Nvidia is SLOWER in most gpgpu related tasks.
No, it is not.
>Amazingly all postings i see everywhere are regarding nvidia, ...
There is no conspiracy; Nvidia's development environment is just miles ahead of ATI's.
>... but realize nvidia
>has 240 cores and AMD has 3200 streamcores. That's more like factor 10 more or something.
That makes no sense whatsoever. The number of ALUs says nothing about total performance; those 'number of cores' figures exist for the marketing guys, who try to market ALUs as cores.
GPUs don't really have cores, but rather SIMDs. Fermi has 16 of them; ATI's Evergreen has 20. You can run 16 concurrent kernels on Fermi, but just one on ATI at the moment (the hardware probably supports 4). To make things more complex, ATI's SIMDs are 16 lanes wide with a 5-wide VLIW unit per lane, while Nvidia's SIMDs are now also 16 wide (though exposed as 32 wide) and scalar. Nvidia's SIMDs also clock much higher than ATI's do. There is an excellent article by David Kanter on this very site about Fermi, which describes Fermi's architecture in great detail.
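A quick peak-throughput sketch shows why the "3200 vs 240 cores, factor 10" comparison falls apart. The clock speeds below are ballpark figures I am assuming for the parts under discussion, not exact specs, and peak FLOPS itself says nothing about achievable utilization:

```python
def peak_gflops(simds, alus_per_simd, flops_per_alu, clock_ghz):
    """Peak single-precision throughput from the raw ALU layout."""
    return simds * alus_per_simd * flops_per_alu * clock_ghz

# ATI Evergreen: 20 SIMDs x (16 lanes x 5-wide VLIW) ALUs,
# ~0.85 GHz, 2 flops/ALU (multiply-add):
ati = peak_gflops(20, 16 * 5, 2, 0.85)   # ~2720 GFLOPS

# Fermi: 16 SIMDs x 32 scalar ALUs, ~1.4 GHz hot clock, 2 flops/ALU:
nv = peak_gflops(16, 32, 2, 1.4)         # ~1434 GFLOPS

# ~6.7x more ALUs, but only ~1.9x the peak FLOPS once clocks and
# issue width are accounted for -- and the VLIW side only reaches
# its peak if all 5 slots per lane can be filled every cycle:
print(ati / nv)
```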
>Now of course with many games that run on gpu's, it is the case that this opengl
>and directx simply doesn't parallellize THAT WELL, that it can use the full potential
>of 3200 streamcores; otherwise in every benchmark of course nvidia would lose it bigtime to AMD.
Actually, games can use those 'cores' really well. That is why ATI's architecture does so well.
>Yet in gpgpu, you CAN use all those streamcores very nice, if you do big effort.
No, effort will not help you if your algorithm does not map onto a SIMD architecture. Furthermore, on ATI the DirectX driver can use the concurrent-kernel support in the hardware, while OpenCL programmers cannot.
>reports are right now 25%-30% IPC at nvidia and 50% at AMD that the gpu programmers achieve for the same application.
Uhm, compared to CPUs? Where did you find such numbers?
Look, in my view the GPU is not in direct competition with the CPU; it is just that some problems, not most, map really well to the SIMD paradigm, and that makes the GPU interesting for that subset of problems. Eventually we'll have CPUs with better SIMD units, like Larrabee's SIMD extensions. If OpenCL is supported better by then, you will be able to run your parallel algorithm on the GPU, the CPU, or both. CUDA is a nice stop-gap solution in the meantime.
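The portability promise is easy to see in miniature: an OpenCL-style kernel is just a function of a global work-item index, so the same kernel body can be dispatched to whatever device the runtime targets. Below, a plain Python loop stands in for a CPU runtime; this is an illustration of the execution model only, not a real OpenCL binding:

```python
def saxpy_kernel(gid, a, x, y, out):
    # One work-item: out[gid] = a * x[gid] + y[gid], written exactly
    # as an OpenCL kernel body would be, indexed by global id.
    out[gid] = a * x[gid] + y[gid]

def enqueue(kernel, global_size, *args):
    # A CPU 'runtime' sweeps the index space sequentially; a GPU
    # runtime would map the same gids onto SIMD lanes instead.
    for gid in range(global_size):
        kernel(gid, *args)

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
out = [0.0] * 3
enqueue(saxpy_kernel, 3, 2.0, x, y, out)
print(out)   # [12.0, 24.0, 36.0]
```

Because the kernel never assumes how the index space is scheduled, the decision of GPU vs CPU vs both becomes a runtime choice rather than a rewrite.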