Article: Parallelism at HotPar 2010
By: Anon (no.delete@this.email.com), July 28, 2010 4:18 pm
Room: Moderated Discussions
Thank you for your detailed reply; I hope my rather hurried initial query did not seem too critical.
I am also quite interested in the level of optimisations, because optimising GPU code is one of my major functions these days (well, also directing others in the right direction).
I cannot comment too directly on the codes you did use (though the information is useful), as we very rarely use pre-existing libraries ourselves and instead hand-tune all our code per application.
I would, as you say, assume that NVidia have competent matrix libraries, so long as the data layouts, etc. are suitable.
One of the most enlightening tools we use when tuning is a 'soak' application that runs semi-independently and can be asked to consume GPU compute, cache, external bandwidth, bus bandwidth, etc. as required. We use it to verify that our codes are fully utilising a specific area of capability, and it often shows us the areas we are not fully utilising, which is sometimes surprising.
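As a rough illustration of how results from such a soak run might be read (this is a guess at the workflow, not the actual tool described above): time the code alone, then time it again while the soak consumes one resource at a time. If the code barely slows down, it presumably was not using much of that resource. The function name, the resource labels and the 5% tolerance are all hypothetical.

```python
def headroom_report(baseline_s, soaked_s, tolerance=0.05):
    """Classify each resource from timings taken with a soak load running.

    baseline_s: runtime in seconds of the code on its own.
    soaked_s:   dict mapping a resource name to the runtime measured while
                the soak application consumed that resource.
    tolerance:  fractional slowdown treated as noise (assumed value).
    """
    report = {}
    for resource, runtime in soaked_s.items():
        slowdown = runtime / baseline_s - 1.0
        # A clear slowdown suggests the code competes for this resource;
        # no slowdown suggests it leaves that resource largely idle.
        report[resource] = "contended" if slowdown > tolerance else "headroom"
    return report


# Made-up timings, for illustration only.
print(headroom_report(1.0, {"compute": 1.5, "bus": 1.01}))
```

A 'headroom' entry for a resource the code was expected to saturate would be the surprising case worth investigating.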
I will be reading your mentioned papers with interest (once I can clear a few projects, grumble..)
Small-n is always an issue, and we often push these cases back to the CPU (well, never move them off the CPU), as they are not only inefficient on the GPU but can also cause large performance losses in other simultaneous tasks by stalling significant resources. Luckily many of our datasets are TByte-scale...
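A minimal sketch of the small-n decision, assuming a direct n^2 computation and a very crude cost model. Every constant here is an illustrative assumption rather than a measurement (the 640 Gflop/s GPU rate is borrowed from the large-n figure quoted below and would not hold at small n), and the model ignores the stalling of other simultaneous tasks, which in practice pushes the crossover further towards the CPU.

```python
# Hypothetical cost model; all default values are assumptions for illustration.
def gpu_time(n, flops_per_pair=20.0, gpu_flops=640e9,
             launch_s=10e-6, bytes_per_body=16, bus_bytes_per_s=5e9):
    """Estimated seconds on the GPU: fixed launch cost + bus transfer + compute."""
    transfer = 2 * n * bytes_per_body / bus_bytes_per_s  # copy in, copy out
    compute = flops_per_pair * n * n / gpu_flops
    return launch_s + transfer + compute


def cpu_time(n, flops_per_pair=20.0, cpu_flops=50e9):
    """Estimated seconds on the CPU: compute only, the data is already there."""
    return flops_per_pair * n * n / cpu_flops


def choose_device(n):
    """Pick whichever device the model predicts to be faster for this n."""
    return "gpu" if gpu_time(n) < cpu_time(n) else "cpu"


for n in (10, 100_000):
    print(n, choose_device(n))
```

With these made-up constants the fixed launch and transfer costs dominate at small n, so the small case stays on the CPU and only the large one is worth moving.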
As to question 2, I hope I didn't overstate things by implying that I had spoken to NVidia about these cases; I was simply reading the comment that NVidia seemed to indicate that these were not what they would consider target applications.
There are most definitely issues with how parts of NVidia choose to promote 'GPGPU' (a term that certainly did not originate with NVidia, and one I dislike), and I don't believe that many in 'the industry' really believe that GPUs are even close to being GP. They are a very good tool for a somewhat restricted subset of applications, where runtimes and datasets are suitably large, codes map well to the GPU model, and development times can be suitably 'extended'.
IMHO, the GP in GPGPU is heavily misunderstood. From my position it refers to the fact that, unlike not that long ago, these days you can use what appear to be 'normal' languages and write code that makes use of compute-type interfaces and functionality, whereas in the past using a GPU was a matter of making all problems look like image rendering, a much more difficult and limiting task.
I well remember when readback from a GPU was (artificially) limited to PCI speed on AGP buses, because the vendors did not care about readback.
To me, 'General Purpose' means the GPU is no longer just about transformation, lighting, and scan conversion.
I feel that our viewpoints are very aligned, and I would most strongly agree with your heterogeneous systems view. I do find it very tiresome that there are 2 other camps that seem to feel GPUs are a threat or a target for some reason.
1 - the 'parallel is too hard, and doesn't work anyway!' crowd, who want everything to be considered as scalar code, and only want faster CPUs - I also love faster scalar CPUs, with faster GPUs alongside!
2 - the 'GPUs are toys' brigade, who like to point out the weaknesses of GPUs (of which there are many!) and ignore their strengths; they often like to point to systems 20% faster with 20 times the budget.
I myself believe that the GPGPU approach has opened a whole new area of price/performance for a range of important codes. That range is somewhat limited, however, which is probably for the best - to make a truly general-purpose GPU would probably reduce its performance to that of a general-purpose CPU (surprise!), and Intel probably do that better anyway!
Rich Vuduc (richie@cc.gatech.edu) on 7/28/10 wrote:
---------------------------
>You raise two fair questions, which I'd like to address.
>
>[Question 1] Regarding how well tuned the GPU codes are, I'd be hypocritically
>violating Bailey's Rule #6 (use a bad baseline) if we didn't put in some effort.
>Whether the effort is *fair* is up for grabs, but I'll say the following:
>
>(a) For sparse matrix-vector multiply, the "best" GPU codes shown are the best
>of NVIDIA's implementations by Bell & Garland, as well as those of my student, Jee
>Choi. These are all tuned on the GPU fairly well.
>
>(b) For the sparse direct solver, we are using CUBLAS. It's debatable whether these
>implementations are the best out there, but one would trust they are reasonably
>well-tuned, though things could be better.
>
>(c) For the fast multipole method (FMM)---whose results are shown in single-precision,
>by the way---we did not write a Mickey Mouse code. The FMM uses a "direct 'n^2'
>n-body" computation as a subroutine. The updated version of our HotPar'10 talk slides,
>which appeared in a Dept. of Energy-hosted meeting called "SciDAC'10", show that
>when 'n' is sufficiently large this subroutine gets 640 Gflop/s (65% of peak) on
>a Fermi card. So, even if it's not the best code out there, I think we can claim
>it's not unreasonable. The only problem is that for the FMM, this subroutine has
>to run fast when 'n' is relatively much smaller, which is where the GPU advantage
>decreases. A fair question, then, is whether we can make this subroutine fast for
>small 'n'. We are actively doing this on the GPU this summer, because we see the
>GPU as an integral assistant in a full FMM code on a likely future system with both
>multicore CPU and GPU components. (Shameless plug: We are part of a team whose upcoming
>paper at Supercomputing'10 uses both CPU and GPU for the FMM.) Now, we're not done
>yet but hope to make significant in-roads for this case.
>
>[Question 2] You say that NVIDIA does not consider the computations we considered
>to be the prime targets of their systems. I suppose this is possible. However, they
>clearly have a high-performance computing strategy, and in HPC, we care about things
>like sparse iterative and direct solvers, as well as scalable n-body problems. I
>think physically-realistic games and graphics care about these things, too, though
>I'll admit right away that I'm not an expert on those kinds of apps. But just to
>throw it out there, a friend of mine at LucasArts, who led the physics engine development
>on The Force Unleashed game, uses a finite-element solver to simulate how objects
>deform when you use the force on them. So if the computations we care about are
>not within the scope of what a GPU should be good at, it begs the question in my
>mind of how "general-purpose" a GPGPU is.
>
>I'd like to conclude by saying that I'm a big believer in heterogeneous systems
>with GPU components! We do a lot of GPU work at Georgia Tech and are heavily investing
>our research efforts on how to use these systems. The only real point of the talk
>was to say that, for the benefit of the applications development community that
>has to spend time writing all this code, we should forget the marketing hype, set
>realistic expectations, and do the hard work of figuring out how best to use these
>computational resources and building better tools.
>
>-- Rich V. @ Georgia Tech
>
>Anon (no@email.com) on 7/27/10 wrote:
>---------------------------
>>David Kanter (dkanter@realworldtech.com) on 7/27/10 wrote:
>>---------------------------
>>>Today is shaping up to be an excellent day for a number of reasons:
>>>
>>>We have an excellent new contributor, Tarek Chammah. Tarek is a graduate student
>>>at the University of Waterloo who specializes in software approaches to parallelization,
>>>including run-times, languages, APIs, etc.
>>>
>>>I recently had the opportunity to go to lunch with Tarek and we had an excellent
>>>time and I learned quite a lot about the trends on the software side of the equation.
>>>One of the points that Tarek emphatically made is that with the emergence of parallel
>>>processing, the software is becoming equally important as the hardware. Times are
>>>a changing, and it's not just about your good old compiler getting code into shape
>>>for the hardware; software is truly an essential part of the glue that binds together
>>>the system, and I hope to be able to discuss software more in the future at RWT.
>>>
>>>Second, Tarek has provided us with an excellent article covering some of the highlights
>>>of the HotPar 2010 workshop. Hot Par was held in Berkeley this year, and included
>>>a fair number of papers - but almost all of them were software focused. This is
>>>a nice change of pace from our usual coverage of ISSCC, Hot Chips or IEDM:
>>>
>>>http://www.realworldtech.com/page.cfm?ArticleID=RWT072610001641
>>>
>>>Please join me in thanking Tarek for his contribution, and I look forward to some lively discussions.
>>>
>>>
>>>David
>>
>>I would most certainly agree that more input on this subject is most welcome.
>>
>>I have two discussion questions for the section "The Limits of GPUs"
>>
>>Firstly, a lot of this content seems to run along the lines of 'untuned Intel xxx
>>code was a lot slower than the GPU, but then we spent a lot of time rewriting/tuning
>>the CPU code, and it got faster!' however no mention seems to be made of similar tuning efforts in GPU code.
>>I think most of the people involved in GPU Cuda programming will agree that it
>>is significantly HARDER to extract full potential from GPU code, although for 'suitable'
>>codes the gains are even larger - this looks/feels like one of these cases.... highly
>>tuned CPU code versus basic GPU code.
>>
>>It is also interesting (and the full information is not presented) that in the
>>second group of cases, we seem to be comparing DP codes on Tesla C1060, rather than the most certainly current 2070.
>>Now, a C1060 has around an 8:1 SP:DP ratio. The 2060 closer to 2:1, and nearly
>>7 TIMES the peak DP with a single GPU than the 1060... I do not doubt that the Nehalem
>>system is not the fastest current either, however I doubt a system 7 times faster could be found.
>>Secondly in this case, the codes being looked at are, as NVidia appears to have
>>pointed out, not really prime targets of their systems anyway (and yet their OLD systems do pretty well).
>>
>>Now, this could be seen as valid 'limits' of GPUs.
>>1 - Older implementations (and some current) are not great at DP.
>>2 - GPUs are very optimisation sensitive (tools are quite new, and they are not that flexible compute devices)
>>3 - GPU performance varies strongly, not all target applications are suitable.
>>
>