By: jp (jipe4153.delete@this.gmail.com), July 27, 2012 9:18 am
Room: Moderated Discussions
Eric (eric.kjellen.delete@this.gmail.com) on July 27, 2012 7:57 am wrote:
> jp (jipe4153.delete@this.gmail.com) on July 27, 2012 7:08 am wrote:
> > The idea that it's easier to squeeze more theoretical performance with
> > multi-threading and SSE instructions on a CPU is unfortunately not true.
>
> I agree, but the question is if the superior programmability and
> better performance in some branching and data irregular workloads (and maybe
> most importantly, more consistent performance across many workloads), together
> with Intel's growing process technology advantage and maybe also advanced
> packaging (such as TSV stacking of DRAM to enable very high memory bandwidth for
> SIMD applications) will turn out to be the killer advantages of the CPU.
>
The point is that most workloads, though not all, have enough fine-grained parallelism to exploit SIMD capabilities.
As for bandwidth, GPUs already have the fastest RAM out there (over 250 GB/s), and they have no reason not to keep this lead (see the FLOPS/bandwidth ratio "issue" mentioned earlier).
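To put a rough number on the FLOPS/bandwidth ratio, here is a minimal sketch of the usual bandwidth-bound estimate. The 250 GB/s figure is the one quoted above; the kernel (single-precision SAXPY) and the function name are my own assumptions, chosen purely for illustration, and caches are ignored.

```python
def bandwidth_bound_gflops(bandwidth_gb_s, flops_per_element, bytes_per_element):
    """Upper bound on GFLOP/s for a kernel limited purely by memory traffic."""
    # Arithmetic intensity: floating-point operations per byte moved.
    arithmetic_intensity = flops_per_element / bytes_per_element
    return bandwidth_gb_s * arithmetic_intensity


# Assumed example kernel: single-precision SAXPY, y[i] = a * x[i] + y[i].
SAXPY_FLOPS = 2   # one multiply and one add per element
SAXPY_BYTES = 12  # read x[i], read y[i], write y[i], 4 bytes each

saxpy_bound = bandwidth_bound_gflops(250.0, SAXPY_FLOPS, SAXPY_BYTES)
print(f"{saxpy_bound:.1f} GFLOP/s")  # prints "41.7 GFLOP/s"
```

Under these assumptions a streaming kernel tops out far below the peak arithmetic rate of the chip, so for such workloads memory bandwidth, not FLOPS, sets the ceiling.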
> My answer is yes, not least because of the many historical precedents where an
> opportunity to eliminate a co-processor at the expense of raw performance and
> complexity, and to the benefit of consistency and programmability/flexibility,
> has been pursued with a great deal of enthusiasm by the industry. Let's not
> forget that the origins of the GPUs lay in their ability to accelerate 3D
> graphics when the CPU was no longer enough. They are not a natural, inescapable
> feature of general-purpose computing.
>
You just agreed that it is not easier to develop a high-performance solution for a CPU. Don't you think GPU vendors will continue to churn out more features and high-level abstractions to simplify development? The answer is yes; examples are OpenACC, which Nvidia backs, and PGI's CUDA Fortran compiler.
> And for large-scale HPC deployments, we
> have already seen that throughput-optimized CPUs are (at least for the moment)
> preferred by the most demanding customers and that they can deliver
> significantly higher efficiency than CPU + GPGPU systems even with regard to
> peak theoretical performance, not to speak of the performance that will
> realistically actually be achieved.
Looking at the articles on HPCwire about new clusters over the last 1.5 years, it's obvious that almost everyone is buying Nvidia Tesla cards. In fact, ORNL's newest cluster (the fastest in the US) will be based on the new Kepler cards (K20).