By: David Kanter (dkanter.delete@this.realworldtech.com), July 25, 2012 11:28 am
Room: Moderated Discussions
someone (someone.delete@this.somewhere.com) on July 25, 2012 9:58 am wrote:
> David Kanter (dkanter.delete@this.realworldtech.com) on July 25, 2012 1:37 am wrote:
> > New computational efficiency data shows GPUs with a clear edge over CPUs, but
> > the gap is narrowing as CPUs adopt wide vectors (e.g. AVX). Surprisingly, a
> > throughput CPU is the most energy efficient processor, offering hope for future
> > architectures. Our data also shows some advantages of AMD's Bulldozer, and the
> > overhead associated with highly scalable server CPUs.
> >
> > Comments and feedback welcome!
> >
> > David
>
> Calling FLOPS/W and FLOPs/mm2 efficiency is highly misleading because it has no
> concept of effective FLOPs while doing something useful. The FP functional units
> of a general purpose MPU are a tiny fraction of device area and power budget. Why?
That's right, no cache, no branch prediction, no bypassing, no store forwarding, etc.
There's a reason why I focus on compute efficiency, as opposed to performance efficiency. Compute != performance.
> Everything else there SUPPORTS feeding those units over a huge spectrum of usage
> in terms of data access and control complexity without demanding quite unreasonable
> effort and methods for programming. GPUs have less silicon overhead per FP unit
> because they are much less generally useful. For HPC algorithms with complex data
> access and control paths GPUs are hugely inefficient and can only approach a tiny
> fraction of their theoretical peak FLOPs. Why hasn't anyone published a SPECfp
> score running entirely on a GPU yet? :-D
I totally agree. The easiest way to see that is to compare the cache per core for, say, IVB (2MB/core) vs. Fermi (guessing ~64KB/core).
> In a modern process I could tile a 200 mm2 die with nothing but FMACs and clock
> and power distribution and blow away everything on your graph but it would not be
> capable of anything useful. But hey, what "efficiency" woot!
I agree (see Tilera!). However, if you want to talk about realizable FP performance, now you need to pick a workload.
What workload should we use?
What meaningful workload has been run (and reported) on all of those systems?
The closest is Linpack, but that hasn't been run on the T4 for obvious reasons.
No, what this chart measures is the *BEST CASE* for a GPU (i.e. something akin to Linpack). Any real workload will change the positions substantially, and more complex ones will show that GPUs are less efficient.
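To make the compute vs. performance distinction concrete, here is a rough sketch of how a ranking by peak FLOPS/W can flip once a workload-dependent utilization is applied. Everything in it is an assumption for illustration: the chip names, the peak GFLOPS, the wattages and the utilization fractions are invented and are not taken from the article's data or from any real processor.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    name: str            # hypothetical label, not a real product
    peak_gflops: float   # theoretical peak (invented number)
    watts: float         # power budget (invented number)
    utilization: float   # fraction of peak sustained on some chosen workload (invented)

    def peak_per_watt(self) -> float:
        # "Compute efficiency": best case, assumes the FP units are always fed.
        return self.peak_gflops / self.watts

    def realized_per_watt(self) -> float:
        # "Performance efficiency": peak scaled by what the workload actually sustains.
        return self.peak_gflops * self.utilization / self.watts


def rank(chips, key):
    """Return chip names ordered from most to least efficient under `key`."""
    return [c.name for c in sorted(chips, key=key, reverse=True)]


# Two made-up designs: lots of FP units with little support logic,
# versus fewer FP units with large caches and complex control.
chips = [
    Chip("wide_throughput", peak_gflops=1000.0, watts=200.0, utilization=0.10),
    Chip("general_purpose", peak_gflops=200.0, watts=100.0, utilization=0.60),
]

by_peak = rank(chips, Chip.peak_per_watt)
by_realized = rank(chips, Chip.realized_per_watt)

print("ranked by peak GFLOPS/W:    ", by_peak)
print("ranked by realized GFLOPS/W:", by_realized)
```

With these made-up inputs the wide design wins on peak and loses on realized efficiency. The utilization figure is exactly the part that requires picking a workload, which is why the chart sticks to peak numbers.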
David