By: Eric (eric.kjellen.delete@this.gmail.com), July 27, 2012 5:51 am
Room: Moderated Discussions
Chuck (dont.delete@this.thanks.com) on July 25, 2012 8:13 pm wrote:
> David Kanter (dkanter.delete@this.realworldtech.com) on July 25, 2012 11:28 am wrote:
> > someone (someone.delete@this.somewhere.com) on July 25, 2012 9:58 am wrote:
> > > David Kanter (dkanter.delete@this.realworldtech.com) on July 25, 2012 1:37 am wrote:
> > > > New computational efficiency data shows GPUs with a clear edge over CPUs, but
> > > > the gap is narrowing as CPUs adopt wide vectors (e.g. AVX). Surprisingly, a
> > > > throughput CPU is the most energy efficient processor, offering hope for future
> > > > architectures. Our data also shows some advantages of AMD's Bulldozer, and the
> > > > overhead associated with highly scalable server CPUs.
> > > >
> > > > Comments and feedback welcome!
> > > >
> > > > David
> > >
> > > Calling FLOPS/W and FLOPs/mm2 efficiency is highly misleading because it has no
> > > concept of effective FLOPs while doing something useful. The FP functional units
> > > of a general purpose MPU are a tiny fraction of device area and power budget. Why?
> >
> > That's right: no cache, no branch prediction, no bypassing, no store forwarding, etc.
> >
> > There's a reason why I focus on compute efficiency, as opposed to performance
> > efficiency. Compute != performance.
> >
> > > Everything else there SUPPORTS feeding those units over a huge spectrum of usage,
> > > in terms of data access and control complexity, without demanding quite unreasonable
> > > effort and methods for programming. GPUs have less silicon overhead per FP unit
> > > because they are much less generally useful. For HPC algorithms with complex data
> > > access and control paths, GPUs are hugely inefficient and can only approach a tiny
> > > fraction of their theoretical peak FLOPs. Why hasn't anyone published a SPECfp
> > > score running entirely on a GPU yet? :-D
> >
> > I totally agree. The easiest way to see that is to compare the cache for, say,
> > IVB (2MB/core) vs. Fermi (guessing ~64KB/core).
> >
> > > In a modern process I could tile a 200 mm2 die with nothing but FMACs and clock
> > > and power distribution and blow away everything on your graph, but it would not
> > > be capable of anything useful. But hey, what "efficiency", woot!
> >
> > I agree (see Tilera!). However, if you want to talk about realizable FP performance,
> > now you need to pick a workload.
> >
> > What workload should we use?
> >
> > What meaningful workload has been run (and reported) on all of those systems?
> >
> > The closest is Linpack, but that hasn't been run on the T4 for obvious reasons.
> >
> > No, what this chart measures is the *BEST CASE* for a GPU (i.e. something akin to
> > Linpack). Any real workload will change the positions substantially, and more
> > complex ones will show that GPUs are less efficient.
> >
> > David
>
> I think what Paul is getting at is that you're presenting a limited and artificial
> characterization of the efficiency of modern chips intended for high FP loads.
>
> For example, you repeatedly castigate Wilco for bringing up Dhrystone and CoreMark
> in defense of ARM processors as being unrepresentative. Yet here you're essentially
> presenting efficiency data for computational kernels, not real-world parallel FP
> applications.
>
> An uncharitable person would call you a hypocrite. You are a hypocrite.
>
> If you don't like SPECfp_rate then at least man up, get some Nvidia, Intel, and
> AMD/ATI chips and benchmark them with either SPLASH-2 (old), SPEChpc2002, or
> (preferably) PARSEC.
>
> Be a man.
Well, either that, or you take these numbers for what they're worth (just as you should take Dhrystone numbers for what they're worth, no more and no less) and then look at how other factors will affect real-world performance further down the road.

Benchmarking complete HPC systems would be very difficult, and it would introduce numerous factors, more or less relevant to a discussion of CPU vs. GPU architectural approaches to HPC, that can't be isolated.

If David Kanter has implied that you can draw conclusions about system-level performance in real-world applications based on these numbers alone, then I think he's wrong, but I'm not sure he has. I still think that peak DP performance (per unit of power or area) is an interesting data point, and by extension that a comparison of peak performance numbers has something to add to the discussion.
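To make that concrete, here is a back-of-the-envelope sketch of how the kind of peak numbers under discussion are derived (flops per cycle per core, times cores, times clock, divided by TDP). The per-chip figures below are rough 2012-era values from memory, not the article's actual data, so treat them purely as illustrative assumptions:

```python
# Peak DP GFLOPS and GFLOPS/W from first principles.
# All specs are approximate 2012-era figures, assumed for illustration.

chips = {
    # name: (DP flops/cycle/core, cores, clock in GHz, TDP in W)
    "Sandy Bridge-EP (AVX)": (8, 8, 2.9, 135),    # 4-wide DP add + 4-wide DP mul per cycle
    "Bulldozer (FMA4)":      (4, 8, 3.6, 125),    # two shared 128-bit FMACs per module
    "Fermi (Tesla-class)":   (1, 512, 1.3, 225),  # DP at half the SP FMA rate
}

for name, (flops_per_cycle, cores, ghz, tdp) in chips.items():
    peak = flops_per_cycle * cores * ghz  # peak DP GFLOPS
    print(f"{name:24s} {peak:7.1f} GFLOPS  {peak / tdp:5.2f} GFLOPS/W")
```

Run with these assumptions, the GPU comes out a few times ahead of either CPU on peak GFLOPS/W, which is exactly the kind of gap the chart shows, and exactly the number that says nothing yet about achievable performance.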
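And as a sketch of the "other factors further down the road" point: a simple roofline model, attainable FLOPS = min(peak compute, memory bandwidth x arithmetic intensity), already shows how far below peak a bandwidth-bound kernel lands on either architecture. The peak and bandwidth figures here are again rough assumptions, not measurements:

```python
# Roofline sketch: why real workloads land far below the peak chart.
# Peak GFLOPS and GB/s below are assumed ballpark 2012-era figures.

def attainable_gflops(peak_gflops, bw_gbs, flops_per_byte):
    """Attainable rate is compute-bound or bandwidth-bound, whichever is lower."""
    return min(peak_gflops, bw_gbs * flops_per_byte)

# High-intensity (DGEMM-like) vs. low-intensity (daxpy-like) kernels.
# daxpy: 2 flops per ~24 bytes of DP traffic, i.e. ~0.083 flops/byte.
for name, peak, bw in [("Fermi-class GPU", 665.0, 177.0),
                       ("Sandy Bridge-class CPU", 186.0, 51.2)]:
    for kernel, intensity in [("Linpack/DGEMM", 20.0), ("daxpy", 0.083)]:
        rate = attainable_gflops(peak, bw, intensity)
        print(f"{name:24s} {kernel:14s} {rate:6.1f} GFLOPS")
```

On the high-intensity kernel both chips run near peak; on the bandwidth-bound one both collapse to a few percent of it, and the ranking is set by the memory system, not the FP arrays. Which is just another way of saying: take peak numbers for what they're worth.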