By: Patrick Chase (patrickjchase.delete@this.gmail.com), July 2, 2013 7:36 pm
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on July 2, 2013 5:47 pm wrote:
> Patrick Chase (patrickjchase.delete@this.gmail.com) on July 2, 2013 4:43 pm wrote:
> > anon (anon.delete@this.anon.com) on July 2, 2013 4:12 pm wrote:
> > > Patrick Chase (patrickjchase.delete@this.gmail.com) on July 2, 2013 10:03 am wrote:
> > > > You and Etienne are both off base, but Etienne is at least on the right path. If you're
> > > > actually interested in learning then take a look through this presentation:
> > > >
> > > > http://s08.idav.ucdavis.edu/fatahalian-gpu-architecture.pdf
> > > >
> > > > It's fairly dated (i.e. the number are hilariously outdated
> > > > some cases) but the concepts are presented correctly.
> > >
> > > This says nothing about whether GPU design will be more efficient than CPU design.
> >
> > It actually says quite a lot about that, for anybody with a basic understanding of microarchitecture.
> >
> > What is says is that GPUs are optimized for and extremely efficient at tasks with a very large number of
> > independent, mostly-similar (low code divergence) work items.
>
> Everyone knows that. What it does not say is *why* these structures and approaches
> are more efficient. Latency tolerance allowing lower clocks obviously, which is what
> I already mentioned. I'm sure there are many others from low level circuit design to
> microachitecture and uncore, which I don't know about.
The Fatahalian SIGGRAPH09 slides are actually fairly clear on this point, though perhaps only in a way that is obvious to somebody who already understands architecture. The following microarchitectural features serve to minimize or hide latency, and GPUs do not use them:
- Out-of-order execution
- Large caches
- Forwarding networks
- HW prefetching
If you look at a die shot of a modern core like Ivy Bridge or Haswell, the aforementioned features take up more area than do the actual functional units.
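To make the latency-hiding point concrete, here is a minimal CUDA sketch (mine, not from the slides): a memory-bound SAXPY kernel. The GPU covers the DRAM latency of the loads not with out-of-order execution, big caches or prefetchers, but simply by keeping far more threads resident than there are execution slots and switching warps whenever one stalls.

#include <cuda_runtime.h>

// Each thread handles one element; the load stalls on x[i] and y[i]
// are hidden because the scheduler issues from other ready warps.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Host-side launch: oversubscribe the machine so every SM always has
// runnable warps to pick from while others wait on memory.
void launch_saxpy(int n, float a, const float *d_x, float *d_y)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, a, d_x, d_y);
}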
In addition, GPUs are designed for workloads that have high code-path commonality, so they share instruction decode/issue across multiple threads (in other words, they are a form of SIMD).
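As a rough illustration of why code divergence matters (again my sketch, not from the slides): threads in a warp share one instruction stream, so when the branch below goes different ways within a warp, both arms are executed with inactive lanes masked off and the warp pays for both paths.

__global__ void divergent(const int *flags, const float *x, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // If flags[] differs between threads of the same warp, the shared
    // decode/issue logic runs both arms back to back with lanes masked,
    // roughly halving throughput for this section of code.
    if (flags[i])
        out[i] = sqrtf(x[i]) * 2.0f;
    else
        out[i] = x[i] * x[i] + 1.0f;
}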
The net impact is that GPUs are "mostly functional units" whereas traditional CPUs are "mostly cache/control". Which is better depends on whether the workload intrinsically requires the control/cache capabilities of the CPU.