By: anon (anon.delete@this.anon.com), July 2, 2013 9:03 pm
Room: Moderated Discussions
Patrick Chase (patrickjchase.delete@this.gmail.com) on July 2, 2013 7:36 pm wrote:
> anon (anon.delete@this.anon.com) on July 2, 2013 5:47 pm wrote:
> > Patrick Chase (patrickjchase.delete@this.gmail.com) on July 2, 2013 4:43 pm wrote:
> > > anon (anon.delete@this.anon.com) on July 2, 2013 4:12 pm wrote:
> > > > Patrick Chase (patrickjchase.delete@this.gmail.com) on July 2, 2013 10:03 am wrote:
> > > > > You and Etienne are both off base, but Etienne is at least on the right path. If you're
> > > > > actually interested in learning then take a look through this presentation:
> > > > >
> > > > > http://s08.idav.ucdavis.edu/fatahalian-gpu-architecture.pdf
> > > > >
> > > > > It's fairly dated (i.e. the numbers are hilariously outdated in
> > > > > some cases) but the concepts are presented correctly.
> > > >
> > > > This says nothing about whether GPU design will be more efficient than CPU design.
> > >
> > > It actually says quite a lot about that, for anybody with a basic understanding of microarchitecture.
> > >
> > > What it says is that GPUs are optimized for and extremely efficient at tasks with a very large number of
> > > independent, mostly-similar (low code divergence) work items.
> >
> > Everyone knows that. What it does not say is *why* these structures and approaches
> > are more efficient. Latency tolerance allowing lower clocks obviously, which is what
> > I already mentioned. I'm sure there are many others from low level circuit design to
> > microarchitecture and uncore, which I don't know about.
>
> The Fatahalian SIGGRAPH09 slides are actually fairly clear on this point, but maybe only in a way
> that would be obvious to somebody who already understands architecture. The following microarchitectural
> features serve to either minimize or hide latency, and are not used in GPUs:
>
> - Out-of-order execution
>
> - Large caches
>
> - Forwarding networks
>
> - HW prefetching
See, this is a pretty handwavy list. You could list every difference between a CPU and a GPU and say "that's why it's more efficient".
But if you look at BG/Q's PPC A2 for example, it has a number of these things. It has large caches and HW prefetching, and it does pretty significant speculation on branches and transactions (although not OOOE). I'm not sure if it has forwarding networks; probably it does within the A2 integer core, but not within the vector units. It has significant capabilities for data sharing, and its vectors are not so large, nor their latency so high, that it needs vast numbers of threads and registers to fill the units.
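(The thread/register point is just Little's law applied to a pipeline. As a rough sketch, with illustrative latencies rather than vendor specs:

    \text{independent ops in flight} \;\ge\; \text{latency} \times \text{issue rate}

A 6-cycle FMA pipe issuing one op per cycle needs only about 6 independent operations in flight, which a handful of hardware threads plus modest unrolling cover easily; a ~20-cycle GPU pipeline fed 32 lanes at a time needs hundreds of values in flight, hence the huge register files.)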
Small OOOE ARM cores like the A9r4, if coupled with a large vector unit like Xeon Phi's, could achieve the above with OOOE as well, at reasonable flops/W.
Caches don't use a lot of power, for example. They can even save more power than they consume, even on FP codes.
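To see why, price the memory traffic: a DRAM access costs on the order of a nanojoule while an on-chip cache hit is tens of picojoules (ballpark figures, not measurements), so any code with reuse comes out ahead. A minimal C sketch of the standard tiling trick (the TILE value is an assumption, not tuned for any particular part):

    /* Tiled matrix multiply: C += A * B, all n x n, row-major.
     * Caller must zero C first. Each TILE x TILE block of B is
     * reused from cache instead of being re-streamed from DRAM. */
    #include <stddef.h>

    #define TILE 64 /* assumed to fit comfortably in L1/L2 */

    void dgemm_tiled(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    for (size_t i = ii; i < ii + TILE && i < n; i++)
                        for (size_t k = kk; k < kk + TILE && k < n; k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + TILE && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

The untiled loop streams B from memory roughly n times; the tiled one cuts that traffic by about a factor of TILE, which is how the cache pays for its own power.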
I think clock speed is definitely a common denominator. At 1/2-1/4 the clock of high-performance CPUs, logic seems to be more efficient. But this goes for low-clocked general-purpose CPUs too, not just GPUs.
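The first-order reason is voltage scaling; as a sketch (idealized CMOS, real parts won't track it exactly):

    P_{dyn} \approx C_{eff} V_{dd}^{2} f
    \quad\Longrightarrow\quad
    E_{op} = P_{dyn}/f \approx C_{eff} V_{dd}^{2}

If running at half the clock lets V_dd drop by ~30% (an assumed but typical DVFS range), energy per op falls to about 0.7^2, i.e. half of baseline: roughly 2x flops/W from the same logic, GPU or CPU alike.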
Result forwarding is probably a factor too, but I simply don't know enough to quantify it. Do you? I was after something quantifiable, rather than just bullet points of things that may or may not improve flops/W much.