By: Symmetry (someone.delete@this.somewhere.com), July 3, 2013 6:14 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on July 2, 2013 9:03 pm wrote:
> Patrick Chase (patrickjchase.delete@this.gmail.com) on July 2, 2013 7:36 pm wrote:
> > The Fatahalian SIGGRAPH09 slides are actually fairly clear on this point, but maybe only in a way
> > that would be obvious to somebody who already understands
> > architecture. The following microarchitectural
> > features serve to either minimize or hide latency, and are not used in GPUs:
> >
> > - Out-of-order execution
> >
> > - Large caches
> >
> > - Forwarding networks
> >
> > - HW prefetching
>
> See, this is a pretty handwavy list. You could list every difference
> between a CPU and a GPU and say "that's why it's more efficient".
It's almost as if designers of GPUs and CPUs selected the features they incorporated based on the workload...
> But if you look at BG/Q/PPC-A2 for example, it has a number of these things. It has large caches, HW
> prefetching, and it does pretty significant speculation (although not OOOE) of branches and transactions.
> Not sure if it has forwarding networks. Probably it does within the A2 integer core, but not within
> vector units. It has significant capabilities for data sharing, and its vectors are not so large and
> latency so high that it needs vast numbers of threads and registers to fill the units.
Yes, and the PPC-A2 isn't a GPU, so that's to be expected.
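The "vast numbers of threads to fill the units" point is easy to put numbers on with a Little's-law-style estimate: while one thread waits on memory, enough peers must have work to keep the units busy. A back-of-envelope sketch (the latency and compute figures are made up for illustration, not measurements of any real chip):

```python
# Back-of-envelope: how many threads are needed to hide memory latency.
# Assumed cycle counts are illustrative only.

def threads_to_hide_latency(mem_latency_cycles, compute_cycles_per_thread):
    """While one thread stalls for mem_latency_cycles, the other threads
    must supply enough compute to cover the stall (ceiling division)."""
    return 1 + -(-mem_latency_cycles // compute_cycles_per_thread)

# GPU-like case: long uncached latency, little compute between loads.
print(threads_to_hide_latency(400, 10))  # -> 41 threads in flight
# BG/Q-like case: caches and prefetch cut the effective latency.
print(threads_to_hide_latency(40, 10))   # -> 5 threads in flight
```

Which is the whole trade: caches and prefetching shrink the effective latency so a handful of threads suffice, while a GPU instead carries enough thread state to ride out the full latency.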
> Small OOOE ARMs like A9r4, if coupled with a large vector unit like Xeon Phi,
> would be able to achieve the above with OOOE as well, at reasonable flops/W.
Would the vector unit have all the normal OOOE attachments in this example? I know NVidia GPUs have used scoreboards to select the next instruction from a pool of ready threads, and selecting the next instruction out of order wouldn't be that much more complicated, since you're amortizing the cost of the larger scheduler over the width of the vector. But even then you have to worry about keeping a consistent state in the event of an interrupt, and that means a storage requirement that really does grow with the width of the vector. And if you want to do the whole register-renaming rigmarole that your typical OOOE CPU does, then you can totally forget about keeping flops/W parity with a dedicated GPU.
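To make the scoreboard idea concrete, here's a toy sketch (my own illustration, not a model of any real GPU's scheduler) of picking the next instruction from a pool of ready threads. Each warp only exposes its oldest instruction, so the scoreboard check stays cheap; letting each warp issue out of order would mean tracking readiness for every in-flight instruction instead:

```python
# Toy scoreboard scheduler: each cycle, issue from whichever warp has
# its oldest instruction ready. Purely illustrative pseudocode-in-Python.

from collections import deque

class Warp:
    def __init__(self, name, instrs):
        self.name = name
        # Each entry: (label, cycle at which operands become ready)
        self.instrs = deque(instrs)

def issue(warps, cycle):
    """Scan the pool and issue the first warp whose oldest instruction
    has all operands ready (a simple scoreboard check)."""
    for w in warps:
        if w.instrs and w.instrs[0][1] <= cycle:
            label, _ = w.instrs.popleft()
            return (w.name, label)
    return None  # every warp is stalled this cycle

warps = [
    Warp("w0", [("load", 0), ("fma", 8)]),  # the fma waits on the load
    Warp("w1", [("fma", 0), ("fma", 1)]),
]

for cycle in range(4):
    print(cycle, issue(warps, cycle))
```

Note that per-warp state here is just a queue of decoded instructions; no rename tables, no reorder buffer. That's the storage cost that stays modest in-order and balloons once you want precise interrupts across a wide vector.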
But to the extent that you're spending area on the CPU core, your chips are going to be less cost-efficient than those of a competitor in the GPU space who just makes good GPUs. And to the extent that the memory subsystem is tuned to favor throughput over latency, your A9 isn't going to perform well compared to one embedded in a more appropriate cache hierarchy, so much so that I'm not sure an A9 would actually give much benefit over an A7.
> Caches don't use a lot of power, for example. They can even save more than they use, even on FP codes.
No, they don't, but they do use a lot of area. If you're looking to be cost-effective and sell GPUs to consumers, you wouldn't want to spend that area on large caches. If you're just designing a chip to win a green-supercomputer contest, then knock yourself out.
> Result forwarding is probably a factor too, but I simply don't know enough to quantify. Do you? I was after
> something quantifiable rather than just dot points of things that may or may not improve flops/w much.
That's going to be very, very implementation-dependent. The best way to get a sense of it, I think, is to read papers by academics who implemented cores while adding or removing various features. Like seeing what OOOE does to power and speed here:
http://hps.ece.utexas.edu/people/khubaib/pub/morphcore_micro2012.pdf