By: Patrick Chase (patrickjchase.delete@this.gmail.com), July 3, 2013 9:04 pm
Room: Moderated Discussions
Symmetry (someone.delete@this.somewhere.com) on July 3, 2013 7:14 am wrote:
> anon (anon.delete@this.anon.com) on July 2, 2013 9:03 pm wrote:
> > Small OOOE ARMs like A9r4, if coupled with a large vector unit like Xeon Phi,
> > would be able to achieve the above with OOOE as well, at reasonable flops/W.
>
> Would the vector unit have all the normal OOOE attachments in this example?
> I know NVidia GPUs have used scoreboards to select the next instruction from a
> pool of ready threads, and selecting the next instruction out of order wouldn't be
> that much more complicated, since you're amortizing the cost of increasing the
> scheduler size over the width of the vector.
They're already partially OoO (in the same sense that the CDC 6600 was).
NVIDIA GPUs from at least the Fermi and Kepler generations support OoO retirement, but not OoO issue. For example, if you do a load that misses the cache (a latency of 200-400 clocks), the SM[X] will continue to issue, execute, and retire instructions from the same thread/warp until it encounters a subsequent instruction that uses the result of the load. NVIDIA's optimization guide actually recommends issuing loads well before their results are needed for exactly this reason.
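To make that concrete, here's a minimal CUDA sketch of the pattern the guide recommends (the kernel and names are my own illustration, not NVIDIA's code): each iteration issues the next iteration's load before consuming the current one, so the warp always has independent work to issue while a load is outstanding.

__global__ void scale_pipelined(const float *in, float *out, float a, int n)
{
    int stride = blockDim.x * gridDim.x;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;

    float cur = in[j];                 // first load, issued up front

    for (; j + stride < n; j += stride) {
        float nxt = in[j + stride];    // next iteration's load: issued now,
                                       // nothing below depends on it yet
        out[j] = a * cur;              // this work overlaps the outstanding load
        cur = nxt;                     // first use of the prefetched value
    }
    out[j] = a * cur;                  // last element handled by this thread
}

Whether the compiler would do this hoisting for you in a case this simple is beside the point; the point is that issue only stops at the first consumer of a load, not at the load itself.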
Those GPUs don't have reorder buffers, so the net impact of all this is that they don't implement precise interrupts. That's not a big deal in a dedicated compute engine like a GPU, but it's a big no-no in a general-purpose CPU, which brings us to:
> But even then you have to worry about having a consistent state in
> the event of an interrupt, and that means a storage requirement that really does
> grow with the width of the vector.
Yep. The fact that general-purpose CPU architectures like x86, ARM, and POWER require precise interrupts adds a nontrivial cost penalty if you want to do any sort of reordering.
The way Intel dealt with this in Xeon Phi was to redefine the architectural exception model: there are no traps for floating-point exceptions. If you want to know whether something bad happened, you have to check the corresponding flags. I don't know what IBM did with the PPC A2, but I wouldn't be surprised to see something similar.
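For concreteness, here's what "check the flags" looks like in practice. This is a host-side C++ sketch of the flag-polling model using the standard floating-point environment, not actual Xeon Phi code: an exceptional operation just sets a sticky flag instead of trapping, and it's up to the program to look at the flags when it cares.

#include <cfenv>
#include <cstdio>

int main()
{
    std::feclearexcept(FE_ALL_EXCEPT);        // start with clean flags

    volatile float num = 1.0f, den = 0.0f;    // volatile so the compiler can't
                                              // fold the division away
    float r = num / den;                      // sets FE_DIVBYZERO; no trap is taken

    // The "did something bad happen?" check is explicit and after the fact.
    if (std::fetestexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW))
        std::printf("FP exception flag set, result = %f\n", r);

    return 0;
}

The trade-off is the one described above: you give up trapping at the faulting instruction, but the hardware no longer has to pinpoint exactly which in-flight operation misbehaved in order to deliver a precise exception.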