By: Patrick Chase (patrickjchase.delete@this.gmail.com), July 2, 2013 12:34 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on July 2, 2013 12:12 pm wrote:
> ⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on July 2, 2013 11:38 am wrote:
> >
> > Well, isn't it true that it is possible to utilize FP registers and FP logic in integer
> > workloads? This would imply that basic FPU operations such as addition, subtraction,
> > multiplication and comparison need to run about as fast as in the integer ALU. Moves
> > between FP registers and [INT registers or memory] need to be fast as well.
>
> That's particularly stupid.
Agreed, but for slightly different reasons (see post).
> Any CPU where FP addition is as fast as integer ops is a f*cking disaster. FP addition
> is fundamentally much more complicated than an integer add, and if that doesn't show
> up in timing, the CPU is pure and utter shit. It really is that simple.
When you say "as fast", are you speaking of throughput or latency? If you mean latency, then I agree 100%. Just walk through the steps an FP add has to perform - compare exponents, align the mantissas, add, renormalize, round - and it's clear it needs many more gate delays than an integer add, and therefore has higher latency (and I'm not even considering things like denorm handling - it's true even if you enable FTZ).
With that said, most modern FP units are fully pipelined and do indeed have the same throughput as an integer ALU, provided you can find enough ILP to hide the latency and keep the pipeline fed. Those pipelines also cost a lot more than integer ALUs, of course.
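To make the latency-vs-throughput distinction concrete, here's a minimal C sketch (my own illustration, not anything from this discussion; function names are made up and cycle counts vary by core):

float sum_serial(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];   /* each add depends on the previous result, so the
                        loop runs at FP-add *latency*, one add every few cycles */
    return s;
}

float sum_unrolled(const float *a, int n)
{
    /* four independent accumulators expose enough ILP to keep the
       pipelined adder busy, so the loop approaches FP-add *throughput*
       (remainder handling omitted for brevity) */
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i+1];
        s2 += a[i+2];
        s3 += a[i+3];
    }
    return (s0 + s1) + (s2 + s3);
}

The serial version is bound by FP-add latency because every iteration waits on the last; the unrolled version has four independent dependence chains, which is exactly the "find enough ILP" condition above.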
I just realized that I forgot to spell out a third reason why this doesn't make sense on modern CPUs in my previous post: issue port restrictions. Cores like SB/IB/Haswell feed multiple functional units from each issue port, and those units are of different "types". For example, on Haswell and SB/IB port 0 feeds an integer unit, AVX Vmul/Vshift, and AVX FMA/Fblend. This means that substituting FP code for integer code doesn't increase the number of operations you can issue concurrently, so once again it doesn't pay even if the FP HW is a sunk cost (in which case the complexity argument you raised is moot).
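As a rough sketch of what "use the FP unit for integer work" would even look like (my own code and names, purely for illustration):

void add_arrays_int(int *dst, const int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];               /* plain integer-ALU work */
}

void add_arrays_via_fp(int *dst, const int *a, const int *b, int n)
{
    /* Routing the same adds through the FP unit doesn't buy anything:
       the FP/AVX units hang off the same issue ports as the integer
       ALUs (e.g. port 0 above), so you get no extra issue slots, and
       you pay int<->double conversions and FP-add latency on top. */
    for (int i = 0; i < n; i++)
        dst[i] = (int)((double)a[i] + (double)b[i]);
}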
> Floating point is not only fundamentally slower and more complex than integer math,
> it's also fundamentally more likely to lead to bugs. The precision and
> underflow/overflow behavior of floating point is really really complicated, and
> easy to get wrong. To the point that it is not horribly uncommon to do the exact
> reverse of what you suggest: instead of using the FPU for integer math,
> lots of competent people use the integer unit for FP (where the "F" then is often
> for "Fixed", not "Floating").
100% true. It's probably pretty obvious from my posts that I come from an imaging/DSP background, and people like me often avoid FP like the plague (though unlike you, we like our VLIWs :-). For many tasks 16.16 or 1.31 fixed-point math is more than adequate (especially if the core supports saturating integer ops via intrinsics), and it's usually faster. The only major exception is if I'm working with a GPU.
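For the curious, a minimal sketch of the kind of 16.16 fixed-point helpers I mean (my own illustration; real DSP code would lean on the core's saturating intrinsics rather than the portable clamp shown here):

#include <stdint.h>

typedef int32_t q16_16;                 /* 16 integer bits, 16 fraction bits */
#define Q16_ONE (1 << 16)

static inline q16_16 q16_from_float(float f) { return (q16_16)(f * Q16_ONE); }
static inline float  q16_to_float(q16_16 x)  { return (float)x / Q16_ONE; }

static inline q16_16 q16_mul(q16_16 a, q16_16 b)
{
    /* widen to 64 bits, then shift back down to keep 16 fraction bits */
    return (q16_16)(((int64_t)a * b) >> 16);
}

static inline q16_16 q16_add_sat(q16_16 a, q16_16 b)
{
    /* saturating add: clamp instead of wrapping on overflow */
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (q16_16)s;
}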