By: Patrick Chase (patrickjchase.delete@this.gmail.com), July 2, 2013 11:56 am
Room: Moderated Discussions
⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on July 2, 2013 11:38 am wrote:
> Well, isn't it true that it is possible to utilize FP registers and FP logic in
> integer workloads? This would imply that basic FPU operations such as addition,
> subtraction, multiplication and comparison need to run about as fast as in the
> integer ALU. Moves between FP registers and [INT registers or memory] need to be
> fast as well.
It's possible, but you won't see it done much for two reasons:
1. FP/integer conversions are fairly expensive on modern CPU cores even when they don't cross "domains". For example, cvtsi2ss costs 3-5 uops on many x86 microarchitectures even though it's entirely "contained" within the SSE domain.
2. The forwarding networks in PRF-based machines often do not cross domains (FP/SSE <-> integer, etc.) for reasons I gave in an earlier post today [*]. This means that domain crossings add a fair bit of latency and make it more difficult to keep the entire core busy.
Where I have seen this trick used profitably (and have used it myself) is in GPUs. Those are different in 3 important ways:
1. The texture units are FP-only, and can be used to accelerate/offload certain common tasks (basically any transform that can be acceptably approximated by a <=3D linear interpolation within a regular sparse mesh).
2. FP <-> integer conversions typically require only 1 operation.
3. The heavily multithreaded nature of a GPU means that you don't have to worry about latency as you would on a CPU, since the core can simply switch threads.
Taken together, these mean that it's often a win if you can accelerate part of your nominally integer application in FP via the texture unit, and then convert to integer and finish it on the shader core.
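To make the pattern concrete, here is a rough CPU-side sketch in Python of the int -> FP -> "texture fetch" -> int flow. It is not GPU code and only emulates what a 1D linearly filtered fetch computes. Everything in it is an assumption for illustration: the names (build_lut, fetch_linear, apply_curve_u8), the 17-entry table, the t**2.2 curve, and the simplified addressing that puts samples exactly at the ends of [0, 1] (real texture hardware uses texel-center conventions and its own filtering precision).

```python
def build_lut(func, size):
    """Sample func at `size` evenly spaced points over [0, 1] (the sparse regular mesh)."""
    return [func(i / (size - 1)) for i in range(size)]


def fetch_linear(lut, u):
    """Emulate a 1D linearly filtered fetch at normalized coordinate u.

    Hypothetical helper: it mimics clamp-to-edge addressing plus linear
    interpolation between the two nearest samples.
    """
    u = min(max(u, 0.0), 1.0)            # clamp-to-edge
    pos = u * (len(lut) - 1)             # position in sample units
    i0 = min(int(pos), len(lut) - 2)     # left neighbour, kept in range
    frac = pos - i0                      # interpolation weight in [0, 1]
    return lut[i0] * (1.0 - frac) + lut[i0 + 1] * frac


def apply_curve_u8(x, lut):
    """8-bit integer in, 8-bit integer out; the transform itself runs in FP."""
    y = fetch_linear(lut, x / 255.0)     # int -> FP, then the filtered lookup
    return int(y * 255.0 + 0.5)          # FP -> int, round to nearest


# Assumed example transform: a smooth power curve sampled at only 17 points.
CURVE_LUT = build_lut(lambda t: t ** 2.2, 17)
```

With this particular curve, apply_curve_u8(x, CURVE_LUT) should stay within one code value of the directly computed result for every 8-bit x, which is the sense of "acceptably approximated" above. A steeper curve would need a denser table or a different parameterization.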
-- Patrick
[*] Briefly, if you do an all-to-all forwarding network in a PRF-based machine then it ends up costing about as much power/area as the common result bus in an RS-based machine. Microarchitects avoid this by forcing "unlikely" dependency paths to go through the register file.