Article: Parallelism at HotPar 2010
By: Steve Underwood (steveu.delete@this.coppice.org), August 18, 2010 3:03 pm
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 8/18/10 wrote:
---------------------------
>Did NVidia disclose their ISA and provide the tools required for hand optimization? Methinks not.
>On the other hand, hand-optimization for x86 SIMD is very well supported by plenty
>of tools (compiler intrinsics, debuggers, profilers). In short, hand-optimizing
>below standard C/Fortran on Nehalem is practical and done in practice by tens of
>thousands of devs. Hand-optimizing below CUDA on NVidia is not practical and likely
>not even possible for non-NVidia devs.
Although you *can* play with Nehalem code to your heart's content, Intel make SIMD optimisation unnecessarily hard to pick up. The documents need a *lot* of reading to extract the whole picture, and Intel publish very little example code. Even when they do publish code, its performance usually isn't that impressive compared to simple scalar code. I'd love to see some published examples that simply illustrate calculation sequences which keep every computational slot busy on the various cores. That would give beginners a good starting point for what is and is not achievable.
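Something along these lines is what I have in mind - just my own quick sketch of an unrolled SAXPY-style inner loop using SSE intrinsics, not anything from Intel's documents, and the alignment and size assumptions are mine:

#include <xmmintrin.h>  /* SSE intrinsics */

/* y[i] += a * x[i]; assumes n is a multiple of 8 and both pointers are 16-byte aligned */
void saxpy_sse(float a, const float *x, float *y, int n)
{
    __m128 va = _mm_set1_ps(a);
    for (int i = 0;  i < n;  i += 8)
    {
        /* Two independent 4-wide chains, so the load and FP slots stay fed */
        __m128 y0 = _mm_add_ps(_mm_load_ps(&y[i]),     _mm_mul_ps(va, _mm_load_ps(&x[i])));
        __m128 y1 = _mm_add_ps(_mm_load_ps(&y[i + 4]), _mm_mul_ps(va, _mm_load_ps(&x[i + 4])));
        _mm_store_ps(&y[i],     y0);
        _mm_store_ps(&y[i + 4], y1);
    }
}

The two separate chains are only there to show the kind of instruction level parallelism the hardware wants. What I can't easily find is Intel spelling out how many of those chains each core can actually sustain.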
The huge slowdown that SSSE3 code suffers on the i5 and i7 compared to Core 2 is frustrating. It took people a lot of time to get the best out of hand shuffling things with SSSE3, and the next generation of cores made that complexity something which needs to be ripped out of the code. AAAHHHHH!
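To be concrete, this is the sort of hand shuffling I mean - a 16 byte reverse done with a single pshufb, again just my own illustration:

#include <tmmintrin.h>  /* SSSE3 intrinsics */

/* Reverse the 16 bytes of a vector with one pshufb - the kind of trick
   that was worth hand coding on Core 2. */
static __m128i reverse_bytes(__m128i v)
{
    const __m128i idx = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                     8, 9, 10, 11, 12, 13, 14, 15);
    return _mm_shuffle_epi8(v, idx);
}

Code built around tricks like that is exactly what now has to be unpicked for the i5/i7.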
Steve