Article: Parallelism at HotPar 2010
By: hobold (hobold.delete@this.vectorizer.org), August 26, 2010 4:37 am
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 8/23/10 wrote:
---------------------------
[...]
>There are good reasons to prefer single-threaded SIMD over MT scalar. First, the
>change is more "local", other parts of the application less influenced. Second,
>until Nehalem you often could gain more by SIMD than by threading. Third, fewer
>performance surprises due to effects of cache layout.
>Fourth, sometimes other parts of your application could effectively utilize the
>remaining cores. And finally, fifth, not everybody shares the mentality of "grab
>as many computing resources as you can and other applications (or users on a time sharing machine) can go to hell."
>
My favourite reason to sometimes prefer SIMD over concurrent threads is that all vector lanes are processed in lockstep (or at least pretend to be). That lockstep can be exploited for fine-grained parallelism even in the presence of rather strong data dependencies.
As an example, think of a two-dimensional image filter where each pixel's result depends on the results of its left and top neighbours (except at the border, as usual). Any scalar implementation would probably just iterate over rows or columns, with a loop-carried dependency. (Distance transforms might be the most relevant instance of this pattern.)
The "axis of parallelism" is diagonal, and a SIMD implementation can exploit that: all pixels on one anti-diagonal depend only on the previous anti-diagonal, so data dependencies are handled by the sequential order in which individual vectors are processed. A multithreaded implementation would need explicit synchronization to ensure that all required neighbours are available; you can't simply start processing row 2 once the row-1 thread has gotten enough of a head start.
(Yes, you can do SIMD and threading simultaneously. But then the threading would not be fine-grained; it would operate at the level of larger tiles.)
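A minimal sketch of the diagonal sweep, written in NumPy so each anti-diagonal becomes one vector operation. The specific recurrence (a min-plus rule resembling a distance transform) and the function name are illustrative assumptions, not taken from the original post:

```python
import numpy as np

def wavefront_min_filter(img):
    """Illustrative recurrence with left/top dependencies:
        out[i, j] = img[i, j] + min(out[i-1, j], out[i, j-1])
    All pixels on one anti-diagonal i + j = d depend only on the
    previous diagonal, so each diagonal is a vector of independent
    lanes that can be processed in lockstep."""
    h, w = img.shape
    out = np.empty((h, w), dtype=np.float64)
    # Border pixels have only one neighbour, so the recurrence
    # degenerates to a running sum along the first row and column.
    out[0, :] = np.cumsum(img[0, :])
    out[:, 0] = np.cumsum(img[:, 0])
    # Interior: one vectorized update per anti-diagonal d = i + j.
    for d in range(2, h + w - 1):
        i = np.arange(max(1, d - w + 1), min(h - 1, d - 1) + 1)
        j = d - i
        out[i, j] = img[i, j] + np.minimum(out[i - 1, j], out[i, j - 1])
    return out
```

The outer loop is still sequential (one iteration per diagonal), which is exactly the point: the ordering of the vector operations enforces the dependency, with no explicit synchronization needed within a diagonal.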