Structured Parallel Programming
The talk by Dr. Michael McCool of Intel focused on structured parallel patterns applied in combinations to perform useful computation. His presentation delved nicely with the current thinking among many Berkeley researchers. Dr. McCool previously co-founded RapidMind with one of his graduate students while at the University of Waterloo. His startup focused on providing a high level programming platform and constructs for expressing parallel compute intensive algorithms on GPUs and multicore SIMD enabled CPUs. RapidMind was acquired by Intel in 2009 and folded into a similar effort: The Ct data parallel programming environment.
Dr. McCool presented a catalog of deterministic parallel patterns, defined as commonly occurring combinations of task distribution and data access. A small number of such parallel primitives can support most interesting algorithms implemented at a high level. His thesis is that a system that supports determinism and application oriented patterns results in better usability (e.g. debugging, programmer visibility) and higher productivity. These patterns are derived from common use cases, which should ensure universal applicability. By using such methods high level reasoning is encouraged which allows developers to focus on what matters: the parallelism and data locality.
An analogy was drawn with the arguments over structured control flow in the 60’s and 70’s. Some people had been advocating the continued use of goto as it allows certain control flow scenarios that are rather difficult to implement with higher level constructs such as if/for/while. Others such as Wirth eschewed gotos as they rendered control flow through code into a “spaghetti” mess, and advocated other measures for expressive “power”. Two categories of patterns were outlined: computational and data access patterns.
Figure 4 – Map operator
As shown above in Figure 4, the familiar map pattern (popularized by Google in the MapReduce framework) applies a function to every element of an input container producing an output container with the resulting elements. There is no sequence typically specified with map, and if the given function is side effect free then the pattern expresses a nearly unlimited amount of deterministic computation. A map implementation can benefit from spatial locality, parallelization and vectorization. A reduction pattern, shown below in Figure 5, as the name implies, reduces the elements of a container to a single datum through an associative function. A category reduction applies a tag to elements whereby reduction occurs for all elements of a given tag. This particular variant prefixed with the map pattern is the core of the MapReduce framework. To give a concrete example, calculating the sum of squares for a data set is easily expressed as a map and reduction. First, every element in a data set is mapped to its square, and then all the squared numbers are reduced by summation to a single result.
Figure 5 – Reduce operator
A superscalar sequence pattern represents a task graph of executing operations with data and control dependencies. The control dependencies may be broken through speculative execution, allowing potentially independent tasks to execute in parallel. A pipeline pattern is a set of communicating simultaneously active tasks at different stages of an algorithm arranged in a producer/consumer fashion. This pattern permits locality optimization within stages which is why GPUs and CELL featuring local addressable memories are especially suited for such computation. Many of these patterns are familiar to processor design and architecture. The nesting pattern, present in the Cilk programming model, can allow for arbitrary recursively nested computation in a lattice like fashion, with good locality properties.
Figure 6 – Superscalar sequence
Data access patterns include the gather/scatter patterns. A gather reads elements from disjoint locations in memory and arranges them in a linear fashion, suitable for data processing. A scatter writes elements in a container to disjoint locations in memory. Scatter patterns can be problematic since they are not deterministic if multiple writes target the same memory location. A priority scatter uses a set of rules rule to implement tie-breaking between conflicting writes to ensure determinism and faithfulness to a serial implementation. This latter property is essential as it permits the developer to reason about the code using straightforward serial semantics. The pack pattern (usually combined with the map pattern) transforms sparse containers into dense ones, enhancing locality.
Subdivision patterns divide data into smaller sets that can be operated on in parallel. The partition operator divides a container of data into equally sized non-overlapping sets of containers. A variant of partitioning is segmentation, which is identical except that the sets of containers are not required to be equally sized.
Finally in the stencil pattern, neighborhoods of an element are accessed (exhibiting spatial locality) and operated on in a regular fashion. There are many cases that have to be optimized or given special consideration. Accessing elements past the end of the indices have to be dealt with (by clamping or wrapping); and computations accessing neighborhoods in certain directions can be implemented more efficiently. Although the patterns are deterministic to ensure a repeatable composition, if non-determinism can be isolated to a specific part of the computation with the overall result made deterministic, then the program can still be repeatable.