The Landscape of Parallelism Research
In recent years, the scalability problems of uniprocessors have become well known. Though Moore’s Law will continue unabated for at least one more decade, transmuting this abundance of transistors into single-threaded performance has become increasingly difficult due to physical constraints such as power and interconnect scaling. Furthermore, the amount of instruction-level parallelism (ILP) in typical code is pitifully small compared to the peak processing prowess of modern high-performance processors.
At each new process node, the current hardware focus is on spending the extra transistors to replicate existing cores or to integrate more features. At the same time, graphics processing unit (GPU) hardware is increasingly designed to appeal to more general-purpose workloads, such as high-performance computing (HPC) applications. However, these new cores (whether CPU or GPU) and extra threads require parallel software to be fully utilized. Recognizing this trend, researchers in industry and academia have devoted increasing time and resources to new methods and algorithms for parallelizing existing software, novel concurrent programming models, better tools for working with and debugging parallel software, and parallelism education and awareness in the computing industry.
In 2006, a group of researchers led by the eminent Dr. Dave Patterson of UC Berkeley issued a position paper on goals and recommendations for exploring the future of parallelism. Several questions were posited regarding potential parallel applications, architectures, programming models, and metrics. One notable outgrowth of the paper was the advocacy of 13 “Dwarfs” (classes of problems that capture important parallel workloads expected to play a key role in future applications) as new benchmarks for evaluating success in this endeavor.
In recent years this view has been refined to focus instead on composable parallel patterns as primitives for building parallel software. Nonetheless, many groups have approached the parallelism problem from different perspectives, and a new forum was advocated for sharing results and advancing research. Since 2009, Dr. Patterson and others from industry and academia have organized the annual USENIX Workshop on Hot Topics in Parallelism, or HotPar. This year’s HotPar workshop was held on the campus of UC Berkeley over a couple of days in mid-June.
This article attempts to summarize the results from a few of the presented works and to convey the general milieu surrounding the topic of parallelism. It is in no way a comprehensive account of the papers and presentations given at the workshop; interested readers are advised to peruse the proceedings of HotPar 2010 for detailed results.
Parallelism is manifested in many forms. Server systems accepting many requests from clients are inherently concurrent, as the different requests are processed independently. Embarrassingly parallel (EP) problems are another class, with abundant scaling potential and little to no communication or synchronization between parallel, data-intensive tasks.
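The EP case can be sketched with an ordinary worker pool. Below is a minimal Python sketch, where the hypothetical `score` function stands in for any expensive, independent computation; since the tasks share no state, the pool simply partitions the work with no synchronization at all:

```python
# Sketch of an embarrassingly parallel workload: every task is independent,
# so no communication or synchronization is needed between workers.
# `score` is illustrative, not taken from any real application.
from concurrent.futures import ProcessPoolExecutor

def score(x: int) -> int:
    # Stand-in for an expensive, independent, data-intensive computation.
    return x * x

def parallel_scores(data, workers=4):
    # Each input is processed on its own; the pool partitions the work
    # across processes with no shared state between tasks.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, data))
```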
The difficult problem in parallelism is finding new and relatively straightforward ways to parallelize parts of previously serial applications and algorithms. To scale, though, a parallel algorithm must enhance locality to improve caching and minimize the burden on the memory hierarchy. It must also ensure correctness while reducing synchronization, so that each parallel processing element can make forward progress rather than falling victim to Amdahl’s Law. Otherwise, the parallel approach may yield no gains over a serial one. The main focus of parallelism research is on client applications for desktops and notebooks, as well as on novel applications that do not yet exist.
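The Amdahl’s Law ceiling mentioned above is easy to quantify: if a fraction p of the work parallelizes perfectly across n processors while the rest stays serial, the overall speedup is 1 / ((1 − p) + p/n). A small illustration:

```python
def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    # Amdahl's Law: overall speedup on n processors when only
    # `parallel_fraction` of the work can be parallelized; the serial
    # remainder (1 - parallel_fraction) bounds the achievable gain.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# Even with 95% of the work parallelized, speedup saturates well below n:
# amdahl_speedup(0.95, 1024) is under 20x, and no processor count can
# push it past 1 / (1 - 0.95) = 20x.
```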
The world of parallel programming is complex and varied. In thinking about parallel programming, it is helpful to categorize the different dimensions and aspects. One essential dimension is the granularity of data synchronization: in fine-grained synchronization, each individual data item is synchronized, while coarser-grained approaches work with many data items at once, for example by synchronizing entire streams or data flows in bulk. Another dimension is whether the problem is regular, as in traditional HPC with matrices and vectors, or irregular, with more complex structures such as graphs, trees, or other structures relying on pointers and indirection. Yet another dimension is the complexity of data sharing: simple, with mostly private data per thread or core; moderate, where sharing is more extensive and changes are perhaps accumulated in a log-like fashion; or complex, where state is centralized and changes occur through mutation of existing data and structures.
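The granularity dimension can be made concrete with a small, hypothetical counter table — a sketch, not any particular system’s design. The fine-grained variant takes a lock per key, so updates to different keys do not contend; the coarse-grained variant takes one lock over a whole batch, which is simpler and amortizes locking cost but serializes all updates:

```python
import threading
from collections import defaultdict

class FineGrainedCounters:
    # Fine-grained: one lock per data item (key), so updates to
    # different keys never contend with each other.
    def __init__(self):
        self.counts = defaultdict(int)
        self.locks = defaultdict(threading.Lock)
        self._meta = threading.Lock()  # protects lock creation itself

    def increment(self, key):
        with self._meta:
            lock = self.locks[key]
        with lock:
            self.counts[key] += 1

class CoarseGrainedCounters:
    # Coarse-grained: one lock guards the whole table; a bulk update
    # synchronizes many items at once instead of each item individually.
    def __init__(self):
        self.counts = defaultdict(int)
        self.lock = threading.Lock()

    def increment_many(self, keys):
        with self.lock:
            for key in keys:
                self.counts[key] += 1
```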
One further dimension to consider is not algorithmic but instead tied to the implementation and programming model: in particular, whether threads are explicitly exposed to programmers or implicitly enabled through runtime support. This is something of a red herring, as the salient feature is preemptive multithreading, whereby the programmer has to reason about the order of thread interleavings. This is the source of many hard-to-reproduce concurrency bugs and race conditions. Programming models with cooperative scheduling would reduce or eliminate many of these errors. Still, it is important to distinguish threads as a design point in the spectrum.
From a more practical software development perspective, one school of thought is to bifurcate the concurrency problem into two separate pieces. The first piece would be a set of sophisticated parallel library implementations, highly optimized for parallel algorithms and for the underlying hardware they execute upon. The second would be a set of relatively high-level library clients that may be concurrency-aware at the algorithmic level but are not primarily concerned with the details. This sort of specialization will ultimately be necessary, given that not all developers can effectively handle concurrency. The fundamental question is how to divide responsibilities between the two types of code and the two types of programmers. Should there be a total abstraction between the two, where oblivious scripting code calls into hand-optimized lock-free libraries while providing little information about the greater context? Or should the two specialized classes of code and programmers work together in a more tightly coupled interaction where information can be passed between the layers? One benefit of the former approach is that parallelism is instantly made available to millions of developers, while the latter approach encourages adept client developers to examine and perhaps modify the libraries they call, and is likely to yield larger gains and fewer nasty pathological corner cases.
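One way to picture the split is a hypothetical two-layer sketch (all names here are illustrative): the library layer owns every concurrency detail behind a simple `parallel_map` interface, and the client layer expresses only the algorithm. The optional hint parameter gestures at the tighter coupling discussed above, where a client that understands its workload can pass information down without ever touching threads itself:

```python
from concurrent.futures import ThreadPoolExecutor

# --- Library layer: written by concurrency specialists, owns all the
# --- threading, scheduling, and tuning details.
def parallel_map(fn, data, *, hint_chunksize=None):
    # An oblivious client can ignore hint_chunksize entirely; a more
    # tightly coupled client can pass workload knowledge down through it.
    # (chunksize is forwarded to Executor.map; it mainly matters for
    # process pools, and is illustrative of hint-passing here.)
    chunk = hint_chunksize or max(1, len(data) // 8)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fn, data, chunksize=chunk))

# --- Client layer: concurrency-aware at the algorithmic level only.
def normalize(values):
    top = max(values)
    return parallel_map(lambda v: v / top, values)
```

The design choice is visible in the signature: a total abstraction would drop the hint entirely, while a coupled design might grow richer annotations about locality or sharing.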