
WORKSHOP PROGRAM ABSTRACTS

Tech Sessions: Thursday, May 26 | Friday, May 27
Thursday, May 26, 2011
8:30 a.m.–10:00 a.m.

Considerations When Evaluating Microprocessor Platforms
Motivated by recent papers comparing CPU and GPU performance, this paper explores two questions: why do we compare microprocessors, and by what means should we compare them? We distinguish two perspectives from which to make comparisons: that of application developers and that of computer architecture researchers. We survey the concerns of these groups, identifying the essential information each expects when interpreting comparisons. We believe the needs of both groups should be addressed separately, as the goals of application developers are quite different from those of computer architects. Reproducibility of results is widely acknowledged as the foundation of scientific investigation. Accordingly, it is imperative that platform comparisons supply enough detail for others to reproduce and contextualize results. As parallel processing continues to grow in importance, and parallel microprocessor architectures continue to proliferate, the importance of conducting and publishing reproducible microprocessor platform comparisons will also increase. We seek to add our voice to the discussion of how these comparisons should be conducted.

RADBench: A Concurrency Bug Benchmark Suite
Testing and debugging tools for concurrent programs are often validated on known bugs. To aid the development of these tools, we present the Race, Atomicity, and Deadlock Benchmark (RADBench) suite. The RADBench suite contains the full source of 10 real concurrency bugs found in large open-source software projects, including Mozilla SpiderMonkey, Mozilla NSPR, Memcached, Apache Web Server, and Google Chromium Web Browser. We discuss the difficulties we encountered in reproducing these bugs, which must be accounted for when building testing and debugging tools. Finally, we propose an approach to reproducibility that has a number of benefits over standard deterministic replay for debugging. RADBench is open source and publicly available.

10:30 a.m.–noon

How to Miscompile Programs with "Benign" Data Races
Several prior research contributions [15, 9] have explored the problem of distinguishing "benign" from harmful data races to make it easier for programmers to focus on a subset of the output from a data race detector. Here we argue that, although such a distinction makes sense at the machine-code level, it does not make sense at the C or C++ source level. In one sense, this is obvious: the upcoming thread specifications for both languages [6, 7] treat all data races as errors, as does the current POSIX threads specification, and experience has shown that it is difficult or impossible to specify anything else [1, 2]. Nonetheless, many programmers clearly believe, along with [15], that certain kinds of data races can be safely ignored in practice because they will produce expected results with all reasonable implementations. Here we show that all kinds of C or C++ source-level "benign" races discussed in the literature can in fact lead to incorrect execution, either as a result of perfectly reasonable compiler transformations or when the program is moved to a different hardware platform. Thus there is no reason to believe that a currently working program with "benign races" will continue to work when it is recompiled. Perhaps most surprisingly, this includes even the case of potentially concurrent writes of the same value by different threads.
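
As a concrete illustration (our sketch, not an example taken from the paper), consider the common "benign" idiom of spinning on an unsynchronized flag. Because the program is racy, a conforming C compiler may legally hoist the load out of the loop:

```c
/* Minimal sketch, assuming a plain (non-atomic, non-volatile) flag
   shared between threads; names are illustrative. */
int done = 0;            /* set to 1 by another thread */

void wait_for_done(void) {
    /* Since the compiler may assume the program is race-free, it may
       load `done` once and reuse the value, effectively rewriting the
       loop as:  if (!done) for (;;);
       A "benign" race thus becomes an infinite loop on recompilation. */
    while (!done)
        ;  /* spin */
}
```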

Deterministic OpenMP for Race-Free Parallelism
Recent deterministic parallel programming models show promise for their ability to replay computations and reproduce bugs, but they currently require the programmer to adopt restrictive or unfamiliar parallel constructs. Deterministic OpenMP (DOMP) is a new deterministic parallel environment built on the familiar OpenMP framework. By leveraging OpenMP's block-structured synchronization annotations, which are largely compatible with the constraints of a deterministic model, DOMP eases the parallelization of legacy serial code and preserves substantial compatibility with OpenMP software. A few OpenMP constructs, however, such as atomic and critical, are semantically nondeterministic and unsupported in DOMP. In three well-known parallel benchmark suites, we find that most (81%) uses of these nondeterministic constructs express programming idioms that are compatible with determinism but not directly expressible in OpenMP. DOMP therefore adds new OpenMP constructs to express such idioms deterministically, supporting pipelines and generalized reductions.
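
To make the distinction concrete, here is a sketch in standard OpenMP (not DOMP itself) of the same accumulation idiom written two ways: with critical, whose update interleaving is nondeterministic, and as a reduction, the kind of idiom DOMP supports (a deterministic runtime can additionally fix the combination order):

```c
/* Nondeterministic form: updates to `sum` occur in scheduling order. */
double sum_critical(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical      /* semantically nondeterministic */
        sum += a[i];
    }
    return sum;
}

/* The same idiom expressed as a reduction, with no critical section. */
double sum_reduction(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```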

1:30 p.m.–3:00 p.m.

Parkour: Parallel Speedup Estimates for Serial Programs
We present Parkour, a tool that creates parallel speedup estimates for unparallelized serial programs. Unlike previous approaches, it requires no prior human analysis or modification of the program. Parkour automatically quantifies the parallelism of a given program and provides an approximate upper bound on performance, modeling fundamental parallelization constraints. For the evaluation, Parkour is applied to three benchmarks from the NAS Parallel Benchmark suite running on a 32-core AMD multicore system, and three benchmarks running on the fine-grained MIT Raw processor. The results are compelling: Parkour significantly improves the accuracy of parallel speedup estimates relative to the critical path analysis technique that it extends.
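
For background, critical path analysis bounds achievable speedup using only total work and critical path length; in standard work-span notation, with $T_1$ the serial work and $T_\infty$ the critical path, the speedup on $p$ cores satisfies

\[ S(p) \;\le\; \min\!\left(p,\; \frac{T_1}{T_\infty}\right). \]

Parkour, per the abstract, tightens such estimates by additionally modeling fundamental parallelization constraints.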

Enabling Multiple Accelerator Acceleration for Java/OpenMP
While using a single GPU is fairly easy, using multiple CPUs and GPUs potentially distributed over multiple machines is hard: data must be kept consistent through message exchange, and the load must be balanced. We propose (1) an array package that provides partitioned and replicated arrays and (2) a compute-device library that abstracts from GPUs and CPUs and their location. Our system automatically distributes a parallel-for loop in data-parallel fashion over all the devices. This paper makes three contributions. First, we provide transparent use of multiple distributed GPUs and CPUs from within Java/OpenMP. Second, we partition arrays according to the compute devices' relative performance, computed from the execution time of a small microbenchmark and a series of small bandwidth tests run at program start. Third, we repartition the arrays dynamically at run time by increasing or decreasing the number of machines used and by switching from CPUs-only to GPUs-only, to combinations of CPUs and GPUs, and back. With our dynamic device switching, we minimize communication while maximizing device use. Our system automatically finds the optimal device sets and achieves a speedup of 5–200 on a cluster of 8 machines with 2 GPUs each.
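
A minimal sketch (our illustration, not the paper's API) of the performance-proportional partitioning step the abstract describes:

```c
#include <stddef.h>

/* Split n array elements across ndev devices in proportion to each
   device's measured microbenchmark score perf[i]; the rounding
   remainder goes to the last device so the counts sum to n. */
void partition(size_t n, const double *perf, size_t ndev, size_t *count) {
    double total = 0.0;
    for (size_t i = 0; i < ndev; i++)
        total += perf[i];

    size_t assigned = 0;
    for (size_t i = 0; i < ndev; i++) {
        count[i] = (size_t)(n * (perf[i] / total));
        assigned += count[i];
    }
    count[ndev - 1] += n - assigned;
}
```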

3:30 p.m.–5:00 p.m.

CUDA-level Performance with Python-level Productivity for Gaussian Mixture Model Applications
Typically, scientists with computational needs prefer to use high-level languages such as Python or MATLAB; however, large computationally intensive problems must eventually be recoded in a low-level language such as C or Fortran by expert programmers in order to achieve sufficient performance. In addition, multiple strategies may exist for mapping a problem onto parallel hardware, depending on the input data size and the hardware parameters. We show how to preserve the productivity of high-level languages while obtaining the performance of the best low-level code variant for a given hardware platform and problem size using SEJITS, a set of techniques that leverages just-in-time code generation and compilation. As a case study, we demonstrate our technique for Gaussian Mixture Model training using the EM algorithm. With the addition of one line of code to import our framework, a domain programmer using an existing Python GMM library can run her program unmodified on a GPU-equipped computer and achieve performance that meets or beats GPU code hand-crafted by a human expert. We also show that, despite the overhead of allowing the domain expert's program to use Python and the overhead of just-in-time code generation and compilation, our approach still yields performance competitive with hand-crafted GPU code.

Pervasive Parallelism for Managed Runtimes
Modern programming languages like C# or Java are executed in a managed runtime and offer support for concurrency at a high level of abstraction. However, high-level parallel abstractions (e.g., thread pools) can only be provided as a library, since the underlying runtime (including the dynamic compiler) is aware of only a small set of low-level parallel abstractions. In this paper we discuss alternative abstractions of concurrency in the source language, the runtime, and the dynamic compiler. These abstractions enable a dynamic optimizing compiler to perform new code transformations that adapt the granularity of parallel tasks to the available system resources. The presented optimizations allow the runtime to tune the execution of parallel code fully automatically.

Friday, May 27, 2011
8:30 a.m.–10:00 a.m.

Balance Principles for Algorithm-Architecture Co-Design
We consider the problem of "co-design," by which we mean the problem of how to design computational algorithms for particular hardware architectures and vice versa. Our position is that balance principles should drive the co-design process. A balance principle is a theoretical constraint equation that explicitly relates algorithm parameters to hardware parameters according to some figure of merit, such as speed, power, or cost. This notion originates in the work of Kung (1986); Callahan, Cocke, and Kennedy (1988); and McCalpin (1995); however, we reinterpret these classical notions of balance in a modern context of parallel and I/O-efficient algorithm design, as well as trends in emerging architectures. From such a principle, we argue that one can better understand algorithm and hardware trends, and furthermore gain insight into how to improve both algorithms and hardware. For example, we suggest that although matrix multiply is currently compute-bound, it will in fact become memory-bound in as few as ten years, even if last-level caches grow at their current rates. Our overall aim is to suggest how to co-design rigorously and quantitatively while still yielding intuition and insight.
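
To illustrate with our own notation (not necessarily the paper's): a kernel performing $W$ flops and $Q$ words of memory traffic, on a machine with peak compute rate $C$ flops/s and memory bandwidth $\beta$ words/s, remains compute-bound while

\[ \frac{W}{Q} \;\ge\; \frac{C}{\beta}. \]

For $n \times n$ matrix multiply with a fast memory of size $M$, $W = 2n^3$ while $Q = \Theta(n^3/\sqrt{M})$ (Hong and Kung), so the arithmetic intensity $W/Q$ grows only as $\sqrt{M}$; if peak compute outpaces $\beta\sqrt{M}$, the balance condition eventually fails, which is the mechanism behind the prediction that matrix multiply becomes memory-bound.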

Crunching Large Graphs with Commodity Processors
Crunching large graphs is the basis of many emerging applications, such as social network analysis and bioinformatics. Graph analytics algorithms exhibit little locality and therefore present significant performance challenges. Hardware multithreading systems (e.g., the Cray XMT) show that with enough concurrency, we can tolerate long latencies. Unfortunately, this solution is not available with commodity parts. Our goal is to develop a latency-tolerant system built out of commodity parts and mostly in software. The proposed system includes a runtime that supports a large number of lightweight contexts and full/empty-bit synchronization, and a memory manager that provides a high-latency but high-bandwidth global shared memory. This paper lays out the vision for our system and justifies its feasibility with a performance analysis of the runtime's latency tolerance.
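
The "enough concurrency" argument is essentially Little's law: to sustain a reference rate $\lambda$ against latency $L$, a node must keep

\[ N = \lambda L \]

requests in flight. With illustrative numbers (ours, not the paper's), sustaining $10^8$ remote references per second at $10\,\mu\mathrm{s}$ latency requires on the order of a thousand outstanding requests per node, hence the emphasis on supporting a large number of lightweight contexts.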

10:30 a.m.–noon

Multicore Performance Optimization Using Partner Cores
As the push for parallelism continues to increase the number of cores on a chip, system design has become incredibly complex; optimizing for performance and power efficiency is now nearly impossible for the application programmer. To assist the programmer, a variety of techniques for optimizing performance and power at runtime have been developed, but many employ speculative threads or performance counters. These approaches result in stolen cycles or the use of an extra core, and such expensive penalties can greatly reduce the potential gains. At the same time that general-purpose processors have grown larger and more complex, technologies for smaller embedded processors have pushed towards energy efficiency. In this paper, we combine the two and introduce the concept of Partner Cores: low-area, low-power cores paired with larger, faster compute cores. A partner core is tightly coupled to each main processing core, allowing it to perform various optimizations and functions that are impossible on a traditional chip multiprocessor. This paper demonstrates that optimization code running on a partner core can increase performance and provide a net improvement in power efficiency.

Parallel Pattern Detection for Architectural Improvements
With the shift in general-purpose computing to increasingly parallel architectures comes a need for clever architectures that achieve high parallelism on previously sequential or poorly parallelized code. To fully utilize the many-core systems of the present and future, architecture design philosophy must shift toward understanding how the parallel programming process affects design decisions. Parallel patterns provide a way to create parallel code for a wide variety of algorithms. They also provide a convenient classification mechanism that is understandable to programmers and whose classes exhibit similar behaviors that can be architecturally exploited. In this work we explore the capabilities of pattern-driven dynamic architectures, as well as detection mechanisms useful for dynamic and static parallel pattern recognition.

1:30 p.m.–3:00 p.m.

Parallel Programming of General-Purpose Programs Using Task-Based Programming Models
The prevalence of multicore processors is bound to drive most kinds of software development towards parallel programming. To limit the difficulty and overhead of parallel software design and maintenance, it is crucial that parallel programming models allow an easy-to-understand, concise, and dense representation of parallelism. Parallel programming models such as Cilk++ and Intel TBB attempt to offer a better, higher-level abstraction for parallel programming than threads and locking synchronization. It is not straightforward, however, to express all patterns of parallelism in these models. Pipelines, in particular, are an important parallel construct that is difficult to express in Cilk and TBB without a verbose restructuring of the code. In this paper we demonstrate that pipeline parallelism can be easily and concisely expressed in a Cilk-like language, which we extend with input, output, and input/output dependency types on procedure arguments, enforced at runtime by the scheduler. We evaluate our implementation on real applications and show that our Cilk-like scheduler, extended to track and enforce these dependencies, has performance comparable to Cilk++.
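
For flavor, here is a comparable idiom sketched with standard OpenMP task dependences rather than the paper's Cilk-like language (our illustration; stage bodies are placeholders). The depend clauses play the role of the input/output/input-output argument types, letting the scheduler overlap stages of different iterations:

```c
#include <stdio.h>
#define N 8

static void produce(double *b)   { *b = 1.0; }           /* stage 1 */
static void transform(double *b) { *b += 1.0; }          /* stage 2 */
static void consume(double *b)   { printf("%g\n", *b); } /* stage 3 */

int main(void) {
    double buf[N];                 /* one block per iteration */
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < N; i++) {
        #pragma omp task depend(out: buf[i])
        produce(&buf[i]);
        #pragma omp task depend(inout: buf[i])
        transform(&buf[i]);
        #pragma omp task depend(in: buf[i])
        consume(&buf[i]);
    }
    return 0;
}
```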

Parallel Programming with Inductive Synthesis
We show that program synthesis can generate GPU algorithms as well as their optimized implementations. Using the scan kernel as a case study, we describe our evolving synthesis techniques. Relying on our synthesizer, we can parallelize a serial problem by transforming it into a scan operation, synthesize a SIMD scan algorithm, and optimize it to reduce memory conflicts.
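
For reference, the serial specification of the scan (prefix-sum) kernel the abstract takes as its case study; any synthesized SIMD implementation must match this behavior (illustrative code, not the authors'):

```c
/* Exclusive prefix sum: out[i] = in[0] + ... + in[i-1]. */
void exclusive_scan(const int *in, int *out, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) {
        out[i] = acc;
        acc += in[i];
    }
}
```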

3:30 p.m.–5:00 p.m.

A Relativistic Enhancement to Software Transactional Memory
Relativistic programming is a technique that allows low-overhead, linearly scalable concurrent reads. It also allows joint-access parallelism between readers and a writer. Unfortunately, it has so far been limited to a single writer, so it does not scale on the write side. Software Transactional Memory (STM) is a technique that allows programs to take advantage of disjoint-access parallelism on both the read side and the write side. Unfortunately, STM systems have higher overhead than many other synchronization mechanisms, so although STM scales, it starts from a lower baseline. We propose combining relativistic programming and software transactional memory in a way that gets the best of both worlds: low-overhead, linearly scalable reads that never conflict with writes, and scalable disjoint-access-parallel writes. We argue for the correctness of our approach and present performance data from an actual implementation that delivers the promised performance characteristics.

Quarantine: Fault Tolerance for Concurrent Servers with Data-Driven Selective Isolation
We present Quarantine, a system that enables data-driven selective isolation within concurrent server applications. Instead of constructing arbitrary isolation boundaries between components, Quarantine collects data to learn where such boundaries should be placed, and then instantiates those boundaries to improve reliability. We present the case for data-driven selective isolation and discuss the challenges in realizing such a system.
