HotPar '10

POSTER SESSION

Accepted Posters

The Poster Session will include all the talks in the program, as well as the papers listed here.

General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Workloads
XMT is a general-purpose many-core parallel architecture. The foremost design objective for XMT was to meet the highest standards for ease of parallel programming. GPUs, on the other hand, have acquired a strong reputation for performance, sometimes at the expense of ease of programming. The current paper presents a performance comparison on diverse workloads between XMT and an NVIDIA CUDA-enabled GPU. Configured with roughly the same amount of chip resources as the GPU, XMT achieves an average speedup of 4.67x on irregular applications, while incurring an average slowdown of 2.89x on regular ones. Namely, instead of paying a (possibly worthwhile) price for easier programming, XMT comes out ahead on significant applications. This surprising result suggests a yet untapped opportunity: a high-performance, easy-to-program, general-purpose 1000-core computer.

DeNovo: Rethinking Hardware for Disciplined Parallelism
We believe that future large-scale multicore systems will require disciplined parallel programming practices, including data-race-freedom, deterministic-by-default semantics, and structured, explicit parallel control and side-effects. We argue that this software evolution presents far-reaching opportunities for parallel hardware design to greatly reduce complexity and improve power efficiency and performance scalability. The DeNovo project is rethinking hardware design from the ground up to exploit these opportunities. This paper presents the broad research agenda of DeNovo, including a holistic rethinking of cache coherence, memory consistency, communication, and cache architecture.

A Case for Software Managed Coherence in Many-core Processors
Processor vendors are integrating more and more cores into their chips. These many-core processors usually implement hardware coherence mechanisms, but when the core count reaches hundreds or more, it becomes prohibitively difficult to design and verify efficient hardware coherence support. At the same time, many parallel applications, for example RMS applications [9], show little data sharing, which suggests that providing a complex hardware coherence implementation wastes hardware budget and design effort. Moreover, in some increasingly important domains, such as server and cloud computing, multiple applications may run on a single many-core chip. Those applications require coherence support among the cores they are running on, but not between different applications. This indicates a strong requirement for dynamically reconfigurable coherence domains, which is extremely hard to support with hardware-only mechanisms. In addition, hardware coherence is believed to be too complex to support heterogeneous platforms such as combined CPU-GPU systems, regardless of whether the GPU is integrated or discrete. In this paper, we argue that software managed coherence is a better choice than hardware coherence for many-core processors. We believe that software managed coherence can make better use of silicon, efficiently support emerging applications, dynamically reconfigure coherence domains, and, most importantly, still provide performance comparable to hardware coherence. We have implemented a prototype system with software managed coherence over a partially-shared address space and show promising results.

Virtues and Obstacles of Hardware-assisted Multi-processor Execution Replay
The trend towards multi-core processors has spurred a renewed interest in developing parallel programs. Parallel applications exhibit non-deterministic behavior across runs, primarily due to the changing interleaving of shared-memory accesses from run to run. The non-deterministic nature of these programs leads to many problems, which we describe in this paper. In order to get a handle on the non-determinism, techniques to record and replay these programs have been proposed. These techniques capture the non-determinism and allow the execution of a program to be repeated, enabling better understanding and eliminating problems due to non-determinism. In this paper, we argue that an efficient record and replay (R&R) system must be assisted by hardware support. However, adding support for R&R in hardware requires strong justification in terms of added value to the processor. Until now, debugging has been the primary motivation used by previous work. We argue that stronger usage models are required to justify the cost of adding hardware support for R&R. Hence, the first contribution of this paper is a set of additional usage models (the virtues of R&R) and a discussion of how we think they will add value to future microprocessors. We think the current lack of focus on wider and more applicable usage models is an obstacle to the adoption of R&R. The other obstacle to implementing hardware-assisted R&R is that some complexities of such an endeavor have not been properly addressed and studied. Current research approaches have shortcomings that prevent them from becoming an implementable feature. In this paper, we also identify those shortcomings and indicate research directions to bridge the gap between research and a product implementation.

Prospector: A Dynamic Data-Dependence Profiler to Help Parallel Programming
Multiprocessor architectures are increasingly common these days. In order to fully and efficiently utilize the abundant hardware parallelism, translating many sequential programs into parallel programs is a pressing need. Although many compilers support automatic parallelization, most programmers still parallelize their applications manually. To help parallelize applications, especially legacy programs written by other programmers, we propose Prospector, a profile-based parallelism identification tool that uses dynamic data-dependence profiling results. It also advises on how to parallelize the identified sections. We demonstrate that Prospector is able to discover potentially parallelizable loops that are missed by state-of-the-art production compilers. We illustrate its potential for guiding parallel programming by programmers who are unfamiliar with the serial code. This paper also discusses technical challenges in implementing Prospector.

Enabling Legacy Applications on Heterogeneous Platforms
In this paper we make the case for a runtime technique to seamlessly execute legacy applications on heterogeneous platforms consisting of CPUs and accelerators. We consider discrete as well as integrated heterogeneous platforms. In the former, CPU and accelerators have different memory systems; in the latter, accelerators share physical memory with the CPU. Our proposed runtime does not require any code changes to be made to the application. It automatically schedules compute-intensive routines found in legacy code on suitable computing resources, while reducing data transfer overhead between the CPU and accelerators. To reduce data movement, our runtime defers data transfers between different memory systems, and attempts to move computations to data instead of vice-versa. This could create multiple copies of the data — one on the CPU, and the others on the accelerators — leading to coherence issues. To address this problem, we propose adding an operating system module that maintains coherence by intercepting accesses to shared data and forcing synchronization. Thus, by exploiting existing mechanisms found in system software, we architect a non-intrusive technique to enable legacy applications to take advantage of heterogeneous platforms. With neither software changes nor additional hardware support, the proposed system provides a unified compute and memory view to the application programmer.

A Principled Kernel Testbed for Hardware/Software Co-Design Research
Recently, advances in processor architecture have become the driving force for new programming models in the computing industry, as ever newer multicore processor designs with an increasing number of cores are introduced on schedules regimented by marketing demands. As a result, collaborative parallel (rather than simply concurrent) implementations of important applications, programming languages, models, and even algorithms have been forced to adapt to these architectures to exploit the available raw performance. We believe that this optimization regime is flawed. In this paper, we present an alternate approach that, rather than starting with an existing hardware/software solution laced with hidden assumptions, defines the computational problems of interest and invites architects, researchers, and programmers to implement novel hardware/software co-designed solutions. Our work builds on the previous ideas of computational dwarfs, motifs, and parallel patterns by selecting a representative set of essential problems for which we provide an algorithmic description, a scalable problem definition, illustrative reference implementations, and verification schemes. This testbed will enable comparative research in areas such as parallel programming models, languages, auto-tuning, and hardware/software co-design. For simplicity, we focus initially on the computational problems of interest to the scientific computing community, but we consider the methodology (and perhaps a subset of the problems) applicable to other communities. We intend to broaden the coverage of this problem space through stronger community involvement.

Determinism Should Ensure Deadlock-Freedom
The advent of multicore processors has made concurrent programming models mandatory. However, most concurrent programming models come with two major pitfalls: non-determinism and deadlocks. By determinism, we mean that the output behavior of the program is independent of scheduling choices (e.g., those made by the operating system) and depends only on the input behavior. A few concurrent models provide deterministic behavior by providing constructs that impose additional synchronization, but improper or out-of-order use of these constructs leads to problems like deadlocks. In this paper, we argue for both determinism and deadlock-freedom. We propose the design of a deterministic, deadlock-free model. Any program that uses this model is guaranteed to produce the same output for a given input. Additionally, the program will never deadlock: the program will either terminate or run forever.

Checking Non-Interference in SPMD Programs
We study one of the basic multicore and GPU programming models, namely, SPMD (Single-Program Multiple-Data) programs. We define a formal model of SPMD programs based on interleaving threads that manipulate global and local arrays and synchronize via barriers. SPMD programs are written with the intention that they be deterministic, although programming errors may result in this not being true. SPMD programs are also frequently modified in pursuit of optimal performance. These facts motivate the need for methods to check determinism and program equivalence. A key property in achieving this is non-interference. We formulate non-interference as the validity of logical formulas automatically derived from the program, we show that non-interference implies determinism, and we report on a prototype that can prove non-interference of NVIDIA CUDA programs.
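As an illustration of the kind of condition involved (a sketch of one common way to state the property, not necessarily the formulas the tool derives), non-interference within a single barrier interval can be written as a disjointness requirement on the global locations the threads touch:

\forall i \neq j:\quad W_i \cap (R_j \cup W_j) = \emptyset

where $W_i$ and $R_i$ denote the sets of global array locations written and read by thread $i$ between two consecutive barriers. If this holds for every barrier interval, the state reached at each barrier is independent of how the threads were interleaved, which is why non-interference implies determinism.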

Energy-Performance Trade-off Analysis of Parallel Algorithms
Energy consumption by computer systems has emerged as an important concern, both at the level of individual devices (limited battery capacity in mobile systems) and at the societal level (the production of greenhouse gases). In multicore architectures, applications may be executed on a variable number of cores, and these cores may operate at different frequencies. The performance and energy cost of a parallel algorithm executing on a multicore architecture exhibit different trade-offs, depending on how many cores the algorithm uses, the frequencies at which these cores operate, and the structure of the algorithm. We show how algorithm designers and software developers can analyze the energy-performance trade-off in parallel algorithms. We believe that such analyses should be applied to parallel algorithms to facilitate energy conservation.
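To make the trade-off concrete, consider a deliberately simplified model (an illustrative assumption, not necessarily the model used in the paper): a perfectly parallelizable workload of $W$ operations run on $p$ cores at frequency $f$, with per-core power $\alpha f^3 + P_{\mathrm{static}}$ under voltage-frequency scaling. Then

T(p,f) = \frac{W}{p f}, \qquad E(p,f) = p\,(\alpha f^3 + P_{\mathrm{static}})\,T(p,f) = \alpha W f^2 + \frac{P_{\mathrm{static}} W}{f}.

Lowering the frequency shrinks the dynamic term but lengthens the run and so grows the static term; a serial fraction in the algorithm changes $T(p,f)$ and shifts the energy-optimal operating point, which is exactly the kind of structural dependence such an analysis must capture.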

Bridging the Parallelization Gap: Automating Parallelism Discovery and Planning
Multicore processors have forced mainstream programmers to rethink the way they design software. Parallelism will be the avenue for performance gains in these multicore processors but will require new tools and methodologies to gain full acceptance by everyday programmers. As a step towards improved parallelization tools, we propose a parallelization taxonomy that categorizes tools based on which of five fundamental stages of parallelization they assist with. Based on this taxonomy, we find that many popular parallelization tools focus on the final stages, leaving the programmer to perform the initial stages without assistance. In this paper we provide a preliminary description of pyrprof, a tool that helps the programmer locate parallel regions of code and decide which regions to parallelize first. pyrprof performs dynamic critical path analysis and utilizes the structure of programs to highlight exploitable forms of parallelism. A case study based on MPEG encoding is used to demonstrate pyrprof's effectiveness.
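For context, critical-path analysis of this kind rests on the standard work/span bound (stated here as background, not as a claim about pyrprof's internals): if a region performs total work $T_1$ and its dynamic critical path has length $T_\infty$, then no parallelization of that region can achieve a speedup greater than

\mathrm{speedup} \;\le\; \frac{T_1}{T_\infty},

so regions with a large $T_1 / T_\infty$ ratio are the most promising candidates to parallelize first.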

Leveraging Semantics Attached to Function Calls to Isolate Applications from Hardware
To improve performance, computer systems are forcing more microarchitectural and parallel hardware details to be directly exploited by application programmers, exposing limitations in existing compiler and OS infrastructure, which is failing to maintain the software productivity of the past. In this paper we propose a pragmatic approach, motivated by our experience with BLIS [11], for building applications that tolerate changing hardware, delivering good performance from the same source across diverse parallel targets. Applications are coded in terms of generic parallel patterns using a "piggyback" language, embedded into a base sequential language by attaching semantics to function calls. Our approach allows programmers to leverage multiple processor-specific and domain-specific toolchains encapsulated in specialization modules, which extract their input information from the semantics of the function calls, creating an isolation layer between the application and target platforms. Developers use existing sequential development tools and languages to code and debug, performing specialization as a separate step when shipping the code. We show this approach can successfully specialize a single source to diverse and evolving heterogeneous multicore targets and enable aggressive compiler optimizations.
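A hypothetical sketch of the idea (the pattern name and signature below are illustrative assumptions, not BLIS's actual interface): the application stays ordinary sequential C, and a specialization module that knows the semantics attached to the call can later substitute a parallel, target-specific implementation.

#include <stddef.h>

/* A generic "map" pattern expressed as a plain function call. The
 * reference implementation is sequential, so existing compilers and
 * debuggers handle it unmodified; a specializer may replace the call
 * with an OpenMP, vectorized, or GPU version when the code is shipped. */
void pattern_map(float *dst, const float *src, size_t n,
                 float (*f)(float))
{
    for (size_t i = 0; i < n; i++)
        dst[i] = f(src[i]);
}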

OpenMP for Next Generation Heterogeneous Clusters
Recent years have seen great diversity in new hardware, with GPUs, for example, providing several times the processing power of CPUs. Programming GPUs or clusters of GPUs, however, is still complicated and time-consuming. In this paper we present extensions to OpenMP that allow one program to scale from a single multi-core CPU to a many-core cluster (e.g., a GPU cluster). We extend OpenMP with a new scheduling clause to enable developers to specify automatic tiling, and with library functions to access the tile size or the number of the currently calculated tile. We furthermore demonstrate that the intra-tile parallelization can be created automatically based on the inter-tile parallelization, thereby allowing for scalability to shared-memory many-core architectures. To be able to use OpenMP on distributed memory systems, we propose a PGAS-like memory level called world memory. World memory not only allows data to be shared among multiple processes but also allows for fine-grained synchronization of processes. World memory has two states: initialized and uninitialized. A process reading from uninitialized memory will be suspended until another process writes to that memory and thereby initializes it. This concept requires oversaturating the available hardware with processes.
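For illustration, here is the kind of manual inter-tile/intra-tile decomposition that the proposed clause aims to generate automatically, written in standard OpenMP (the tile size and loop body are placeholders; the paper's new clause and tile-query functions are not shown, since their syntax is not given here).

#include <stddef.h>

#define TILE 256  /* placeholder tile size */

/* The outer loop distributes whole tiles across threads (inter-tile
 * parallelism); the inner loop walks the elements of one tile, the
 * intra-tile work that could itself be parallelized on a many-core
 * target. */
void scale(float *a, const float *b, size_t n, float alpha)
{
    #pragma omp parallel for schedule(static)
    for (size_t t = 0; t < n; t += TILE) {
        size_t end = (t + TILE < n) ? t + TILE : n;
        for (size_t i = t; i < end; i++)
            a[i] = alpha * b[i];
    }
}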

Lock Prediction
This paper makes the case for studying and exploiting the predictability of lock operations in multithreaded programs. We discuss several uses of lock prediction and identify the two key events we believe are useful to predict: (1) which thread will be the next one to acquire a given lock and (2) which lock a given thread will acquire next. We describe multiple prediction algorithms and show that while their accuracy varies widely, they are likely accurate enough for many uses.
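As a deliberately simple illustration of event (1) (an assumed sketch, not one of the algorithms evaluated in the paper), a "last successor" predictor remembers, for each lock, which thread acquired it after the current holder last time, and predicts that thread again.

#include <stdint.h>

#define MAX_LOCKS   64   /* assumed table sizes for the sketch */
#define MAX_THREADS 32

static uint8_t next_owner[MAX_LOCKS][MAX_THREADS]; /* learned successor */
static uint8_t last_owner[MAX_LOCKS];               /* most recent holder */

/* Record an observed acquisition of lock `l` by thread `t`. */
void observe_acquire(unsigned l, unsigned t)
{
    next_owner[l][last_owner[l]] = (uint8_t)t;
    last_owner[l] = (uint8_t)t;
}

/* Predict which thread will acquire lock `l` next. */
unsigned predict_next_acquirer(unsigned l)
{
    return next_owner[l][last_owner[l]];
}

More accurate variants can condition on longer acquisition history, at the cost of larger tables.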

Resource Management in the Tessellation Manycore OS
Tessellation is a manycore OS targeted at the resource management challenges of emerging client devices, including the need for real-time and QoS guarantees. It is predicated on two central ideas: Space-Time Partitioning (STP) and Two-Level Scheduling. STP provides performance isolation and strong partitioning of resources among interacting software components, called Cells. Two-Level Scheduling separates global decisions about the allocation of resources to Cells from application-specific scheduling of resources within Cells. We describe Tessellation's Cell model and its resource allocation architecture. We present results from an early prototype running on two different platforms including one with memory-bandwidth partitioning hardware.

Processes and Resource Management in a Scalable Many-core OS
The emergence of many-core architectures necessitates a redesign of operating systems, including the interfaces they expose to an application. We propose a new operating system, called ROS, designed specifically to address many limitations of current OSs as we move into the many-core era. Our goals are (1) to provide better support for parallel applications for both high-performance and general purpose computing, and (2) to scale the kernel to thousands of cores. In this paper, we focus on the process model and resource management mechanisms of ROS. We expand the traditional process model to include the notion of a 'many-core' process designed to naturally support parallel applications. Additionally, we discuss our method of resource management that builds on the ideas of Space-Time Partitioning presented in our previous work [16]. Central to our design is the notion that protection domains should not necessarily be coupled with resource management, and that resource partitioning should not necessarily be coupled with resource allocation.

Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems
A typical consumer desktop computer has a multi-core CPU with at least two and up to eight processing elements over two processors, and a multi-core GPU with up to 512 processing elements. Both the CPU and the GPU are capable of running parallel code, yet it is not obvious when to utilize one processor or the other because of workload considerations and, as importantly, contention on each device. This paper demonstrates a method for dynamically deciding whether to run a given parallel workload on the CPU or the GPU depending on the state of the system when the code is launched. To achieve this, we tested a selection of parallel OpenCL code on a multi-core CPU and a multi-core GPU, as part of a larger program that runs on the CPU. When the parallel code is launched, the runtime makes a dynamic decision about which processor to run the code on, given system state and historical data. We demonstrate a method for using meta-data available to the runtime and historical data from code profiling to make the dynamic decision. We outline the runtime information necessary for making effective dynamic decisions, and suggest hardware, operating system, and driver support.
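A minimal sketch of the dynamic decision described above (the structure and fields are hypothetical; the actual runtime also uses richer meta-data from profiling): estimate the completion time on each device as the historically profiled kernel time plus the work currently queued on that device, and run where the estimate is smaller.

/* Hypothetical decision rule for one parallel region. */
enum device { RUN_ON_CPU, RUN_ON_GPU };

struct history { double cpu_time_s, gpu_time_s; };   /* profiled kernel times   */
struct load    { double cpu_queue_s, gpu_queue_s; }; /* current contention, sec */

enum device choose_device(const struct history *h, const struct load *l)
{
    double cpu_estimate = h->cpu_time_s + l->cpu_queue_s;
    double gpu_estimate = h->gpu_time_s + l->gpu_queue_s;
    return (cpu_estimate <= gpu_estimate) ? RUN_ON_CPU : RUN_ON_GPU;
}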

Capturing and Composing Parallel Patterns with Intel CnC
The most accessible and successful parallel tools today are those that ask programmers to write only isolated serial kernels, hiding parallelism behind a library interface. Examples include Google's Map-Reduce [5], CUDA [13], and STAPL [12]. This encapsulation approach applies to a wide range of structured, well-understood algorithms, which we call parallel patterns. Today's high-level systems tend to encapsulate only a single pattern. Thus we explore the use of Intel CnC as a single framework for capturing and composing multiple patterns.

Molatomium: Parallel Programming Model in Practice
Consumer electronics products are adopting multi-core processors. What is important for such adoption is to balance productivity and performance with high portability. To achieve this goal, we developed Molatomium, a programming model for the multi-core paradigm. It consists of a C-like coordination language named Mol and a C/C++ part called Atom. In practice, our latest flagship digital TV, CELL REGZA™, uses Molatomium to parallelize its media processing applications. In this paper, we describe Molatomium's programming model, language constructs, and performance. We also discuss the pitfalls we faced during development and our future plans.
