Towards Parallelizing the Layout Engine of Firefox
The Mozilla Firefox browser currently accounts for ~25% of the total web browser market share, making it the second most popular browser worldwide after Microsoft's Internet Explorer. With the adoption of a tracing JavaScript Just-In-Time (JIT) compiler in Firefox 3.5, its performance has improved significantly, especially for web pages that make heavy use of JavaScript. The main performance bottleneck in Firefox is now its layout engine. According to our extensive performance measurements and analysis of representative applications running on Intel platforms, the layout engine accounts for ~40% of Firefox's total execution time, with the Cascading Style Sheets (CSS) rule-matching process being the hottest part of layout. We have developed a tracing and profiling mechanism that has enabled us to gain in-depth insight into the runtime characteristics of CSS rule checking. This data has proven extremely helpful in determining that the CSS rule-matching component is a suitable target for parallelization. The collected trace information has also guided us to an effective parallel implementation of the Firefox CSS rule-matching component that delivers large speedups on popular web pages, including almost 90% of Mozilla's page-load tests, which comprise ~400 distinct web pages from all over the world. For some of these tests, the entire page-load process is sped up by a factor of 1.84x when two worker threads cooperate in doing the layout. To our knowledge, this is the first fully functional parallel implementation of CSS rule matching that delivers significant speedups in a world-class browser.
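As a rough sketch of the parallelization opportunity the abstract describes (illustrative code, not Mozilla's implementation): the rule set is read-only while styles are resolved, so disjoint slices of the element array can be matched by independent worker threads without locks. The Rule, Element, and matches names are assumptions made for the example.

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <string>
    #include <thread>
    #include <vector>

    struct Rule    { std::string selector; };   // stand-in for a parsed CSS rule
    struct Element { std::string tag; std::vector<const Rule*> matched; };

    // Illustrative selector test: a rule "matches" when its selector equals the tag.
    static bool matches(const Rule& r, const Element& e) { return r.selector == e.tag; }

    // Each worker matches every rule against its own slice of the element array.
    // Rules are immutable and each element is owned by exactly one worker, so the
    // matching phase needs no synchronization.
    static void matchSlice(const std::vector<Rule>& rules, std::vector<Element>& elems,
                           std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i)
            for (const Rule& r : rules)
                if (matches(r, elems[i])) elems[i].matched.push_back(&r);
    }

    void parallelMatch(const std::vector<Rule>& rules, std::vector<Element>& elems,
                       unsigned nThreads) {
        std::vector<std::thread> workers;
        std::size_t chunk = (elems.size() + nThreads - 1) / nThreads;
        for (unsigned t = 0; t < nThreads; ++t) {
            std::size_t lo = t * chunk, hi = std::min(elems.size(), lo + chunk);
            if (lo < hi)
                workers.emplace_back(matchSlice, std::cref(rules), std::ref(elems), lo, hi);
        }
        for (auto& w : workers) w.join();
    }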
Opportunities and Challenges of Parallelizing Speech Recognition
Automatic speech recognition is an important tool that enables a wide range of current and emerging applications such as automatic transcription, multimedia content analysis, and natural human-computer interfaces. This article provides a glimpse of the opportunities and challenges that parallelism presents for automatic speech recognition and closely related research, from the point of view of speech researchers. The increasing parallelism in computing platforms opens three major possibilities for speech recognition systems: increasing recognition accuracy in non-ideal, everyday noisy environments; increasing recognition throughput in batch processing of speech data; and reducing recognition latency in real-time usage scenarios. We explain the technical challenges, the current approaches taken, and the possible directions for future research in these three areas to guide the design of efficient parallel software and hardware infrastructures.
A Balanced Programming Model for Emerging Heterogeneous Multicore Systems
Computer systems are moving towards a heterogeneous architecture that combines one or more CPUs with one or more accelerator processors. Such heterogeneous systems pose a new challenge to the parallel programming community. Languages such as OpenCL and CUDA provide a programming environment for such systems, but they focus on data-parallel programming in which the majority of the computation is carried out by the accelerators. Our view is that, in the future, accelerator processors will be tightly coupled with the CPUs, be available in different system architectures (e.g., integrated and discrete), and systems will be dynamically reconfigurable. In this paper we advocate a balanced programming model in which computation is balanced between the CPU and its accelerators. This model supports sharing virtual memory between the CPU and the accelerator processors, so the same data structures can be manipulated by both sides. It also supports task-parallel as well as data-parallel programming, fine-grained synchronization, thread scheduling, and load balancing. This model not only leverages the computational capability of CPUs, but also allows dynamic system reconfiguration and supports different platform configurations. To help demonstrate the practicality of our programming model, we present performance results for a preliminary implementation on a computer system with an Intel® Core™ i7 processor and a discrete Larrabee processor. These results show that the model's most performance-critical part, its shared virtual memory implementation, simplifies programming without hurting performance.
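To make the shared-virtual-memory point concrete, here is a minimal sketch in which a host thread stands in for the accelerator (an assumption of the example; there is no real device here). Because both sides see the same address space, the "kernel" chases the host's pointers directly, and no marshaling or copying into a separate device address space is required.

    #include <cstddef>
    #include <iostream>
    #include <thread>
    #include <vector>

    struct Node { int value; Node* next; };

    // "Accelerator" kernel: with shared virtual memory it can traverse the same
    // host pointers directly; nothing is serialized into a device buffer.
    void acceleratorTask(Node* head) {
        for (Node* n = head; n; n = n->next) n->value *= 2;
    }

    int main() {
        std::vector<Node> pool(4);
        for (std::size_t i = 0; i < pool.size(); ++i) {
            pool[i].value = int(i) + 1;
            pool[i].next  = (i + 1 < pool.size()) ? &pool[i + 1] : nullptr;
        }
        std::thread accel(acceleratorTask, &pool[0]);  // device side (simulated)
        accel.join();                                  // CPU reuses the same nodes
        for (Node* n = &pool[0]; n; n = n->next) std::cout << n->value << ' ';
        std::cout << '\n';                             // prints: 2 4 6 8
    }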
Collaborative Threads: Exposing and Leveraging Dynamic Thread State for Efficient Computation
Traditional models of parallel computing rely on statically partitioning computation or data: multiple parallel threads independently execute different parts of the overall computation, with little or no communication between them. Under current models, inter-thread communication is limited to data that carries little semantic information about the role of the thread in the overall application. Adding such knowledge and sharing it among threads would allow a collaborative model of computation.
In this paper, we present a novel programming model in which threads expose their internal higher-order computational state, allowing the construction of a global view of the program's computations, which we call the computational state tree (CST) and which enables several optimizations. We discuss how the CST can be used by threads to steer the computation to where it would be most useful, or to reuse results already computed by other threads.
We present a method to extract collaborative information from the states of threads and insert it into the CST. We implement our method through a runtime and develop an API that can be used to leverage collaboration. We demonstrate how collaboration can be used to orient the computation of a SAT solver to maximize the number of satisfiable assignments found, and how collaboration through result sharing can be used to speed up a K-Means computation.
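The result-sharing half of the idea can be sketched with a shared, thread-safe memo table: each thread publishes what it computes and consults what others have published before repeating work. This is a minimal illustration of cross-thread result reuse, not the paper's CST or runtime API.

    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    // Shared memo table: the "collaborative" state threads expose to each other.
    std::unordered_map<long, long> memo;
    std::mutex memoLock;

    long expensive(long n) {                 // stand-in for costly work
        long r = 0;
        for (long i = 0; i < n * 100000; ++i) r += i % 7;
        return r;
    }

    // Each thread checks the shared table before recomputing a result, and
    // publishes what it computes so other threads can reuse it.
    long collaborativeCompute(long n) {
        {   std::lock_guard<std::mutex> g(memoLock);
            auto it = memo.find(n);
            if (it != memo.end()) return it->second;   // reuse another thread's work
        }
        long r = expensive(n);
        std::lock_guard<std::mutex> g(memoLock);
        memo.emplace(n, r);
        return r;
    }

    int main() {
        std::vector<std::thread> ts;
        for (int t = 0; t < 4; ++t)          // four threads, overlapping workloads
            ts.emplace_back([] { for (long n : {10, 20, 10, 30}) collaborativeCompute(n); });
        for (auto& t : ts) t.join();
        std::cout << memo.size() << " distinct results computed\n";   // prints 3
    }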
Structured Parallel Programming with Deterministic Patterns
Many-core processors target improved computational performance by making available various forms of architectural parallelism, including but not limited to multiple cores and vector instructions. However, approaches to parallel programming that target these low-level parallel mechanisms directly lead to overly complex, non-portable, and often unscalable and unreliable code. A more structured approach to designing and implementing parallel algorithms is useful for reducing the complexity of developing software for such processors, and is particularly relevant for many-core processors with a large amount of parallelism and multiple parallelism mechanisms. In particular, efficient and reliable parallel programs can be designed around the composition of deterministic algorithmic skeletons, or patterns. While improving the productivity of experts, specific patterns and fused combinations of patterns can also guide relatively inexperienced users in developing efficient algorithm implementations with good scalability. The approach to parallelism described in this document includes both collective "data-parallel" patterns such as map and reduce and structured "task-parallel" patterns such as pipelining and superscalar task graphs. The structured pattern-based approach, like data-parallel models, addresses issues of both data access and parallel task distribution in a common framework. Optimization of data access is important both for many-core processors with shared memory systems and for accelerators with their own memories not directly attached to the host processor. A catalog of useful structured serial and parallel patterns will be presented. Serial patterns are presented because structured parallel programming can be considered an extension of structured control flow in serial programming. We will emphasize deterministic patterns in order to support the development of systems that automatically avoid unsafe race conditions and deadlock.
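A small sketch of what determinism buys when patterns are composed (illustrative code, not the paper's catalog): a map over fixed slices followed by a reduce that combines partial results in slice order produces the same output, down to floating-point rounding, regardless of how threads are scheduled.

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    // map and reduce composed as deterministic patterns: each worker owns a fixed
    // slice, and partial results are combined in slice order, so the result is
    // independent of thread scheduling.
    double mapReduce(const std::vector<double>& in, double (*f)(double),
                     unsigned nThreads) {
        std::vector<double> partial(nThreads, 0.0);
        std::vector<std::thread> ws;
        std::size_t chunk = (in.size() + nThreads - 1) / nThreads;
        for (unsigned t = 0; t < nThreads; ++t)
            ws.emplace_back([&, t] {
                std::size_t lo = t * chunk, hi = std::min(in.size(), lo + chunk);
                for (std::size_t i = lo; i < hi; ++i) partial[t] += f(in[i]);
            });
        for (auto& w : ws) w.join();
        return std::accumulate(partial.begin(), partial.end(), 0.0);   // fixed order
    }

    int main() {
        std::vector<double> xs(1000);
        std::iota(xs.begin(), xs.end(), 1.0);
        std::cout << mapReduce(xs, [](double x) { return x * x; }, 4) << '\n';
    }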
Separating Functional and Parallel Correctness Using Nondeterministic Sequential Specifications
Writing correct explicitly-parallel programs can be very challenging. While the functional correctness of a program can often be understood largely sequentially, a software engineer must simultaneously reason about the nondeterministic parallel interleavings of the program's threads of execution. This complication is similarly a challenge to automated verification efforts.
Thus, we argue that it is desirable to decompose a program's correctness into its sequential functional correctness and the correctness of its parallelization. We propose achieving this decomposition by specifying the parallel correctness of a program with a nondeterministic but sequential version of the program. In particular, if a software engineer annotates the intended algorithmic nondeterminism in a program, then the program can act as its own specification in verifying the correctness of its parallelism. We can interpret the annotated program as sequential but nondeterministic, and then verify the correctness of the parallelism by showing that it introduces no additional nondeterminism.
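A tiny example of the proposed decomposition (our own illustration, not taken from the paper): the sequential specification below is nondeterministic, since any index of an element satisfying the predicate is an acceptable answer, and the parallel implementation is correct if every result it can produce is allowed by that spec, i.e., the parallelism introduces no additional nondeterminism.

    #include <atomic>
    #include <cassert>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Sequential specification with annotated nondeterminism: *any* index whose
    // element satisfies pred is acceptable. This set of acceptable answers is
    // what the parallel version is checked against.
    bool specAllows(const std::vector<int>& v, int idx, bool (*pred)(int)) {
        return idx >= 0 && idx < int(v.size()) && pred(v[idx]);
    }

    // Parallel implementation: threads race to publish a satisfying index. The
    // winner is scheduling-dependent, but every possible winner is allowed by
    // the nondeterministic sequential spec.
    int parallelFind(const std::vector<int>& v, bool (*pred)(int)) {
        std::atomic<int> found(-1);
        auto worker = [&](std::size_t lo, std::size_t hi) {
            for (std::size_t i = lo; i < hi && found.load() < 0; ++i)
                if (pred(v[i])) { found.store(int(i)); return; }
        };
        std::thread a(worker, 0, v.size() / 2), b(worker, v.size() / 2, v.size());
        a.join(); b.join();
        return found.load();
    }

    int main() {
        std::vector<int> v = {3, 8, 1, 14, 6, 9, 22};
        auto even = [](int x) { return x % 2 == 0; };
        int idx = parallelFind(v, even);
        assert(specAllows(v, idx, even));   // parallel result stays within the spec
    }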
Synchronization via Scheduling: Managing Shared State in Video Games
Video games are a performance-hungry application domain with a complexity that often rivals operating systems. These performance and complexity issues, in combination with tight development times and large teams, mean that consistent, specialized, and pervasive support for parallelism is of paramount importance. The Cascade project is focused on designing solutions to support this application domain. In this paper we describe how the Cascade runtime extends the industry-standard job/task graph execution model with a new approach to managing shared state. Traditional task graph models dictate that tasks making conflicting accesses to shared state must be linked by a dependency, even if there is no explicit logical ordering on their execution. When it is difficult to determine whether such implicit dependencies exist, the programmer creates more dependencies than needed, which results in constrained graphs with large monolithic tasks and limited parallelism.
By using the results of off-line code analysis and information exposed at runtime, the Cascade runtime automatically detects scenarios where implicit dependencies exist and schedules tasks to avoid data races. This technique is called Synchronization via Scheduling (SvS), and we present two implementations of it. The first uses Bloom-filter-based "signatures"; the second relies on automatic data partitioning, which has optimization potential independent of SvS. Our experiments show that SvS succeeds in achieving a high degree of parallelism and allows for finer-grained tasks. However, we find that one consequence of sufficiently fine-grained tasks is that the time to dispatch them exceeds their execution time, even with a highly optimized scheduler/manager. Since fine-grained tasks are a necessary condition for sufficient parallelism and overall performance gains, this finding motivates further inquiry into how tasks are managed.
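The signature idea can be sketched compactly (a single-hash toy, not the Cascade implementation): each task summarizes the objects it may touch in a small Bloom filter, and the scheduler admits a task into the running batch only if its filter is disjoint from those already admitted. False positives only cost parallelism; real conflicts are never missed, so admitted tasks need no locks.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // A task's signature: a Bloom filter over the IDs of the objects it may read
    // or write. One cheap hash here; real signatures would use several.
    struct Signature {
        std::uint64_t bits = 0;
        void add(std::uint64_t objectId) {
            bits |= 1ULL << (objectId * 2654435761ULL % 64);
        }
        bool mayConflict(const Signature& other) const {
            return (bits & other.bits) != 0;   // no shared bits => provably disjoint
        }
    };

    struct Task { int id; Signature sig; };

    // Greedy scheduling round: admit a task into the current parallel batch only
    // if its signature is disjoint from everything already admitted.
    std::vector<Task> pickBatch(std::vector<Task>& pending) {
        std::vector<Task> batch;
        Signature combined;
        for (auto it = pending.begin(); it != pending.end();) {
            if (!it->sig.mayConflict(combined)) {
                combined.bits |= it->sig.bits;
                batch.push_back(*it);
                it = pending.erase(it);
            } else ++it;
        }
        return batch;
    }

    int main() {
        std::vector<Task> pending(3);
        for (int i = 0; i < 3; ++i) { pending[i].id = i; pending[i].sig.add(i); }
        pending[2].sig.add(0);   // task 2 also touches task 0's object => deferred
        for (const Task& t : pickBatch(pending))
            std::cout << "run task " << t.id << '\n';   // runs tasks 0 and 1 together
    }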
Get the Parallelism out of My Cloud
The hardware trend toward multicore processors has so far been driven by technology limitations: wire delays, power efficiency, and the limited ability to exploit instruction-level parallelism. Software evolution has led to the rise of the cloud. This multicore + cloud evolution poses several challenges and has led to a call for parallelism.
In this paper, we examine the drivers behind these trends to address three fallacies: that software is driven by hardware, that multicores will be everywhere, and that multicore hardware implies parallelism is exposed to all developers. After addressing these fallacies, we present our simple view of the future cloud-based ecosystem, based on what we refer to as data-centric concurrency.
Dynamic Processors Demand Dynamic Operating Systems
The rise of multicore processors has led to techniques that dynamically vary the set and characteristics of cores or threads available to the operating system. For example, Core Fusion merges multiple cores for faster processing. While the mechanics of a change, such as merging two cores into a more powerful core, can be handled by a virtualization layer, operating systems still have an interest in the exact set of cores available. For example, the read-copy-update mechanism in Linux must contact all cores to complete an update; thus, the OS must be informed when the set of CPUs changes. We demonstrate through an analysis of a recent Linux kernel that (i) there are over 15 subsystems that depend on knowing the set of cores, each of which can have multiple callbacks registered, and (ii) the existing hotplug mechanism is poorly suited to handling dynamic processors due to its poor performance and scalability. Based on this analysis, we propose two mechanisms, processor proxies and parallel and deferred hotplug, to provide low-latency reconfiguration. In initial experiments, we show that we can reduce the latency of reconfiguration in Linux by 95 percent.
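The dependence the analysis counts can be pictured with a Linux-style notifier pattern, sketched below with invented names (this is not kernel code): subsystems register callbacks, and every change to the online CPU set fans out to all of them, which is why bursts of reconfiguration events make latency and scalability matter.

    #include <functional>
    #include <iostream>
    #include <vector>

    // The pattern the analysis counts in Linux: subsystems register callbacks so
    // they learn when the set of online CPUs changes.
    enum class CpuEvent { Online, Offline };

    class CpuNotifier {
        std::vector<std::function<void(CpuEvent, int)>> callbacks;
    public:
        void subscribe(std::function<void(CpuEvent, int)> cb) {
            callbacks.push_back(std::move(cb));
        }
        void notify(CpuEvent ev, int cpu) {
            for (auto& cb : callbacks) cb(ev, cpu);   // every subsystem, every change
        }
    };

    int main() {
        CpuNotifier n;
        n.subscribe([](CpuEvent e, int cpu) {         // e.g., an RCU-like subsystem
            std::cout << "cpu " << cpu
                      << (e == CpuEvent::Online ? " online\n" : " offline\n");
        });
        n.notify(CpuEvent::Online, 2);    // dynamic cores imply bursts of such
        n.notify(CpuEvent::Offline, 2);   // events, hence the latency concern
    }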
Design Principles for End-to-End Multicore Schedulers
As personal computing devices become increasingly parallel multiprocessors, the requirements for operating system schedulers change considerably. Future general-purpose machines will need to handle a dynamic, bursty, and interactive mix of parallel programs sharing a heterogeneous multicore machine. We argue that a key challenge for such machines is rethinking scheduling as an end-to-end problem integrating components from the hardware and kernel up to the programming language runtimes and applications themselves.
We present several design principles for future OS schedulers, and discuss the implications of each for OS and runtime interfaces and structure. We illustrate the implementation challenges that result by describing the concrete choices we have made in the Barrelfish multikernel. This allows us to present one coherent scheduling design for an entire multicore machine, while at the same time drawing conclusions we think are applicable to the design of any general-purpose multicore OS.
OoOJava: An Out-of-Order Approach to Parallelizing Java
Developing parallel software using current tools can be challenging. Developers must reason carefully about the use of locks to avoid both race conditions and deadlocks. We present a compiler-assisted approach to parallel programming inspired by out-of-order hardware. In our approach, the developer annotates code blocks as reorderable to decouple these blocks from the parent thread of execution. OoOJava uses static analysis to extract all data dependences from both variables and data structures to generate an executable that is guaranteed to preserve the behavior of the original sequential code.
We have implemented OoOJava and achieved significant speedups on a ray tracer and a K-Means clustering benchmark. The straightforward development model, compiler feedback, and speedups are promising indicators that a simple deterministic parallel programming model with strong guarantees can become mainstream.
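The essence of decoupling reorderable blocks can be sketched with futures (in C++ rather than Java, and with the dependences written by hand that OoOJava would extract statically): independent blocks run concurrently, while a block that consumes another's value simply waits for it, preserving the sequential result.

    #include <future>
    #include <iostream>

    // The out-of-order idea in miniature: "reorderable" blocks become tasks whose
    // only ordering comes from the values that flow between them.
    int main() {
        auto a = std::async(std::launch::async, [] { return 6 * 7; });    // block A
        auto b = std::async(std::launch::async, [] { return 10 + 3; });   // block B: independent of A
        auto c = std::async(std::launch::async, [fa = std::move(a)]() mutable {
            return fa.get() + 1;                     // block C: waits only for A's value
        });
        std::cout << b.get() + c.get() << '\n';      // same answer as sequential order: 56
    }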
User-Defined Data Distributions in Chapel: Philosophy and Framework
This paper introduces user-defined domain maps, a novel concept for implementing distributions and memory layouts for parallel data aggregates. Our domain maps implement parallel arrays for the Chapel programming language and are themselves implemented using standard Chapel features. Domain maps export a functional interface that our compiler targets as it maps from a user's global-view algorithm down to the task-level operations required to implement the computation for multicore processors, GPUs, and distributed memory architectures. Unlike distributions in HPF and ZPL, Chapel's domain maps are designed for generality and do not rely on hard-coding a fixed set of distributions into the compiler and runtime. The chief contributions of this paper are its description of our motivating principles and an overview of our framework.
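The shape of the framework can be suggested with a small sketch (an invented C++ interface, not Chapel's actual domain-map interface): generic code is written against a few mapping functions, and users supply distributions, such as block or cyclic, by implementing them, so no fixed set of distributions is hard-coded.

    #include <cstddef>
    #include <iostream>

    // The "compiler-facing" interface: a few functions that say where each
    // global index lives. Names are illustrative assumptions.
    struct DomainMap {
        virtual int ownerOf(std::size_t globalIdx) const = 0;          // which locale?
        virtual std::size_t localIdx(std::size_t globalIdx) const = 0; // slot there
        virtual ~DomainMap() = default;
    };

    struct BlockMap : DomainMap {                   // one user-defined layout
        std::size_t n; int locales;
        BlockMap(std::size_t n, int locales) : n(n), locales(locales) {}
        std::size_t blockSize() const { return (n + locales - 1) / locales; }
        int ownerOf(std::size_t i) const override { return int(i / blockSize()); }
        std::size_t localIdx(std::size_t i) const override { return i % blockSize(); }
    };

    struct CyclicMap : DomainMap {                  // another layout, same interface
        int locales;
        explicit CyclicMap(int locales) : locales(locales) {}
        int ownerOf(std::size_t i) const override { return int(i % locales); }
        std::size_t localIdx(std::size_t i) const override { return i / locales; }
    };

    // Generic "compiler-generated" code targets only the interface:
    void describe(const DomainMap& dm, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            std::cout << "elem " << i << " -> locale " << dm.ownerOf(i)
                      << ", slot " << dm.localIdx(i) << '\n';
    }

    int main() {
        describe(BlockMap(8, 2), 8);   // elements 0..3 on locale 0, 4..7 on locale 1
        describe(CyclicMap(2), 8);     // elements alternate between the two locales
    }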
On the Limits of GPU Acceleration
This paper throws a small "wet blanket" on the hot topic of GPGPU acceleration, based on experience analyzing and tuning both multithreaded CPU and GPU implementations of three computations in scientific computing. These computations, namely (a) iterative sparse linear solvers, (b) sparse Cholesky factorization, and (c) the fast multipole method, exhibit complex behavior and vary in computational intensity and memory-reference irregularity. In each case, algorithmic analysis and prior work might lead us to conclude that an idealized GPU can deliver better performance, but we find that, with at least equal-effort CPU tuning and consideration of realistic workloads and calling contexts, two modern quad-core CPU sockets can roughly match one or two GPUs in performance.
Our conclusions are not intended to dampen interest in GPU acceleration; on the contrary, they should do the opposite: they partially illuminate the boundary between CPU and GPU performance, and ask architects to consider application contexts in the design of future coupled on-die CPU/GPU processors.
Gossamer: A Lightweight Programming Framework for Multicore Machines
The key to performance improvements in the multicore era is for software to utilize the available concurrency. A recent paper [3] summarizes the challenges and describes twelve so-called dwarfs—types of computing and communication patterns that occur in parallel programs. One of the key points in the paper is that a general programming model has to be able to accommodate all of the patterns defined by the dwarfs, singly or in combination. The challenge is to do so both simply and efficiently.
Parallel programming can be supported by providing a threads library [14, 15, 2]; by modifying compilers to extract parallelism or use optimistic code execution techniques [4, 6, 5]; by using concurrency features of existing languages such as Java or C#; by designing new programming languages such as Erlang [16], Fortress [24], X10 [22], and ZPL [7]; or by annotating sequential programs with directives that specify concurrency and synchronization, as in Cilk [11], Cilk++ [8], OpenMP [20], and others [25, 19, 1, 12, 17, 21].
All of these approaches are valuable and are producing useful results, but the last approach—annotating programs—has, in our opinion, the most potential to be simultaneously simple, general, and efficient. In particular, annotations are easier to use than libraries because they hide lots of bookkeeping details, and they are simpler to learn than an entire new programming language. Annotation-based approaches also have efficient implementations. However, no existing approach is general enough to support all the computational patterns (dwarfs) defined in [3].
This paper describes Gossamer, an annotation-based approach that is general as well as simple and efficient. Gossamer has three components: (1) a set of high-level annotations that one adds to a sequential program (C in our case) in order to specify concurrency and synchronization; (2) a source-to-source translator that takes an annotated sequential program and produces an optimized program that uses our threading library; and (3) a run-time system that provides efficient fine-grained threads and high-level synchronization constructs.
As will be seen, the Gossamer annotations are as simple to use as those of Cilk++, and Gossamer's performance is better than OpenMP's. What sets Gossamer apart is a more extensive set of annotations that enables a greater variety of applications to be solved. In addition to iterative and recursive parallelism, Gossamer supports pipelined computations through a general ordering primitive, domain decomposition by means of replicated code patterns, and MapReduce [10, 26] computations by means of an efficient associative memory type.
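As a sketch of the recursive-parallelism case (illustrative C++ with futures, not Gossamer's C annotation syntax, which is not reproduced here): a fork annotation detaches a recursive call from its parent, and the matching join waits for it, which is the pattern shown below.

    #include <future>
    #include <iostream>

    // What a Cilk-style fork/join annotation denotes underneath: recursive calls
    // become detachable tasks and the join point waits for them.
    long fib(long n) {
        if (n < 2) return n;
        if (n < 25) return fib(n - 1) + fib(n - 2);             // sequential cutoff
        auto left = std::async(std::launch::async, fib, n - 1); // "fork"
        long right = fib(n - 2);                                // parent keeps working
        return left.get() + right;                              // "join"
    }

    int main() { std::cout << fib(30) << '\n'; }   // prints 832040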
The paper is organized as follows. Section 2 introduces our programming model and annotations by means of numerous examples. Section 3 summarizes the Gossamer translator and run-time system. Section 4 gives experimental results. Section 5 discusses related work.
Reflective Parallel Programming: Extensible and High-Level Control of Runtime, Compiler, and Application Interaction
Thread support in most languages is opaque and low-level. Primitives like wait and signal do not allow users to determine the relative ordering of statements in different threads in advance.
In this paper, we extend the reflection and meta-programming facilities of object-oriented languages to cover parallel program schedules. The user can then access objects representing the extant threads or other parallel tasks. These objects can be used to modify or query happens-before relations, locks, and other high-level scheduling information. These high-level models enable users to design their own parallel abstractions, visualizers, safety checks, and other tools in ways that are not possible today.
We discuss one implementation of this technique, the intervals library, and show how the presence of a first-class, queryable program schedule allows us to support a flexible data race protection scheme. The scheme supports both static and dynamic checks and also permits users to define their own "pluggable" safety checks based on the reflective model of the program schedule.
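A toy version of a first-class, queryable schedule (an invented API in the spirit of the description, not the intervals library itself): happens-before edges are ordinary data, so a user-written "pluggable" check can ask whether two conflicting tasks are ordered before anything runs.

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    // Tasks and their happens-before edges are ordinary objects the program can
    // inspect and query, as in a reflective schedule.
    struct Schedule {
        std::map<std::string, std::set<std::string>> succ;   // happens-before edges
        void happensBefore(const std::string& a, const std::string& b) {
            succ[a].insert(b);
        }
        bool ordered(const std::string& a, const std::string& b) const {
            if (a == b) return true;                 // reachability in the edge DAG
            auto it = succ.find(a);
            if (it == succ.end()) return false;
            for (const auto& m : it->second)
                if (ordered(m, b)) return true;
            return false;
        }
    };

    // Pluggable race check: two tasks writing the same datum must be ordered.
    bool raceFree(const Schedule& s, const std::string& w1, const std::string& w2) {
        return s.ordered(w1, w2) || s.ordered(w2, w1);
    }

    int main() {
        Schedule s;
        s.happensBefore("init", "compute");
        s.happensBefore("compute", "report");
        std::cout << raceFree(s, "init", "report") << '\n';     // 1: ordered, safe
        std::cout << raceFree(s, "compute", "other") << '\n';   // 0: flag a race
    }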
Task Superscalar: Using Processors as Functional Units
The complexity of parallel programming greatly limits the effectiveness of chip multiprocessors (CMPs). This paper presents the case for task superscalar pipelines, an abstraction of traditional out-of-order superscalar pipelines that orchestrates an entire chip multiprocessor in the same way that out-of-order pipelines manage functional units. Task superscalar pipelines leverage an emerging class of task-based dataflow programming models to relieve programmers from explicitly managing parallel resources. We posit that task superscalar pipelines overcome many of the limitations of instruction-level out-of-order pipelines and provide a scalable interface for CMPs.
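The register-renaming analogy can be sketched in a few lines (a toy software runtime, not the paper's hardware proposal): as tasks arrive in program order, the runtime records the last producer of each datum and makes each new task wait on the producers of its inputs, the same scoreboard-style bookkeeping an out-of-order pipeline applies to registers, here applied to whole tasks.

    #include <functional>
    #include <future>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Miniature dependence tracking: tasks declare the data they read and write,
    // and the "pipeline" chains each task behind the last writer of its inputs.
    class TaskWindow {
        std::map<std::string, std::shared_future<void>> lastWriter;  // datum -> producer
    public:
        void submit(const std::vector<std::string>& ins,
                    const std::vector<std::string>& outs,
                    std::function<void()> body) {
            std::vector<std::shared_future<void>> deps;
            for (const auto& d : ins)
                if (lastWriter.count(d)) deps.push_back(lastWriter[d]);
            std::shared_future<void> done =
                std::async(std::launch::async, [deps, body] {
                    for (auto& d : deps) d.wait();   // inputs ready => task "issues"
                    body();
                }).share();
            for (const auto& d : outs) lastWriter[d] = done;   // record new producer
        }
        void drain() { for (auto& kv : lastWriter) kv.second.wait(); }
    };

    int main() {
        TaskWindow tw;
        tw.submit({}, {"a"}, [] { std::cout << "produce a\n"; });
        tw.submit({}, {"b"}, [] { std::cout << "produce b\n"; });  // runs in parallel
        tw.submit({"a", "b"}, {"c"}, [] { std::cout << "consume a,b\n"; });  // waits for both
        tw.drain();
    }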