Ateeq Sharfuddin and Xiaofan Feng, The George Washington University
To date, the GPGPU approach has been mainly utilized for academic and scientific computing, for example, for genetic algorithms, image analysis, cryptography, or password cracking. Though video cards supporting GPGPU have become pervasive, there do not appear to be any applications utilizing GPGPU for a household user. In this paper, one consumer application for GPGPU is described: utilizing GPGPUs for Desktop Search. Though the implementation is somewhat rudimentary, it still demonstrates a sizable performance gain.
Ghassan Almaless and Franck Wajsburt, UPMC Sorbonne Universités
Nowadays, single-chip cache-coherent multi-cores of up to 100 cores are a reality, and many-cores with hundreds of cores are planned for the near future. Due to the large number of cores and for power-efficiency reasons (performance per watt), cores become simpler, with small caches. To make efficient use of the parallelism offered by these architectures, applications must be multi-threaded. The POSIX Threads (PThreads) standard is the most portable way to use threads across operating systems. It is also used as a low-level layer to support other portable, shared-memory parallel environments such as OpenMP. In this paper, we propose to verify experimentally the scalability of shared-memory, PThreads-based applications on a Cycle-Accurate, Bit-Accurate (CABA) simulated 512-core architecture. Using two unmodified, highly multi-threaded applications, SPLASH-2 FFT and EPFilter (a medical-image noise-filtering application provided by Philips), our study shows a scalability limitation beyond 64 cores for FFT and 256 cores for EPFilter. Based on hardware event counters, our analysis shows that: (i) the detected scalability limitation is a conceptual problem related to the notion of thread and process; and (ii) the small per-core caches found in many-cores exacerbate the problem. Finally, we present our solution in principle and future work.
Pablo Reble, Stefan Lankes, Florian Zeitz, and Thomas Bemmerl, RWTH Aachen University
The integration of many cores per chip will lead to inefficiency of traditional multi-processor techniques. In particular, a hardware cache coherency protocol incurs performance and hardware overhead, so that for a growing number of cores the coherence wall problem becomes more serious.
The Single-chip Cloud Computer (SCC) is a recent research processor with a Cluster-on-Chip architecture that forgoes hardware-based coherency and instead provides network-on-chip technology. An attractive alternative for enabling shared-memory programming models on future many-core systems is the introduction of software-oriented coherency.
Any software-based approach, such as shared virtual memory (SVM), will need fast synchronization methods. The assumption is that hardware support is essential to achieve this performance. In this paper we study and evaluate this hypothesis.
D. Didona, Instituto Superior Técnico/INESC-ID, Portugal; P. Felber and D. Harmanci, University of Neuchâtel, Switzerland; P. Romano, Instituto Superior Técnico/INESC-ID, Portugal; J. Schenker, University of Neuchâtel, Switzerland
In this paper we investigate the issue of automatically identifying the "natural" degree of parallelism of an application, i.e., the workload-specific threshold below which increasing concurrency improves transaction throughput and beyond which additional concurrency does not help and may even degrade performance because of higher contention and abort rates, even if sufficient physical resources are available. Throughout this paper, we discuss the importance of adapting the concurrency level to the workload in various application settings. We provide empirical evidence of this, taken from two extreme scenarios: a shared-memory system with a low-level STM library written in C, and a distributed system with a high-level DSTM infrastructure written in Java. We overview two alternative self-tuning methodologies, based on on-line exploration and on model-driven performance forecasting techniques, and discuss how both approaches can be combined in order to maximize robustness and convergence speed toward optimal solutions.
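As a toy illustration of the on-line exploration approach mentioned above (not the authors' implementation), the following Python sketch hill-climbs over the number of worker threads, raising the concurrency level only while measured throughput keeps improving; the workload, batch size, and 5% improvement threshold are all hypothetical.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def run_batch(num_threads, tasks, work):
        # Run one batch of tasks at the given concurrency level and return
        # the measured throughput (tasks completed per second).
        start = time.time()
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            list(pool.map(work, tasks))
        return len(tasks) / (time.time() - start)

    def explore_concurrency(tasks, work, max_threads=64):
        # On-line exploration (hill climbing): keep raising the thread count
        # while throughput improves noticeably; stop once extra concurrency
        # no longer pays off, e.g. because contention starts to dominate.
        best_threads, best_tput = 1, run_batch(1, tasks, work)
        for n in range(2, max_threads + 1):
            tput = run_batch(n, tasks, work)
            if tput <= best_tput * 1.05:   # less than 5% gain: stop exploring
                break
            best_threads, best_tput = n, tput
        return best_threads

    # Hypothetical usage:
    # best = explore_concurrency(list(range(200)),
    #                            lambda i: sum(j * j for j in range(50_000)))

A model-driven alternative would replace the measured run_batch probes with throughput predictions from a performance model, trading measurement overhead for prediction accuracy.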
Randolf Rotta, Steffen Büchner, and Jörg Nolte, Brandenburg University of Technology
Today's multi-cores and future many-cores are NUMA architectures with complex cache hierarchies and multiple memory channels. Depending on the topologies of these memory networks we find everything from true data sharing with shared caches to distributed memory architectures which just pretend to be physical shared memory systems. In fact, most many-cores are hybrid systems that exhibit the characteristics of both distributed systems and SMPs. In this paper we argue in favor of middleware platforms for many-cores. We will discuss the needed functionality in contrast to common distributed system middleware and present micro benchmarks on several architectures to substantiate our claims.
Ettore Speziale and Michele Tartara, Politecnico di Milano
This paper describes a dynamic and lightweight compiler able to guide the execution of highly dynamic parallel programs at runtime without the need for a full-fledged Just-In-Time compiler. The proposed approach allows us to apply profitable optimizations that could not be used at compile-time due to lack of sufficient information. It also enables us to optimize code fragments multiple times, depending on the specific conditions of the execution environment at runtime, while limiting the performance overhead to a negligible amount.
Gagan Gupta, Srinath Sridharan, and Gurindar S. Sohi, University of Wisconsin—Madison
Multicore processors are already ubiquitous and are now the targets of hundreds of thousands of applications. For a variety of reasons, parallel programming has not been widely adopted to program even current homogeneous, known-resource multicore processors. Future multicore processors will be heterogeneous, will be increasingly less reliable, and will operate in environments with dynamically changing operating conditions. With energy efficiency as a primary goal, they will present even more parallel execution challenges to the masses of unsophisticated programmers. Rather than attempt to make parallel programming more practical by chipping away at the multitude of its known drawbacks, we argue that sequential programs and their dynamic parallel execution are a better model. The paper outlines how to achieve: (i) dynamic parallel execution from a suitably written, statically sequential program; (ii) energy-efficient execution by dynamically and continuously controlling the parallelism; and (iii) low-overhead, precisely restartable parallel execution.
Russell Sears and Christopher Douglas, Yahoo! Research
Declarative languages have been proposed for use in concurrent and distributed system development. In this work, we argue that the primary benefits of such languages come not from their declarative nature, but instead from the design patterns they espouse.
We explain how to apply these design patterns to C and Java and present two examples: a highly concurrent transactional storage system and a distributed storage implementation. We use case studies to highlight problems in current imperative and declarative approaches.
Compared to conventional imperative approaches, the additional rigor imposed by our approach improves testability and enables a wider range of systematic optimization and parallelization techniques. We inherit these properties from the declarative languages we mimic. The resulting implementations are structured as programs in those languages would be: they consist of view maintenance routines and asynchronous event handlers.
However, our use of manually generated code allows us to leverage the full range of imperative programming techniques. In particular, performance constraints sometimes force us to use techniques such as deadlock avoidance, invariant weakening, and lock-free updates. Such techniques are unavailable in current declarative runtimes; their correctness requires reasoning beyond the capabilities of current software synthesis systems.
Over time we expect higher level languages to improve dramatically, and we hope that some of our techniques will inform their designs. However, our concerns are more immediate: one of our systems is already in production, and development of the other is underway.
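To make the structure concrete, here is a minimal sketch of the view-maintenance-plus-asynchronous-event-handler pattern the authors describe, written in Python for brevity rather than in the paper's C and Java; all names are illustrative, not taken from their systems.

    import queue
    import threading

    # State updates arrive as asynchronous events; derived state ("views")
    # is kept consistent by dedicated view-maintenance routines.
    events = queue.Queue()
    base_table = {}          # primary data: key -> value
    value_index = {}         # materialized view: value -> set of keys

    def maintain_view(key, old_value, new_value):
        # View-maintenance routine: incrementally update the index instead
        # of recomputing it from the base table.
        if old_value is not None:
            value_index[old_value].discard(key)
        value_index.setdefault(new_value, set()).add(key)

    def handler_loop():
        # Asynchronous event handler: applies updates, then triggers maintenance.
        while True:
            key, new_value = events.get()
            if key is None:          # shutdown sentinel
                break
            old_value = base_table.get(key)
            base_table[key] = new_value
            maintain_view(key, old_value, new_value)

    worker = threading.Thread(target=handler_loop)
    worker.start()
    events.put(("a", 1)); events.put(("b", 1)); events.put(("a", 2))
    events.put((None, None))
    worker.join()
    print(value_index)   # {1: {'b'}, 2: {'a'}}

The rigor comes from confining all mutation of derived state to the maintenance routines, which is what makes the resulting code easier to test and to parallelize systematically.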
Stephen J. Tarsa, Tsung-han Lin, and H.T. Kung, Harvard University
Conjugate gradient is an important iterative method used for solving least-squares problems. It is compute-bound and generally involves only simple matrix computations. One would expect that we could fully parallelize such computation on the GPU architecture with multiple Stream Multiprocessors (SMs), each consisting of many SIMD processing units. While implementing a conjugate gradient method for compressive sensing signal reconstruction, we have noticed that a large speed-up from parallel processing is actually infeasible because of the high I/O cost between SMs and GPU global memory. We have found that if SMs were linearly connected, we could gain a 15x speedup by loop unrolling. We conclude that adding these relatively inexpensive neighbor connections between SMs can significantly enhance the applicability of GPUs to a large class of similar matrix computations.
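For reference, the kind of least-squares iteration described above can be sketched as textbook conjugate gradient on the normal equations (CGNR) in a few lines of Python with NumPy; this is a sequential illustration, not the authors' GPU implementation.

    import numpy as np

    def cg_least_squares(A, b, iters=100, tol=1e-10):
        # Solve min ||Ax - b||_2 by running conjugate gradient on the
        # normal equations A^T A x = A^T b (CGNR).
        x = np.zeros(A.shape[1])
        r = A.T @ (b - A @ x)          # residual of the normal equations
        p = r.copy()
        rs_old = r @ r
        for _ in range(iters):
            Ap = A.T @ (A @ p)         # the only matrix work per iteration
            alpha = rs_old / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    # Example with an overdetermined system:
    # A = np.random.randn(200, 50); b = np.random.randn(200)
    # x = cg_least_squares(A, b)

The per-iteration work is dominated by the two matrix-vector products forming Ap, which is why inter-SM data movement, rather than arithmetic, becomes the bottleneck on a GPU.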
Timothy G. Armstrong, University of Chicago; Justin M. Wozniak, Michael Wilde, and Ketan Maheshwari, Argonne National Laboratory; Daniel S. Katz, University of Chicago and Argonne National Laboratory; Matei Ripeanu, University of British Columbia; Ewing L. Lusk, Argonne National Laboratory; Ian T. Foster, University of Chicago and Argonne National Laboratory
We present here the ExM (extreme-scale many-task) programming and execution model as a practical solution to the challenges of programming the higher-level logic of complex parallel applications on current petascale and future exascale computing systems. ExM provides an expressive, high-level functional programming model that yields massive concurrency through implicit, automated parallelism. It comprises a judicious integration of dataflow constructs, highly parallel function evaluation, and extremely scalable task generation. It directly addresses the intertwined programmability and scalability requirements of systems with massive concurrency, while providing a programming model that may be attractive and feasible for systems of much lower scale. We describe here the benefits of the ExM programming and execution model, its potential applications, and the performance of its current implementation.
We introduce CnC-Python, an implementation of the Concurrent Collections (CnC) programming model for Python computations. Python has been gaining popularity in multiple domains because of its expressiveness and high productivity. However, exploiting multicore parallelism in Python is comparatively tedious since it requires the use of low-level threads or multiprocessing modules. CnC-Python, being implicitly parallel, avoids the use of these low-level constructs, thereby enabling Python programmers to achieve task, data, and pipeline parallelism in a declarative fashion while only being required to describe the program as a coordination graph with serial Python code for individual steps. The CnC-Python runtime requires that Python objects communicated between steps be serializable (picklable), but imposes no restriction on the Python idioms used within the serial code. Programs written in CnC-Python are deterministic in that they produce the same outputs for the same inputs regardless of the execution schedules used by the runtime. Our implementation of CnC-Python uses the CnC Habanero-Java (HJ) runtime system, the Babel compiler to generate glue code for invoking Python from Java, and the multiprocessing module available in standard distributions of Python. However, the CnC-Python user need not be aware of Java, Babel, HJ, or any other component in our runtime to use our system. The HJ-based implementation allows us to reuse a mature CnC runtime for scheduling, and enables us to bypass the well-publicized performance issues with the Python interpreter's global interpreter lock.
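As an illustration only (CnC-Python's actual API differs), the sketch below shows the kind of serial step code and picklable data the model expects. The orchestration here is written explicitly with Python's standard multiprocessing module, which is precisely the low-level plumbing that CnC-Python hides behind its declarative coordination graph; the step names and pipeline are hypothetical.

    import multiprocessing as mp

    # Each "step" is plain serial Python; data exchanged between steps
    # (lists of floats here) must be picklable.
    def lowpass_step(row):
        return [(a + b) / 2.0 for a, b in zip(row, row[1:])]

    def threshold_step(row):
        return [v if v > 0.5 else 0.0 for v in row]

    if __name__ == "__main__":
        rows = [[0.1 * i for i in range(10)] for _ in range(100)]  # picklable input
        with mp.Pool() as pool:
            smoothed = pool.map(lowpass_step, rows)       # data-parallel step
            result = pool.map(threshold_step, smoothed)   # pipeline of steps
        print(len(result), len(result[0]))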
Vikram Adve, Sarita Adve, Rakesh Komuravelli, Matthew D. Sinclair, and Prakalp Srivastava, University of Illinois at Urbana-Champaign
Developing software applications for emerging and future heterogeneous systems with diverse combinations of hardware is significantly harder than for homogeneous multicore systems. In this paper, we identify three root causes that underlie the programming challenges: (1) diverse parallelism models; (2) diverse memory architectures; and (3) diverse hardware instruction set semantics. We believe that these issues must be addressed using a language-neutral, virtual instruction set layer that abstracts away most of the low-level details of hardware, an approach we call Virtual Instruction Set Computing. Most importantly, the virtual instruction set must abstract away and unify the diverse forms of parallelism and memory architectures using only one or two models of parallelism. We discuss how this approach can solve the root causes of the programmability challenges, illustrate the design with an example, and discuss the research challenges that arise in realizing this vision.
Andrew Herdrich, Ramesh Illikkal, Ravi Iyer, Ronak Singhal, Matt Merten, and Martin Dixon, Intel Corporation
Absolute throughput often fails to scale linearly with core count in chip multiprocessors (CMPs) due to contention in shared platform resources, including caches, memory bandwidth, and buses. The addition of simultaneous multithreading (SMT) to CMPs exacerbates this nonlinear scaling by introducing contention at the pipeline-resource level and by increasing the number of active threads in the system, which further increases contention in shared resources and leads to a loss of performance stability and fairness.
This work introduces and evaluates a new form of source-based execution rate control at the processor pipeline level, based on instruction fetch, instruction queue, and reservation station partitioning between SMT threads. The efficacy of these controls is demonstrated through experiments with SPEC workloads on a modified test version of an Intel® microprocessor codenamed Nehalem. This new SMT rate control is presented as a critical building block for restoring the fairness and determinism in performance once inherent in simpler uniprocessors utilizing time-slicing schedulers, and is proposed for inclusion in future microprocessors supporting SMT.
Romain Cledat, Intel Labs; Santosh Pande, Georgia Institute of Technology
The parallelization of algorithms depends in large part on the understanding of data-access patterns. Regular data-structures, such as dense arrays, make pattern analysis easier, as data-dependence graphs can be built to clearly identify computations that can happen in parallel. However, an important class of algorithms, such as machine learning and search, relies on "irregular" data-structures that make heavy use of pointers (trees and sparse graphs, for example). While regular data-structures have well-defined properties with regard to their layout in memory, pointers obfuscate this.
In this paper, we argue that structure can also be found in irregular data-structures, but at a different level of abstraction: the symbolic one. For example, each step of a breadth-first traversal of a binary tree on a node 'n' will visit its 'left' child pointer, followed by its 'right' child pointer. While the relationships between the memory addresses of 'n' and those pointed to by 'left' and 'right' may not exhibit a pattern, there is a definite relationship between the symbolic names 'n', 'n::left', and 'n::right', where '::' denotes a parent-child relationship. These relationships can be used to compute data-dependence graphs just in time and thereby help determine whether two operations can run in parallel.
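A small Python sketch can make the symbolic-naming idea concrete: a breadth-first traversal that records each visited node under a name derived from parent-child relationships rather than from its memory address. This illustrates the concept only, not the authors' trace infrastructure.

    from collections import deque

    class Node:
        def __init__(self, value, left=None, right=None):
            self.value, self.left, self.right = value, left, right

    def bfs_symbolic(root):
        # Breadth-first traversal that names every visited node by its path
        # from the root ('::' separator), independent of memory layout.
        names = {}
        work = deque([(root, "n")])
        while work:
            node, name = work.popleft()
            if node is None:
                continue
            names[name] = node.value
            work.append((node.left, name + "::left"))
            work.append((node.right, name + "::right"))
        return names

    tree = Node(1, Node(2, Node(4)), Node(3))
    print(bfs_symbolic(tree))
    # {'n': 1, 'n::left': 2, 'n::right': 3, 'n::left::left': 4}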
We present a trace-driven approach capable of identifying such relationships and motivate how the information could be used, for example, to improve optimistic parallel execution (such as STMs) as demonstrated by Cledat et al. in [3].
Justin E. Gottschlich, Gilles A. Pokam, and Cristiano L. Pereira, Intel Corporation
To reduce the complexity of debugging multithreaded programs, researchers have developed compile- and run-time techniques that automatically detect concurrency bugs. These techniques can identify a wide range of shared-memory errors, but are sometimes impractical because they produce many false positives, making it difficult to triage and reproduce specific bugs. To address these concerns, we introduce a control structure, called concurrent predicate (CP), which allows programmers to single out a specific bug by specifying the conditions that must be satisfied for the bug to be triggered. Using bugs from a test suite of 23 programs, applications from RADBench, and TBoost.STM, we show how CP is used to diagnose and reproduce bugs that could not otherwise be reproduced using similar techniques.
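The paper targets native multithreaded code, but the idea behind a concurrent predicate can be sketched in Python: an instrumented thread blocks until a programmer-supplied condition over shared state holds, steering execution toward the interleaving suspected of triggering the bug. The class name and semantics below are illustrative, not the paper's actual construct.

    import threading

    class ConcurrentPredicate:
        # Illustrative analogue: pause a thread at an instrumented point until
        # a programmer-specified condition over shared state becomes true.
        def __init__(self, predicate, timeout=5.0):
            self.predicate = predicate
            self.timeout = timeout
            self.cv = threading.Condition()

        def wait_until_true(self):
            with self.cv:
                self.cv.wait_for(self.predicate, timeout=self.timeout)

        def notify(self):
            with self.cv:
                self.cv.notify_all()

    shared = {"initialized": False}
    cp = ConcurrentPredicate(lambda: shared["initialized"])

    def consumer():
        cp.wait_until_true()   # proceed only once the suspect state exists
        print("consumer sees:", shared["initialized"])

    def producer():
        shared["initialized"] = True
        cp.notify()

    t1 = threading.Thread(target=consumer); t1.start()
    t2 = threading.Thread(target=producer); t2.start()
    t1.join(); t2.join()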
The non-uniform memory access (NUMA) architecture has attracted attention as a state-of-the-art multi-core solution for addressing the practical limitations of increasing the number of cores in symmetric multiprocessing systems. In order to utilize this architecture, the scheduling of parallel applications has become an important problem.
In this study, we investigate the performance impact of co-scheduling parallel threads on a recent NUMA platform. From our evaluation, we find that certain applications show significant performance improvements when co-scheduled in the same memory domain. We examine the reason for the improvement and, based on this examination, advocate using the memory-usage patterns of threads to analyze the co-scheduling impact.
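As a minimal illustration of co-scheduling within one memory domain (assuming a Linux system and a hypothetical core list for NUMA node 0), a process can restrict itself to a single node's cores before spawning its worker threads; real topologies should be queried with lscpu or libnuma rather than hard-coded.

    import os
    import threading

    NODE0_CPUS = {0, 1, 2, 3, 4, 5, 6, 7}   # hypothetical core IDs of NUMA node 0

    def worker(n):
        return sum(i * i for i in range(n))  # stand-in for memory-bound work

    if __name__ == "__main__":
        # Linux-only: keep the whole process (and hence its threads) on node 0,
        # so co-scheduled threads share that node's memory controller and cache.
        os.sched_setaffinity(0, NODE0_CPUS)
        threads = [threading.Thread(target=worker, args=(1_000_000,))
                   for _ in range(8)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()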