FAST '14 Full Training Program
Half Day Morning
Mark Grover is a committer on Apache Bigtop, a committer and PMC member on Apache Sentry (incubating) and a contributor to Apache Hadoop, Apache Hive, Apache Sqoop and Apache Flume. He is also a section author of O’Reilly’s Programming Hive. Mr. Grover presently works as a software engineer at Cloudera and frequently presents on Hadoop ecosystem technologies at software conferences.
Originally inspired by Google's GFS and MapReduce papers, Apache Hadoop is an open source framework offering scalable, distributed, fault-tolerant data storage and processing on standard hardware. This session explains what Hadoop is and where it best fits into the modern data center. You'll learn the basics of how it offers scalable data storage and processing, some important "ecosystem" tools that complement Hadoop's capabilities, and several practical ways organizations are using these tools today. Additionally, you'll learn about the basic architecture of a Hadoop cluster and some recent developments that will further improve Hadoop's scalability and performance.
This session is intended for those who are new to Hadoop and are seeking to understand what Hadoop is, the ways that organizations are using it, and how it compares to and integrates with other systems. It assumes no prior knowledge of Hadoop, and explanations of technical topics like MapReduce and HDFS replication are clear and concise, making it appropriate for anyone attending the conference.
- What Hadoop is and how organizations are using it
- How HDFS provides reliability and high throughput
- How MapReduce enables parallel processing on large data sets
- Explanations of some popular open source tools that integrate with Hadoop
- Typical architecture of a Hadoop cluster
- Considerations for hosting a Hadoop cluster
- Emerging trends in the design and implementation of Hadoop
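The MapReduce model covered above can be sketched in a few lines of Python. This toy `word_count` (the function names are illustrative, not from the session materials, and no Hadoop installation is required) simulates on one machine the map, shuffle, and reduce phases that Hadoop distributes across a cluster:

```python
import itertools

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum all counts emitted for the same word.
    return (word, sum(counts))

def word_count(lines):
    # Simulate the shuffle: sort mapper output by key so that all pairs
    # for one word are adjacent, then group them and reduce each group.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in itertools.groupby(pairs, key=lambda kv: kv[0])
    )

print(word_count(["the cat sat", "the dog"]))
```

In a real cluster, the mapper and reducer run as separate tasks on many nodes, and the framework performs the sort-and-group shuffle over the network; the logic per phase is exactly this simple.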
W. David Schwaderer has a master's degree in Applied Mathematics from the California Institute of Technology and an MBA from the University of Southern California. He is the former Editor-in-Chief of the VERITAS Architect Network and the Symantec Technology Network. He presently consults for Samsung Semiconductor, Inc.'s Silicon Valley Systems Architecture Lab, where he assists world-class engineers in developing Flash memory storage innovations that will intercept your family's future.
In all his writings and training seminars, David applies Einstein's (disputed) observation that "Everything should be made as simple as possible, but not simpler." As a multidisciplinary technologist, he has authored technical books on a wide spectrum of topics spanning data storage systems, data management, communication signaling, C language programming, ASIC core interfacing, and digital image processing. He has seven books on innovation planned following 15 years of intense research on the topic. His 12th, and possibly last, technical book, co-authored with Jason Resch, is titled "Exabyte Data Preservation, Postponing the Inevitable."
David has presented at IEEE conferences, Stanford, MIT, Intel, Google, Sun/Oracle Labs, and across Silicon Valley. His four innovation Google TechTalks on YouTube have recorded nearly 39,000 views. At his recent Joint IEEE Comsoc-CEsoc SCV presentation titled "Broadcast Storage for Video-Intensive Worlds", he was accorded the title "Silicon Valley Icon."
On a good day, Google Web searches for "W. David Schwaderer" indicate about 1.5 million hits. But on a bad day, it's only around 900,000.
Erasure Code storage applications (RAID 6, Object Storage, Information Dispersal, etc.) are all the rage, and deservedly so. They have intrinsic, engineering beauty and elegance that merit front-row seats in deep, advanced-technology discussions. But mastering Erasure Code principles can quickly prove challenging, if not impossible, because Erasure Coding's simple principles are typically steeped in academic obfuscation. This has historically presented impenetrable obstacles to uncounted intrepid, serious, and competent engineers—maybe even you. Luckily, that's totally unnecessary.
This presentation's goal is to arm aspiring, inquisitive engineers with Erasure Code foundational insights, intuition, and fundamental understandings that enable them to totally dominate Erasure Code discussions, both on their home courts and on their own terms.
Make no mistake: this session intends to be fun, but technically informative at a deep, visceral level. There will even be a Python programming demonstration, time allowing. Erasure Code principles likely will never be made more accessible than what you experience here. This is the Erasure Code train to catch; don't be left behind.
- Numbers, Counting Ducks, Clubs, and Special Club Members Such as 0 and 5
- Elementary School Arithmetic—Addition and Multiplication
- Powers and Inverse Powers—2x2x2x2 = 2^4
- Solving High School Equations—Determining Apple and Banana Prices
- The Parallel Universes Around Us—Star Trek Stuff or Just GF(N)s?
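As a small taste of the kind of Python demonstration promised above, here is a minimal sketch (function names are illustrative, not from the session materials) of the simplest erasure code of all: single-parity XOR over GF(2), the idea underlying RAID 4 and RAID 5. Addition in GF(2) is XOR, so the parity of all data blocks lets you reconstruct any one lost block:

```python
def parity(blocks):
    # Bytewise XOR of equal-length blocks: addition in GF(2).
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def recover(surviving, parity_block):
    # XOR of the parity with every surviving data block yields the
    # single missing block, because each byte XORs itself away.
    return parity(list(surviving) + [parity_block])

data = [b"abcd", b"efgh", b"ijkl"]
p = parity(data)
assert recover([data[0], data[2]], p) == data[1]  # lost block restored
```

Real erasure codes such as Reed-Solomon generalize this from GF(2) to GF(2^8) and beyond, tolerating multiple simultaneous erasures, but the arithmetic-in-a-finite-field intuition is the same.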
Half Day Afternoon
Jeff Darcy (S3) has worked on network and distributed storage systems for 20 years, including an instrumental role in developing MPFS (a precursor of modern pNFS) while at EMC. He is currently a member of the GlusterFS architecture team at Red Hat and frequently gives talks and tutorials about topics related to cloud storage.
Cloud storage has become an important part of both the way that modern compute clouds are built and the service that they provide for users. This tutorial will explain what cloud storage systems have in common and what makes each one different, enabling attendees to select or build the right system for their specific needs.
This tutorial is intended primarily for people who wish to implement their own task-specific cloud storage systems, and secondarily for those who wish to understand the tradeoffs implicit in existing cloud storage systems.
- Types of cloud storage: service for a cloud provider, service for cloud users, or service for consumers
- Tradeoffs between consistency, performance, and availability
- Special requirements: security and privacy, legal and regulatory compliance
- Common techniques: membership and leader election, consistent hashing, vector clocks, Merkle trees, Bloom filters
- Case studies: existing systems representing different tradeoffs and techniques
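To make one of the techniques listed above concrete, here is a minimal consistent-hashing sketch in Python (a hypothetical illustration, not taken from any particular system). Each node is hashed onto a circle at many virtual points; a key belongs to the first node clockwise from its hash, so removing a node remaps only the keys that lived on it:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Place `vnodes` virtual points per node on the hash circle.
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First virtual point clockwise from the key's hash (wrapping).
        idx = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("some-object-key")
```

Compare this with naive `hash(key) % num_nodes` placement, where changing the node count remaps nearly every key; that difference is why consistent hashing appears in so many of the case-study systems.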
Brent Welch is a senior software engineer at Google. He was Chief Technology Officer at Panasas and has also worked at Xerox PARC and Sun Microsystems Laboratories. Brent has experience building software systems from the device driver level up through network servers, user applications, and graphical user interfaces. While getting his Ph.D. at UC Berkeley, Brent designed and built the Sprite distributed file system. He is the creator of the TclHttpd web server and the exmh email user interface, and the author of Practical Programming in Tcl and Tk.
This tutorial is oriented toward administrators and developers who manage and use HPC systems, and especially for those involved with storage systems in these environments. Storage is often a critical part of the HPC infrastructure. An important goal of the tutorial is to give the audience the foundation for effectively comparing different storage system options, as well as a better understanding of the systems they already have.
Cluster-based parallel storage technologies are used to manage millions of files, thousands of concurrent jobs, and performance that scales from 10s to 100s of GB/sec. This tutorial will examine current state-of-the-art high-performance file systems and the underlying technologies employed to deliver scalable performance across a range of scientific and industrial applications.
The tutorial starts with a look at storage devices, SSDs in particular, which are growing in importance in all storage systems. Next we look at how a file system is put together, comparing and contrasting SAN file systems, scale-out NAS, and object-based parallel file system architectures.
Topics include:
- Scaling the data path
- Scaling metadata
- Fault tolerance
- Manageability

Specific systems are discussed, including Lustre, GPFS, PanFS, HDFS (the Hadoop Distributed File System), OpenStack, and the NFSv4.1 standard for parallel I/O. We continue up the stack to discuss the MPI-IO middleware often used in large parallel programming environments for efficient I/O at scale.
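As a concrete illustration of scaling the data path, here is a sketch (with hypothetical parameters, not modeling any particular file system's actual layout) of how round-robin striping maps a file byte offset to a storage server, which is how parallel file systems turn one large sequential I/O into many concurrent server I/Os:

```python
def locate(offset, stripe_unit, num_servers):
    # Map a file byte offset to (server index, offset within that
    # server's object) under simple round-robin striping.
    stripe_index = offset // stripe_unit      # which stripe unit overall
    server = stripe_index % num_servers       # round-robin placement
    stripe_row = stripe_index // num_servers  # completed rounds so far
    return server, stripe_row * stripe_unit + offset % stripe_unit

# With a 64 KiB stripe unit across 8 servers, a 512 KiB write touches
# all 8 servers once, so aggregate bandwidth scales with server count.
server, obj_offset = locate(300_000, 64 * 1024, 8)
```

The real systems named above differ in where this mapping lives (client-side layouts, metadata servers, object maps), but the offset arithmetic per stripe is essentially this.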