Effective Straggler Mitigation: Attack of the Clones

Authors: 

Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica, University of California, Berkeley

Abstract: 

Small jobs, which are typically run for interactive data analyses in datacenters, continue to be plagued by disproportionately long-running tasks called stragglers. In the production clusters at Facebook and Microsoft Bing, even after applying state-of-the-art straggler mitigation techniques, these latency-sensitive jobs have stragglers that are on average 8 times slower than the median task in that job. Such stragglers increase the average job duration by 47%. This is because current mitigation techniques all involve an element of waiting and speculation. We instead propose full cloning of small jobs, avoiding waiting and speculation altogether. Cloning of small jobs only marginally increases utilization because workloads show that while the majority of jobs are small, they consume only a small fraction of the resources. The main challenge of cloning, however, is that extra clones can cause contention for intermediate data. We use a technique, delay assignment, which efficiently avoids such contention. Evaluation of our system, Dolly, using production workloads shows that small jobs speed up by 34% to 46% after state-of-the-art mitigation techniques have been applied, using just 5% extra resources for cloning.
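The intuition behind cloning can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from Dolly's implementation; it assumes that each copy of a task straggles independently with the same probability, and that a task is delayed only if all of its clones straggle. The function name, parameters, and the 5% straggler probability are hypothetical values chosen for illustration.

```python
def job_straggle_probability(num_tasks, clones_per_task, p_straggle):
    """Probability that at least one task in a job is delayed by stragglers.

    Assumptions (illustrative, not from the page above):
    - every copy of a task straggles independently with probability p_straggle
    - a task finishes when its fastest clone finishes, so it is delayed
      only if all of its clones straggle
    """
    if num_tasks < 1 or clones_per_task < 1:
        raise ValueError("num_tasks and clones_per_task must be >= 1")
    if not 0.0 <= p_straggle <= 1.0:
        raise ValueError("p_straggle must be within [0, 1]")

    # A task is delayed only when every one of its clones is a straggler.
    task_delayed = p_straggle ** clones_per_task
    # The job is delayed when at least one of its tasks is delayed.
    return 1.0 - (1.0 - task_delayed) ** num_tasks


if __name__ == "__main__":
    for clones in (1, 2, 3):
        prob = job_straggle_probability(num_tasks=10,
                                        clones_per_task=clones,
                                        p_straggle=0.05)
        print(f"clones={clones}: P(job delayed) = {prob:.4f}")
```

Under these assumptions, a 10-task job with no clones is delayed with probability of about 0.40, while three clones per task bring that down to roughly 0.001. Because small jobs have few tasks, the extra copies are cheap, which is consistent with the abstract's point that cloning them adds little to overall utilization.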

BibTeX
@inproceedings {180304,
author = {Ganesh Ananthanarayanan and Ali Ghodsi and Scott Shenker and Ion Stoica},
title = {Effective Straggler Mitigation: Attack of the Clones},
booktitle = {10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13)},
year = {2013},
isbn = {978-1-931971-00-3},
address = {Lombard, IL},
pages = {185--198},
url = {https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/ananthanarayanan},
publisher = {USENIX Association},
month = apr,
}
Public Summary: 

by Jaeyeon Jung

This paper focuses on straggling tasks (tasks that run much longer than others, thus increasing the latency of the corresponding jobs) in cloud frameworks, and their adverse impact on small jobs. The paper first shows, using real production traces from Yahoo!, Facebook, and Bing, that most jobs have a small number of tasks and are therefore affected by stragglers. The paper then shows that existing straggler mitigation strategies are inefficient, especially when dealing with small jobs.

The new idea is to proactively clone at the task level, within a fixed resource utilization budget. The side effect of this approach is that cloned tasks can introduce additional contention within the job for intermediate data. Their system, Dolly, uses an approach called delay assignment to address this issue. The paper presents an extensive evaluation with Facebook and Bing traces, and shows impressive reductions in the overall running times of small jobs.
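The notion of cloning within a fixed resource budget can be sketched as a simple admission check. This is a hypothetical illustration and not Dolly's actual policy: the `Job` type, the `plan_clones` function, the smallest-job-first ordering, and the slot-based accounting are all assumptions made for the example. It does not model delay assignment.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    num_tasks: int


def plan_clones(jobs, cluster_slots, budget_fraction, clones_per_task=2):
    """Decide how many copies of each task to run under a cloning budget.

    Hypothetical policy: visit jobs from smallest to largest and clone a
    job only if the extra slots its clones need still fit in the budget.
    Returns (plan, used), where plan maps job name to copies per task and
    used is the number of extra slots spent on clones.
    """
    budget = int(cluster_slots * budget_fraction)  # extra slots allowed
    used = 0
    plan = {}
    for job in sorted(jobs, key=lambda j: j.num_tasks):
        # Slots needed beyond the one original copy of each task.
        extra = job.num_tasks * (clones_per_task - 1)
        if used + extra <= budget:
            plan[job.name] = clones_per_task
            used += extra
        else:
            plan[job.name] = 1  # run without clones
    return plan, used


if __name__ == "__main__":
    jobs = [Job("report", 200), Job("query-a", 2), Job("query-b", 5)]
    plan, used = plan_clones(jobs, cluster_slots=1000, budget_fraction=0.05)
    print(plan, used)
```

In this example the two small jobs are cloned and the 200-task job is not, because its clones alone would exceed the 50-slot budget. Visiting small jobs first reflects the summary's observation that most jobs have few tasks, so a modest budget can cover many of them.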

The reviewers uniformly felt that the paper was well executed with good ideas (intuitive, simple mechanism overall; empirically validated insight), and presented solid evaluation results.
