Sundial: Fault-tolerant Clock Synchronization for Datacenters

Authors: 

Yuliang Li, Google Inc. and Harvard University; Gautam Kumar, Hema Hariharan, Hassan Wassel, Peter Hochschild, and Dave Platt, Google Inc.; Simon Sabato, Lilac Cloud; Minlan Yu, Harvard University; Nandita Dukkipati, Prashant Chandra, and Amin Vahdat, Google Inc.

Abstract: 

Clock synchronization is critical for many datacenter applications such as distributed transactional databases, consistent snapshots, and network telemetry. As applications demand higher performance and datacenter networks move toward ultra-low latency, we need a submicrosecond-level bound on time-uncertainty to reduce transaction delay and enable new network management applications (e.g., measuring one-way delay for congestion control). State-of-the-art clock synchronization solutions focus on improving clock precision but may incur a significant time-uncertainty bound in the presence of failures. This significantly affects applications because in large-scale datacenters, temperature-related, link, device, and domain failures are common. We present Sundial, a fault-tolerant clock-synchronization system for datacenters that achieves a ~100ns time-uncertainty bound under various types of failures. Sundial provides fast failure detection based on frequent synchronization messages in hardware. Sundial enables fast failure recovery using a novel graph-based algorithm to precompute a backup plan that is generic to failures. Through experiments in a >500-machine testbed and large-scale simulations, we show that Sundial can achieve a ~100ns time-uncertainty bound under different types of failures, which is more than two orders of magnitude lower than state-of-the-art solutions. We also demonstrate the benefit of Sundial on applications such as Spanner and Swift congestion control.
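To illustrate why fast failure detection and recovery matter for the time-uncertainty bound, the following is a minimal sketch of a simple linear drift model, in which the bound grows with the time elapsed since the last successful synchronization. The function name, the 200 ppm maximum drift rate, and the example intervals are assumptions chosen for illustration; they are not taken from the abstract and may not match how Sundial computes its bound.

```python
def uncertainty_bound_ns(time_since_sync_us, max_drift_ppm=200.0, base_error_ns=0.0):
    """Worst-case clock error (ns) under an assumed linear drift model.

    time_since_sync_us: microseconds since the last successful sync.
    max_drift_ppm:      assumed worst-case oscillator drift, parts per million.
    base_error_ns:      assumed residual error right after a sync.
    """
    # 1 ppm of drift over 1 microsecond accumulates 0.001 ns of error,
    # so ppm * microseconds / 1000 gives nanoseconds.
    return base_error_ns + time_since_sync_us * max_drift_ppm / 1000.0


if __name__ == "__main__":
    # Hypothetical case: a failure is noticed and repaired within 500 us.
    print(uncertainty_bound_ns(500))
    # Hypothetical case: the clock runs unsynchronized for a full second.
    print(uncertainty_bound_ns(1_000_000))
```

Under these assumed numbers, the bound scales linearly with the unsynchronized interval, so shortening the time a clock goes without a valid synchronization message directly shrinks the bound that applications must wait out.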


BibTeX
@inproceedings {258951,
author = {Yuliang Li and Gautam Kumar and Hema Hariharan and Hassan Wassel and Peter Hochschild and Dave Platt and Simon Sabato and Minlan Yu and Nandita Dukkipati and Prashant Chandra and Amin Vahdat},
title = {Sundial: Fault-tolerant Clock Synchronization for Datacenters},
booktitle = {14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)},
year = {2020},
isbn = {978-1-939133-19-9},
pages = {1171--1186},
url = {https://www.usenix.org/conference/osdi20/presentation/li-yuliang},
publisher = {USENIX Association},
month = nov
}
