Davide Rovelli, Università della Svizzera Italiana and SAP SE; Pavel Chuprikov, Télécom Paris and Institut Polytechnique de Paris; Philipp Berdesinski, turba; Ali Pahlevan, SAP SE; Patrick Jahnke, turba; Patrick Eugster, Università della Svizzera Italiana
Failure detection is one of the most fundamental primitives on which distributed fault-tolerant services and applications rely to achieve liveness. Typical failure detectors rely on timeouts that must account for the unpredictability of interaction times among remote processes, caused by resource contention in the network and in end-host processors. While modern (gray) failure detectors have improved at detecting a wide range of failures, the problem of prohibitively large and unreliable timeouts for crash failures persists, hampering the performance of both the failure detectors themselves and the modern μs-scale services built on top of them.
We propose a novel [f]ully rel[i]able failure [de]tector (FiDe) that can report the crash of a remote process in a datacenter in less than 30 μs (7.2× faster than the current state of the art) with extremely high reliability, thanks to a ground-up design that provides stable end-to-end process interactions. By reliably lowering worst-case crash detection time, FiDe enables a class of algorithms that can boost coordination services even in the absence of failures. We devise two novel, highly efficient FiDe-based consensus protocols and integrate them into a key-value store and a synchronization service, improving throughput by up to 2.23× and reducing latency down to 0.46×.
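For readers unfamiliar with the timeout-based approach that the abstract describes as typical, the following is a minimal sketch of a heartbeat failure detector. It is not FiDe's design; the class names, the `timeout_s` parameter, and the injectable clock are all hypothetical and chosen only for illustration. It shows why the timeout must be large enough to absorb delayed heartbeats, which in turn bounds how quickly a crash can be reported.

```python
import time
from typing import Callable, Dict


class FakeClock:
    """Manually advanced clock, used only to make the example deterministic."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


class TimeoutFailureDetector:
    """Suspects a process once no heartbeat has arrived for `timeout_s` seconds.

    Illustrative assumption: one shared timeout for all monitored processes.
    """

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, process_id: str) -> None:
        # Record the local arrival time of the latest heartbeat.
        self.last_seen[process_id] = self.clock()

    def suspected(self, process_id: str) -> bool:
        last = self.last_seen.get(process_id)
        if last is None:
            # A process that never sent a heartbeat is treated as suspected.
            return True
        # Crash detection time can never be lower than the timeout itself.
        return self.clock() - last > self.timeout_s


clock = FakeClock()
detector = TimeoutFailureDetector(timeout_s=0.5, clock=clock)
detector.heartbeat("p1")
clock.now = 0.4   # heartbeat is late but still within the timeout
late_but_alive = detector.suspected("p1")
clock.now = 0.6   # timeout exceeded: p1 is now suspected
timed_out = detector.suspected("p1")
```

In this sketch, a heartbeat that is merely delayed by contention is indistinguishable from a crash once the timeout expires, so a conservative deployment picks a large timeout and pays for it in detection latency.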
USENIX ATC '25 Open Access Sponsored by King Abdullah University of Science and Technology (KAUST)

author = {Davide Rovelli and Pavel Chuprikov and Philipp Berdesinski and Ali Pahlevan and Patrick Jahnke and Patrick Eugster},
title = {{FiDe}: Reliable and Fast Crash Failure Detection to Boost Datacenter Coordination},
booktitle = {2025 USENIX Annual Technical Conference (USENIX ATC 25)},
year = {2025},
isbn = {978-1-939133-48-9},
address = {Boston, MA},
pages = {765--788},
url = {https://www.usenix.org/conference/atc25/presentation/rovelli},
publisher = {USENIX Association},
month = jul
}