Cross-System Interaction Failures: Don't Fail through the Cracks

Wednesday, March 20, 2024 - 11:05 am–11:50 am

Tianyin Xu and Xudong Sun, University of Illinois Urbana–Champaign


Modern cloud systems are orchestrated by many independent and interacting subsystems, each specializing in important services such as scheduling, data processing and storage, resource management, etc. Hence, overall cloud system reliability is determined not only by the reliability of each individual subsystem, but also by the interactions between them. With recent practices of microservice and serverless architectures, each individual subsystem becomes simpler and smaller, while their interactions grow in complexity and diversity. We observe that many recent production incidents of large-scale cloud systems are manifested through failures of the interactions across system boundaries, which we term "cross-system interaction failures", or "CSI failures". However, understanding and addressing such failures requires new techniques and practices that are often unavailable or under-developed.

In this talk, we will present our recent work on understanding and preventing cross-system interaction failures. Specifically, we will characterize the many faces of cross-system interactions and their diverse failure modes in modern cloud-native system stacks. We will then discuss the gaps in the current practices and existing techniques in the context of software testing and verification. Lastly, we will present some of the new techniques we developed at the University of Illinois at Urbana-Champaign to address cross-system interaction failures.

Tianyin Xu is an Assistant Professor of Computer Science at the University of Illinois at Urbana-Champaign (UIUC). His research focuses on techniques for the design and implementation of reliable computer systems, especially those that operate at cloud and datacenter scale. He has been on the UIUC List of Teachers Ranked as Excellent five times since he joined the CS department in 2018. He is the recipient of a Jay Lepreau Best Paper Award at OSDI 2016, a Best Paper Award at ASPLOS 2020, a Best Student Paper Award at SIGCOMM 2021, two SIGSOFT Distinguished Paper Awards at ISSTA 2021 and FSE 2021, a Gilles Muller Best Artifact Award at EuroSys 2023, and a CACM Research Highlight. He is also a recipient of an NSF CAREER Award, an Intel Rising Star Faculty Award, a Facebook Distributed Systems Research Award, and a Doctoral Award for Research at the Department of CSE at the University of California San Diego. He is an editor of the SIGOPS Blog and an area chair of the Journal of Systems Research. More details can be found on his webpage:

Xudong Sun is a 5th-year Ph.D. student in the Computer Science Department at the University of Illinois at Urbana–Champaign. He is broadly interested in topics related to system correctness and reliability. Currently, his research focuses on (1) improving the quality of existing cloud systems using systematic testing and (2) building provably correct cloud systems using formal verification.

