MoCE: A Mixture-of-Context Aware Experts Framework for Troubleshooting Internet-scale Services

Vipul Harsh, Conviva and Carnegie Mellon University; Sayan Sinha, Conviva and Georgia Tech; Henry Milner, Conviva; B. Aditya Prakash, Conviva and Georgia Tech; Vyas Sekar and Hui Zhang, Conviva and Carnegie Mellon University

Modern Internet-scale services need to rapidly identify the root causes of customer-impacting incidents and remediate them. While there are many algorithms for root cause analysis (including LLM-assisted solutions), they have significant limitations in coverage, extensibility, and scalability due to the diversity of incidents that can occur at Internet scale and the complexity of telemetry analysis. We argue that root cause analysis needs a paradigm shift from algorithm development to a systems approach.

To this end, we introduce a mixture of context-aware experts framework in which each “expert” represents the exploration of a root-cause hypothesis. To enable rapid development of new experts and allow computational reuse across the ensemble for scalability, we design an abstraction that expresses an expert as a dataflow DAG combining relational, stateful, and statistical operations. To ensure scalability and extensibility, we develop a DAG runtime system that lazily schedules the execution of DAG nodes. We implement this idea in MoCE and demonstrate its value using a mix of real-world incident data from four large application analytics providers and synthetically generated incidents. We show that many existing and novel approaches can be expressed succinctly in our framework. We find that MoCE achieves high root cause analysis (RCA) accuracy (>95%) across diverse incidents, compared to 34% for the closest single expert (including prior works), thereby achieving high coverage. We also show the value of the mixture paradigm and the lazy DAG runtime using controlled experiments.
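
To make the idea of experts as dataflow DAGs with lazy, shared execution more concrete, here is a minimal Python sketch. It is not MoCE's API: the `Node` and `LazyRuntime` classes, the caching-by-node-name scheme, the toy session data, and the 3x outlier threshold are all assumptions chosen for illustration. It shows two hypothetical experts (one per CDN, one per ISP) that share a common upstream node, and a third node that is declared but never requested.

```python
class Node:
    """One DAG node: a named operation applied to the outputs of its parents.
    (Hypothetical structure for illustration, not MoCE's abstraction.)"""

    def __init__(self, name, fn, parents=()):
        self.name = name
        self.fn = fn
        self.parents = tuple(parents)


class LazyRuntime:
    """Runs a node only when its output is requested, and caches each result
    by node name so that experts sharing a sub-DAG reuse the computation."""

    def __init__(self):
        self.cache = {}
        self.evaluations = []  # names of nodes actually executed, in order

    def evaluate(self, node):
        if node.name not in self.cache:
            inputs = [self.evaluate(parent) for parent in node.parents]
            self.evaluations.append(node.name)
            self.cache[node.name] = node.fn(*inputs)
        return self.cache[node.name]


# Toy telemetry rows: (cdn, isp, buffering_ratio). Values are made up.
SESSIONS = [
    ("cdn_a", "isp_x", 0.01),
    ("cdn_a", "isp_y", 0.02),
    ("cdn_b", "isp_x", 0.30),
    ("cdn_b", "isp_y", 0.28),
]


def mean_by(index):
    """Relational step: group rows by one column and average the metric."""
    def group(rows):
        groups = {}
        for row in rows:
            groups.setdefault(row[index], []).append(row[2])
        return {key: sum(vals) / len(vals) for key, vals in groups.items()}
    return group


def worst_if_outlier(means, factor=3.0):
    """Statistical step: report the worst group only if its mean exceeds
    `factor` times the average of the remaining groups (assumed threshold)."""
    worst = max(means, key=means.get)
    others = [value for key, value in means.items() if key != worst]
    baseline = sum(others) / len(others)
    return worst if means[worst] > factor * baseline else None


load = Node("load", lambda: SESSIONS)
by_cdn = Node("mean_by_cdn", mean_by(0), [load])
by_isp = Node("mean_by_isp", mean_by(1), [load])
cdn_expert = Node("cdn_expert", worst_if_outlier, [by_cdn])
isp_expert = Node("isp_expert", worst_if_outlier, [by_isp])
never_requested = Node("never_requested", mean_by(1), [load])

runtime = LazyRuntime()
cdn_result = runtime.evaluate(cdn_expert)  # "cdn_b"
isp_result = runtime.evaluate(isp_expert)  # None: no ISP stands out
```

In this sketch the `load` node runs once even though both experts depend on it, and `never_requested` is never executed because nothing asks for its output. A production runtime would also need to handle the stateful operations the abstract mentions, which this example leaves out.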

NSDI '26 Open Access Sponsored by King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {316622,
author = {Vipul Harsh and Sayan Sinha and Henry Milner and B. Aditya Prakash and Vyas Sekar and Hui Zhang},
title = {{MoCE}: A {Mixture-of-Context} Aware Experts Framework for Troubleshooting Internet-scale Services},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {2131--2149},
url = {https://www.usenix.org/conference/nsdi26/presentation/harsh},
publisher = {USENIX Association},
month = may
}
