Anomaly Detection on Golden Signals

Wednesday, June 12, 2019 - 11:00 am–12:00 pm

Yu Chen, Baidu


Anomaly detection on golden signals, including latency, traffic, errors, and saturation, can detect system failures and provide important clues for failure diagnosis. In this talk, we will introduce our algorithm toolbox for anomaly detection on the golden signals.

The toolbox uses historical data from the signals to build appropriate probability models. Alerts are then generated from the probability of each observation under its model. This probability relates directly to the false positive rate of classification and can capture SRE engineers' intuition about how unusual an observation is. Furthermore, because the probability values are comparable across different signals, they make a good feature for failure diagnosis. In our production system, alerting precision ranges from 70% to 90%, and recall is around 90%.
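The abstract does not say which probability models the toolbox uses. As a minimal sketch of the general idea only, the Python below assumes a single signal whose historical values are roughly normally distributed. The function names and the 0.001 threshold are hypothetical, chosen for illustration, and are not taken from the talk.

```python
from statistics import NormalDist, mean, stdev


def fit_model(history):
    """Fit a normal distribution to historical values (an assumed model)."""
    return NormalDist(mu=mean(history), sigma=stdev(history))


def tail_probability(model, observation):
    """Two-sided probability of a value at least this far from the mean."""
    z = abs(observation - model.mean) / model.stdev
    return 2.0 * (1.0 - NormalDist().cdf(z))


def is_anomaly(model, observation, max_false_positive_rate=0.001):
    """Alert when the observation is less likely than the allowed rate.

    Under the assumed model, the threshold is the fraction of normal
    observations that would be flagged, i.e. the false positive rate.
    """
    return tail_probability(model, observation) < max_false_positive_rate


# Hypothetical latency samples in milliseconds.
history = [100, 102, 98, 101, 99, 100, 103, 97]
model = fit_model(history)
print(is_anomaly(model, 101))  # close to the historical mean
print(is_anomaly(model, 120))  # far outside the historical range
```

Because the output is a probability rather than a raw value, the same threshold can be applied to latency, traffic, errors, and saturation alike, which is what makes the values comparable across signals.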

Yu Chen, Baidu

Yu Chen is a Data Architect at the IOP group of Baidu’s SRE department. His work focuses on developing algorithms for alerting and diagnosis, in order to improve the stability of production systems. Previously, he worked at Microsoft Research Asia. His research interests are distributed systems, consensus protocols, search ranking, and query recommendation.

@conference {233217,
author = {Yu Chen},
title = {Anomaly Detection on Golden Signals},
year = {2019},
address = {Singapore},
publisher = {{USENIX} Association},