How to Not Destroy Your Production Kubernetes Clusters

Qian Ding

Wednesday, December 07, 2022 - 11:40 am–12:10 pm AEDT

Qian Ding, Ant Group

This talk presents real production incidents from managing hundreds of Kubernetes clusters, particularly clusters that scale to 10K+ nodes. We demonstrate that Kubernetes in production can be fragile if not operated skillfully. The triggering operations can be as simple as adding a single node to a cluster or modifying a ConfigMap used by the API server. Yet the chain reactions of such operations may end up causing an entire cluster to stop scheduling pods or to drop a significant number of API requests. We share the lessons learned from these failures and conclude with our best practices for maintaining high cluster availability.

Qian is a staff engineer at Ant Group focusing on site reliability engineering. He is the SRE tech lead for adopting Kubernetes in the Ant Group production environment. He is passionate about adopting and promoting the SRE philosophy for managing large-scale production systems. His current interests include designing SLOs for Kubernetes from the end user's perspective and using SLOs to drive reliability feature development for Kubernetes.

BibTeX

@conference {284879,
author = {Qian Ding},
title = {How to Not Destroy Your Production Kubernetes Clusters},
year = {2022},
address = {Sydney},
publisher = {USENIX Association},
month = dec
}
