Tommy Li and Vlad Seliverstov, ClickHouse
SREs are expected to ingest and analyze massive streams of metrics, logs and traces across multi-tenant Kubernetes environments. Traditional approaches rely on central message queues and/or vendor pipelines, which struggle with cost, reliability and operational overhead at the GB/s scale. In this talk, we present a practical reference architecture for a queue-less OpenTelemetry pipeline built entirely on Kubernetes using the OpenTelemetry Collector, the OpenTelemetry Operator and OpAMP. We explore how we run a large fleet of collectors, roll out config changes safely without disrupting ingestion and handle failures without data loss. Our system uses backpressure, autoscaling and object storage for overflow to support high throughput without needing Kafka or Pulsar. You’ll learn the concrete pipeline configuration, schema design choices and practices that make it possible to support trillions of events per day reliably at reasonable cost.

Tommy Li is a Senior Software Engineer at ClickHouse, working on the massive-scale observability platform that supports ClickHouse Cloud. Prior to ClickHouse, Tommy built Postgres infrastructure at scale at companies like Brex and Datadog.

Vlad Seliverstov leads the internal observability team at ClickHouse, overseeing the monitoring of ClickHouse Cloud, which handles over 200 petabytes of data and over a quadrillion events. With more than a decade of SRE experience at Datadog, Dropbox, and Facebook, Vlad now focuses on building and scaling ClickHouse’s in-house observability platform.

author = {Tommy Li and Vlad Seliverstov},
title = {Reliable {OpenTelemetry} at Scale: No Queue, No Problem},
year = {2026},
address = {Seattle, WA},
publisher = {USENIX Association},
month = mar
}