Improving Kafka Resilience - Gray Failures Mitigation

Wednesday, 11 October, 2023 - 16:00–16:40

Michelle Valentinova, New Relic


We’ve repeatedly seen a single partially healthy broker cause disproportionate, unexpected problems for Kafka processing.

We cover in depth the different scenarios that allowed this to happen, and the configuration we had chosen with the best intentions that made these outages possible or even made them worse.

The outages vary. In one scenario, shallow broker health checks combined with storage timeouts and a certain producer configuration lead to a 20+ minute full service outage caused by a single partially healthy broker. In another, simply consuming data from a broker in the same availability zone (AZ) as the consumers results in blocked processing after a broker in that AZ reboots. The most complex one is solving the issue with partially healthy brokers when producing with a partition key.

We will provide a summary of the changes that allowed us to make Kafka more resilient and still have it configured to meet business needs.

Michelle Valentinova, New Relic

Michelle Valentinova started her career as a Backend Web Developer. She refocused on Systems Engineering in 2011 and has worked at Amazon, Schibsted Media Group, and most recently New Relic. At New Relic, Michelle is a Senior Site Reliability Engineer in the Kafka Platform Team, making sure that teams have the best possible experience using the Kafka service. Kafka is a key part of the ingestion and processing pipeline at New Relic.

@conference {292139,
author = {Michelle Valentinova},
title = {Improving Kafka Resilience - Gray Failures Mitigation},
year = {2023},
address = {Dublin},
publisher = {USENIX Association},
month = oct