Interview with Eben Freeman on His Upcoming LISA17 Talk on Queueing Theory

Working in IT operations these days can be challenging: the amount of knowledge you're expected to have seems to grow without end. Fortunately, there are tools that simplify what we need to do. Rather than constantly observing workloads and manually building the capacity we think we'll need, we can rely on autoscaling and demand-driven resources, which pretty much every production-ready cloud provider and service now offers. But what are the implications of handing off that decision-making? Sure, the service in question might be taken care of, but there are systems-level concerns that we should be considering. Do you have the tools and knowledge to do that, though?

Eben Freeman

Eben Freeman is presenting "Queueing Theory in Practice: Performance Modeling for the Working Engineer" at LISA17, and I was eager to talk with him about his ideas on the topic. I reached out to him via email to ask some questions, and I want to share his insights with you. I think you'll see that you really need to attend this talk. Our role isn't just to do things, but to know WHY we do things, and what the implications of those things are.

Matt Simmons: Hi Eben. Thanks so much for talking with me. Can you introduce yourself, and give us a little bit of background as to what you do at Honeycomb, and how you came into this set of responsibilities?

Eben Freeman: Sure! At Honeycomb, our goal is to help developers better understand how their services actually work in production, and bring every member of a team up to the level of its best debugger. Everyone at Honeycomb wears most of the hats, and I work on a variety of things including operations/infrastructure, integrations, and whatever else needs doing.

MS: The LISA17 conference program includes multiple talks this year that mention the Universal Scalability Law in their abstracts. It's surprising to me that Neil Gunther's "Simple Capacity Model" paper from 1993 is now generating this interest. Do you think something has changed to bring this to the forefront, or would we have benefited from using these models all along?
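(A quick gloss, since the formula comes up in several of the talks: Gunther's Universal Scalability Law models the relative capacity of a system running on N processors or nodes as C(N) = N / (1 + α(N − 1) + βN(N − 1)), where the α term captures contention for shared resources and the β term captures coherency cost, the overhead of keeping nodes consistent with one another. This summary is mine, not Eben's.)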

EF: Surprising indeed! But multiple perspectives on similar themes are always interesting, and I'm looking forward to it!

I suspect that the seeming resurgence of interest in this topic might partly be because scalability questions are increasingly front of mind, even for smaller teams. For data-driven products, like the sorts of things we work on at Honeycomb, scalability matters early in the life of the company, because there's a lot of data per customer.

At the same time, cloud services make it tractable for small teams to build and scale these sorts of systems. And I think it's natural for engineers to want to approach that quantitatively, and ask, "What resources do I really need for this workload?" Although it's no magical silver bullet, queueing theory is a logical place to look for answers.
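To make that concrete, here is a minimal sketch of the kind of answer a queueing model can give. It's my illustration, not an excerpt from Eben's talk, and the arrival rate, service rate, and latency target are invented numbers. The classic M/M/1 model predicts a mean response time of 1/(μ − λ) for a server with service rate μ and arrival rate λ, which explodes as utilization λ/μ approaches 100%:

    # Rough capacity estimate using the M/M/1 mean-response-time formula,
    # W = 1 / (mu - lam). Treats a fleet of n servers as n independent
    # M/M/1 queues with the load split evenly -- a simplification, but a
    # useful first approximation. All numbers here are hypothetical.

    def mm1_response_time(lam, mu):
        """Mean time in system for an M/M/1 queue (seconds)."""
        if lam >= mu:
            return float("inf")  # unstable: work arrives faster than it is served
        return 1.0 / (mu - lam)

    total_arrival_rate = 900.0  # requests/second across the whole service
    service_rate = 100.0        # requests/second one server can handle
    latency_target = 0.05       # 50 ms mean response time

    n = 1
    while mm1_response_time(total_arrival_rate / n, service_rate) > latency_target:
        n += 1

    per_server = total_arrival_rate / n
    print(f"servers needed: {n}")
    print(f"utilization: {per_server / service_rate:.0%}")
    print(f"predicted mean latency: "
          f"{1000 * mm1_response_time(per_server, service_rate):.1f} ms")

The qualitative lesson survives the simplifying assumptions: latency is wildly nonlinear in utilization, so the last bit of headroom costs far more than intuition suggests.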

MS: I know that queueing theory can get pretty deep (pun firmly intended). What level of math familiarity should someone have to take away a good understanding from your talk?

EF: No special math background needed! I want to echo the thesis of a really fantastic talk by Jake VanderPlas in 2016 called "Statistics for Hackers": "If you can write a for loop, you can do statistics." The same applies to performance modeling: a model is of limited value if it's incomprehensible!
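In that spirit, here's what "statistics with a for loop" can look like: a bootstrap confidence interval for a median, which just resamples the data over and over. This is my illustration of the VanderPlas thesis, not an example from either talk, and the latency sample is made up:

    import random

    random.seed(42)
    latencies_ms = [12, 15, 11, 90, 14, 13, 16, 210, 12, 15, 13, 14]  # made-up sample

    # Bootstrap: resample the data with replacement many times and watch
    # how the median moves around.
    medians = []
    for _ in range(10_000):
        resample = [random.choice(latencies_ms) for _ in latencies_ms]
        medians.append(sorted(resample)[len(resample) // 2])

    medians.sort()
    lo, hi = medians[int(0.025 * len(medians))], medians[int(0.975 * len(medians))]
    observed = sorted(latencies_ms)[len(latencies_ms) // 2]
    print(f"median latency: ~{observed} ms (95% bootstrap CI: {lo}-{hi} ms)")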

MS: What are some of the techniques you would recommend to instrument macroservices (which many people still have to deal with) in a way that provides transparency, or at least leverage, so that we can still make use of the models we develop?

EF: The same general principles apply for any kind of service. One thing I'd emphasize is that understanding the shape of a latency distribution is incredibly informative for any service. Do all requests take approximately the same time, or is there a long tail of slow requests?

It's tempting to eagerly aggregate metrics and keep only, say, 50th-, 90th-, and 99th-percentile latencies for a service. But those numbers might hide a big part of the story. Strategies that make sense for a near-constant distribution are bad for heavy-tailed distributions, and vice versa. So if you don't have the ability to quickly visualize that latency distribution (and then dig into individual outliers to figure out why they were slow), that's probably worth investing in.
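A quick synthetic simulation (my sketch, with invented distributions) shows how that hiding happens. Below, two services have nearly identical 50th-, 90th-, and 99th-percentile latencies, but one conceals a small population of requests that are a hundred times slower:

    import random

    random.seed(7)

    def percentile(xs, p):
        xs = sorted(xs)
        return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]

    def fast():
        return random.gauss(100, 5)  # a "typical" request, around 100 ms

    service_a = [fast() for _ in range(200_000)]
    # Service B: 0.5% of requests are ~100x slower -- rare enough that
    # p50/p90/p99 barely move, but the tail tells a different story.
    service_b = [fast() if random.random() < 0.995 else fast() * 100
                 for _ in range(200_000)]

    for name, xs in [("service A", service_a), ("service B", service_b)]:
        p = [round(percentile(xs, q)) for q in (50, 90, 99, 99.9)]
        print(f"{name}: p50={p[0]} p90={p[1]} p99={p[2]} "
              f"p99.9={p[3]} max={round(max(xs))}")

If you only kept p50/p90/p99, the two services would look interchangeable; the p99.9 and max columns are where they diverge by two orders of magnitude.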

MS: Most IT operations folks don't have mathematics degrees. Do you know of a good resource for learning enough background on the mathematics of queueing theory that most ops people can apply it?

EF: Ah, that's a great question, and I don't know if there's a single best reference. Neil Gunther's Guerrilla Capacity Planning is of course a good practically minded book on the subject. Those looking for a very thorough treatment of the mathematics might enjoy Mor Harchol-Balter's textbook Performance Modeling and Design of Computer Systems, which covers the theory in detail with a view toward actual applications in computer systems. For a more approachable overview, Baron Schwartz has some terrific blog posts on the subject, so his talk is sure to be a highlight at LISA.

I also think that a lot of the value in queueing theory lies less in mastering some gnarly math and more in taking its fundamental ideas as a guide to asking good questions and making reasonable design decisions. By way of analogy, not every software engineer needs to be an algorithms expert, but it's still helpful to be able to recognize, "This code has quadratic runtime, but it should be linear instead," for example.
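For anyone who wants to see that recognition in miniature, here's a generic example (mine, not from the talk): a membership test against a list inside a loop is quadratic, while the same logic with a set is linear.

    # Quadratic: `in` on a list scans it, so this is O(n^2) overall.
    def find_duplicates_slow(items):
        seen, dups = [], []
        for item in items:
            if item in seen:   # O(n) scan per item
                dups.append(item)
            else:
                seen.append(item)
        return dups

    # Linear: set membership is O(1) on average, so this is O(n).
    def find_duplicates_fast(items):
        seen, dups = set(), []
        for item in items:
            if item in seen:   # O(1) average lookup
                dups.append(item)
            else:
                seen.add(item)
        return dups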

MS: What do you think the most interesting nascent technologies and techniques will be over the next six to twelve months?

EF: In operations engineering, I'm excited to see advanced service proxies like Envoy become more widely deployed. It's very compelling to have consistent observability, load balancing, service discovery, fault injection, and so on, across all your services, without having to cobble together all those pieces every time. Looking forward to seeing how that technology evolves!

MS: Eben, thank you so much for your time.

Folks, you can catch Eben's talk on Wednesday, November 1, at 2:30 pm in the "Talks II" room.