LISA18 Wishlist: Talks and tutorials we’d like to see
By Elizabeth K. Joseph and Brendan Gregg
The Program Committee for LISA18 has spent the past few months reaching out to potential speakers and reviewing their submissions to ultimately build a strong, diverse line-up of topics for this year’s LISA program.
Throughout this process, we have noticed a few gaps, so we’ve put together the following wish list of topics we’d love to see covered.
Root Cause Analysis
When something goes wrong, you need to get to the bottom of it quickly. Does your team have a checklist for this? What tools are you using? Have any stories about a root cause analysis going very wrong, or very right, because of rules or tooling you’ve put into place?
What’s the worst operations disaster you’ve ever experienced? Did the company recover, and how? Was there something about your processes or tooling that saved the day in the end? What did the result of the disaster do to create change in the organization?
Or maybe it’s not a disaster. Do you have a story about something that caused a dramatic shift in your organization with regard to employee happiness, infrastructure reliability, repeatable builds, or automation?
Are you hybrid-cloud, all-in cloud, or multi-cloud? Who is building the cloud images, tuning the operating system for the cloud, running configuration management, applying security patches (or just terminating and spinning up patched instances), setting up auto-scaling and red/black clusters for code deploys, designing for fault tolerance, responding to incidents and doing cloud debugging, simulating availability region outages, and everything else cloud? We want to hear about these topics: what does it take to run a successful cloud deployment in 2018 and beyond?
On-Call Rotation Tips
On-call rotations can be tough on employees. Have you or your organization done anything to make them easier, with tooling, a satisfying on-call cadence, rewards for preventative work, or something else?
It’s been said that just because a server is up, that doesn’t mean the company is up. With a focus on making sure customers are seeing what you want them to see and the infrastructure is holding up under all conditions, what tools are you using to do performance analysis and make sure that your systems run well even under unexpected conditions? Such tools may be OS-centric, like Linux's perf, ftrace, bcc/eBPF, etc., or Windows' ETW, Xperf, WPA, and PerfView. They may also be language-centric, or generic open source projects, like TraceCompass. Perhaps you have some flame graphs to share from Linux's perf or Windows' PerfView?
Debugging sessions at operations-focused events are incredibly popular. Are you able to teach your fellow systems administrators the key things they should know about tcpdump, strace, lsof, and netstat? Or the basics around a popular programming debugger that can help when there’s an emergency and the development team that built the application can’t be reached? For example, we'd love a tutorial on gdb or lldb for basic process or kernel crash dump analysis.
Did you discover a new tool this year that changed how you work? Tell us about it.
With increasingly complicated infrastructures, security has gone beyond firewalls and file permissions. Share the processes and tools you’re using to protect your highly available infrastructure while ensuring services can still effectively communicate. We’d also enjoy seeing a talk on microservices and container security.
A cascading failure in a modern, distributed database cluster can mean disaster for your company. Do you have mitigation steps that can be taken by someone on-call to reduce the risk of data loss and get your infrastructure back to production-grade performance?
Microservices are a hot topic, and within that there are a number of things to dig into. Are you able to share details about how you’re using a service mesh? Or are you using a series of microservices with “sidecar” (logging, metrics) services and have details to share about how to effectively accomplish that?
First it was virtual machines, then it was containers. As the rest of us look at serverless architectures on the horizon, is your organization already doing operational work related to supporting our potential serverless future?
Machine Learning & Artificial Intelligence
Supporting data scientists with a technical platform can be more complicated than supporting unsophisticated users or developers. What have you learned when developing infrastructure for Machine Learning and Artificial Intelligence applications?
More topics are listed in the LISA18 CFP, so if you don’t see a specific topic listed here, don’t let that limit you! We’re still looking for talks in all of those areas covering architecture, culture and engineering.