Real-World Configuration Management Workshop

Today was the first day at LISA10 and I was very excited to take part in the Real-World Configuration Management Workshop. About 40 people interested in configuration management attended the workshop, which was moderated by Narayan Desai from Argonne National Laboratory and Cory Lueninghoener from Los Alamos National Laboratory.

It started with a quick introduction from each attendee, so we learned where everyone was from (some from well-known companies like Cisco, Yahoo, Orbitz or CNN.com) and what configuration management system they were using. There was a variety of expertise: many people used cfengine, some puppet or bcfg2, and I heard only one mention of chef (ok, I have to admit that was probably me). Most people had a configuration management tool in place, but some had their own custom tools, while others used nothing but regular shell scripts and wanted to know more about configuration management tools.

There was huge interest in the topic and, as Narayan Desai mentioned, the workshop doubled in size compared with last year's. He also noticed that, whereas in other years people were concerned about compliance and technical issues, this time they were more interested in social or political issues (like how do we make the developers use this, how to get changes approved faster by management, etc.).

We quickly identified some of the topics people wanted to talk about, and the workshop continued as an open, friendly discussion where everyone who had something to say on a particular subject contributed their experience. There were definitely people in the room who had been doing this for a long time and had a lot of experience in the field. Even so, the first topic touched on, tool migrations (from one configuration management tool to another), showed that there is no magic solution. Some have been through this and it has not been very easy; basically you have to start from scratch.

It was interesting to find out that most people are running complex multi-OS deployments, which are more difficult to manage and configure. Having a uniform operating system deployment is a big advantage and makes it easier to maintain and manage, but not many people seemed to be able to do this, for various reasons. I guess I've been fortunate to work with startups that are open to this and can do it much more easily. Another interesting point was that some people use system packages (rpms or debs) to deploy their applications as an integrated part of their configuration management system.

Many people were unhappy that there is no practical way to measure and understand each change that is pushed by the configuration management system (you can understand some actions, but they might trigger hidden dependencies, etc.). To protect against 'bad' changes, people use code reviews or a change management board (some go through a complex configuration change request process that can take a long time) and testing/QA environments that can be used to measure the impact of a change. This is made more complex by the fact that no two changes are the same: some are important and affect live production sites, while others just add a dns record or a user. For the latter, some people use "routine changes", simple changes that don't go through a board review (until they break something, of course, in which case they lose their routine status and go through the normal process again).
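To make the "routine changes" idea a bit more concrete, here is a minimal sketch in Python of how such a policy could be tracked. Nobody at the workshop showed code for this; the class name, method names and change types below are all hypothetical and only illustrate the rule that a change type skips board review until it breaks something.

```python
class ChangePolicy:
    """Hypothetical tracker for which change types count as 'routine'."""

    def __init__(self, routine_types):
        # Change types that are currently allowed to skip board review.
        self.routine_types = set(routine_types)

    def needs_board_review(self, change_type):
        # Anything that is not explicitly routine goes through the board.
        return change_type not in self.routine_types

    def record_breakage(self, change_type):
        # A routine change that broke something loses its routine status
        # and goes back through the normal review process.
        self.routine_types.discard(change_type)


policy = ChangePolicy({"dns_record", "add_user"})
dns_before = policy.needs_board_review("dns_record")
policy.record_breakage("dns_record")
dns_after = policy.needs_board_review("dns_record")
```

In this sketch `dns_before` is `False` and `dns_after` is `True`, while a change type that was never listed as routine (say, a kernel upgrade) always needs review.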

Everyone has unique problems based on the tool they use and their own particular organization and product. For example, I might build my sudoers file from puppet or chef with 2-3 groups and see this as something trivial rather than a big challenge, while at Cisco the sudoers file is apparently about 3MB in size and very complex, with different levels of access for various systems.

Another important topic was social problems: communication and interaction with developers. Some are strict and don't allow developers any access to production machines, while others give them restricted access to configuration management (or access that requires a review) so that they can make changes. Even in such cases, sysadmins usually deploy to production themselves, or at least review and approve the change before it goes live.

Configuration management is much more common these days, and any good sysadmin (or one that builds reliable systems, anyway) will deploy a configuration management tool. The challenges sysadmins are facing are developer-type issues, and we can learn a lot from development best practices. Developers are already on board with much better tools and technologies, and many of these challenges can be solved with their help.

How do you evaluate a configuration management tool? How do you know it is the best one for your organization today and for the years to come? These are important questions that are not easily answered. Everyone had a personal preference for the tool they used, but there are some criteria that can help people trying to make such a decision: documentation, ease of use, features, and how healthy the community behind the tool is. That last point, the community, is very important and usually overlooked until you have a problem that you can't solve and need to go to a mailing list or irc channel for answers. According to the cfengine users, it is hard to get answers from its community, and in most cases you have to figure things out by yourself. BCFG2, Puppet and Chef seem to have friendlier communities, especially on their irc channels. Some seemed to worry about how to scale their configuration management system; Narayan Desai was not worried about scaling to many systems, but about scaling configuration management systems to many system administrators.

Orchestration and command and control, cloud computing and virtualization were the next fields touched on. This opened the discussion about golden images and how they fit into configuration management. Some people like to put as much as possible in the image, while others like to bake as little as possible into the base image: just enough to bootstrap the configuration management tool and run it. I certainly agree with the latter, and where it makes sense we can definitely put more things into the image (slow and time-consuming ones in general) to speed up the process, but overall most of the configs will come from the first configuration management run. It is obvious to everyone that we are moving towards systems that are self-service and auto-provisioned as needed.

At the end of the workshop, each person gave a closing remark, sharing one good experience of something cool solved with their configuration management tool. It is obvious that configuration management is going mainstream and that it allows system administrators to be more effective and to support a much larger number of servers with the same number of people. There was also a question about what features are missing from configuration management tools. In general, people wanted to see diffs between systems and diffs between changes, and to know the status of their machines: are they compliant or not? Better frontends and dashboards would also be nice.
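As a rough illustration of the "diffs between systems" and "compliant or not" wishes, here is a minimal Python sketch. It assumes each system's state has already been collected into a plain dictionary; the function names, host names and package versions are made up for the example and do not come from any particular tool.

```python
def diff_systems(a, b):
    """Return {key: (value_on_a, value_on_b)} for every key that differs.

    A key missing on one side is reported as None on that side, so this
    sketch cannot tell "missing" apart from an explicit None value.
    """
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


def is_compliant(desired, actual):
    """True if every desired setting is present with the desired value."""
    return all(actual.get(key) == value for key, value in desired.items())


# Hypothetical package inventories for two web servers.
web1 = {"ntp": "4.2.6", "sudo": "1.7.4"}
web2 = {"ntp": "4.2.6", "sudo": "1.7.2", "curl": "7.21"}

differences = diff_systems(web1, web2)
web1_ok = is_compliant({"sudo": "1.7.4"}, web1)
web2_ok = is_compliant({"sudo": "1.7.4"}, web2)
```

Here `differences` contains the `sudo` version mismatch and the `curl` package that exists only on the second host, and only `web1` passes the compliance check. A real tool would of course have to gather this state from the machines first, which is the hard part.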

In conclusion, I think this was a great workshop where people talked about their configuration management issues and shared what works for them and what problems they are facing. Cultural and social problems were everywhere too, and everyone took their own approach to solving them. I also liked a lot that this was not at all tool specific: even if many limitations and problems were caused by the particular tools in use, the discussion was kept at a generic level and did not drift into "you should use this other tool and it will fix all your issues" type of solutions.

If you are interested in this topic you might also like to check out the tech talks on Wednesday, "A Survey of System Configuration Tools", and Friday, "Configuration Management for Mac OS X: It's Just Unix, Right?", and the BOFs: "Getting Started with Configuration Management", "Puppet BoF" and the "Bcfg2 BoF".