This tutorial is a course in statistics aimed at system administrators and the kinds of data they face. We assume little prior knowledge of statistics, cover the most common concepts in descriptive statistics, and apply them to data drawn from real-life examples. Our aim is to show which methods support sound interpretation of data, such as distributions and probability, and how to formulate basic statements about the properties of observed data.
The first part covers descriptive statistics for single datasets, including mean, median, mode, range, and distributions. When discussing distributions, we will cover probabilities via percentiles (noting, for example, that a normal distribution is very uncommon in ops data). This session uses a prepared dataset and a spreadsheet (LibreOffice or OpenOffice, since they work on all platforms). The data is the number of simultaneous players of an online game over a six-month period. In this exercise we analyze the distribution and try to make statements such as, “What is the likelihood that we see more than 27,000 simultaneous players?” One lesson is that the top 5% of the distribution accounts for almost a doubling of the player count, which is interesting. We then extend the discussion to organizational implications: imagine your job is to buy resources for a service like this, and you have to double your rig to cope with something that is only 5% likely to happen. How would you explain that in a meeting?
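The exercise itself is done in a spreadsheet, but the same percentile analysis can be sketched in a few lines of Python. The player counts below are synthetic stand-ins, not the workshop dataset, and the lognormal parameters are purely illustrative:

```python
# Estimate the likelihood of exceeding a threshold from observed data.
# The player counts here are synthetic, not the workshop dataset.
import random
import statistics

random.seed(42)
# Simulate six months of hourly simultaneous-player samples (hypothetical
# lognormal shape, chosen only so the numbers look ops-like).
players = [int(random.lognormvariate(9.5, 0.35)) for _ in range(180 * 24)]

mean = statistics.mean(players)
median = statistics.median(players)
p95 = statistics.quantiles(players, n=20)[-1]  # 95th percentile cut point

# Empirical probability of seeing more than 27,000 simultaneous players.
threshold = 27_000
p_exceed = sum(1 for p in players if p > threshold) / len(players)

print(f"mean={mean:.0f} median={median:.0f} p95={p95:.0f}")
print(f"P(players > {threshold}) = {p_exceed:.1%}")
```

Comparing the 95th percentile with the median makes the session's point concrete: the tail of the distribution can sit far above the "typical" value that averages suggest.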
The second part discusses comparisons using two common methods that can be calculated in a spreadsheet: correlation and regression. Correlation serves as a tool for identifying interesting relationships in data; rank correlation may be preferred for two datasets that share the same “flow” but sit on different ranges (e.g., the correlation between web requests and database requests). Regression can also be used to identify relationships. For example, using a regression plot of two variables, one could identify bottlenecks by comparing the load on two tiers (db tier vs. web tier). In a scalable system, we would expect a nice 45-degree linear relationship between the two. However, if the database tier struggles before the web tier does, we would see the relationship bend “upward” away from the linear approximation (with db load on the y axis) as the load increases.
Throughout, we focus on takeaways and on coupling each statistical method with the type of answer it can provide, such as: “Can the average of a dataset explain the outer limits of my data?” It is easy for an audience to lose the thread with a topic like statistics. We are aware of this risk and will use active-learning tools such as Socrative and Kahoot to engage the audience and encourage participation.
Sysadmins who are faced with data overload and wish they knew how statistics could help them make sense of it. We assume little prior knowledge of statistics, but basic mathematical proficiency is recommended.
- A fundamental understanding of how descriptive statistics can provide additional insight into data in the sysadmin world, and a foundation for further self-study in statistics.
- A basic set of statistical approaches that can be used to identify fundamental properties of the data they see in their own environments, and identify patterns in that data.
- The ability to make accurate and clear statements about their metrics that are valuable to the organization.
- Descriptive statistics for single datasets, including: mean, median, mode, range, and distributions
- Basic analysis of distributions and probabilities via percentiles, applied to the distributions typically seen in ops data
- Interpretation of analyses to include team and business implications
- Regression analysis to suggest predictive relationships, with an emphasis on interpretation and implications
- Correlation analysis and broad pattern detection (if time allows)