LISA '11 Session Abstracts

CONFERENCE PAPER ABSTRACTS

Tech Sessions: Wednesday, December 7 | Thursday, December 8 | Friday, December 9
Wednesday, December 7, 2011
11:00 a.m.–12:30 p.m.
Staging Package Deployment via Repository Management
This paper describes an approach for managing package versions and updates in a homogeneous manner across a heterogeneous environment by intensively managing a set of software repositories rather than by managing the clients. This entails maintaining multiple local mirrors, each of which is aimed at a different class of client: one is directly synchronized from the upstream repositories, while others are maintained from that repository according to various policies that specify which packages are to be automatically pulled from upstream (and therefore automatically installed without any local vetting) and which are to be considered more carefully (for instance, by first being installed in a testing environment) before they are deployed widely.

CDE: Run Any Linux Application On-Demand Without Installation
There is a huge ecosystem of free software for Linux, but since each Linux distribution (distro) contains a different set of pre-installed shared libraries, filesystem layout conventions, and other environmental state, it is difficult to create and distribute software that works without hassle across all distros. Online forums and mailing lists are filled with discussions of users' troubles with compiling, installing, and configuring Linux software and its myriad dependencies. To address this ubiquitous problem, we have created an open-source tool called CDE that automatically packages up the Code, Data, and Environment required to run a set of x86-Linux programs on other x86-Linux machines. Creating a CDE package is as simple as running the target application under CDE's monitoring, and executing a CDE package requires no installation, configuration, or root permissions. CDE enables Linux users to instantly run any application on-demand without encountering "dependency hell".
Improving Virtual Appliance Management through Virtual Layered File Systems
Managing many computers is difficult. Recent virtualization trends exacerbate this problem by making it easy to create and deploy multiple virtual appliances per physical machine, each of which can be configured with different applications and utilities. This results in a huge scaling problem for large organizations as management overhead grows linearly with the number of appliances. To address this problem, we introduce Strata, a system that combines unioning file system and package management semantics to enable more efficient creation, provisioning, and management of virtual appliances. Unlike traditional systems that depend on monolithic file systems, Strata uses a collection of individual software layers that are composed together into the Virtual Layered File System (VLFS) to provide the traditional file system view. Individual layers are maintained in a central repository and shared across all file systems that use them. Layer changes and upgrades only need to be done once in the repository and are then automatically propagated to all virtual appliances, resulting in management overhead independent of the number of appliances. Our Strata Linux prototype requires only a single loadable kernel module providing the VLFS support and doesn't require any application or source code level kernel modifications. Using this prototype, we demonstrate how Strata enables fast system provisioning, simplifies system maintenance and upgrades, speeds system recovery from security exploits, and incurs only modest performance overhead.
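The layering idea in the Strata abstract can be illustrated with a small in-memory sketch. This is not Strata's implementation or on-disk format: the `LayeredView` class, the dictionary-based layers, and the file names are all hypothetical, chosen only to show how a union of shared layers lets one repository update reach every appliance.

```python
from typing import Dict, List, Optional

# Hypothetical model: a layer is just a mapping from file path to contents.
Layer = Dict[str, str]


class LayeredView:
    """Read-only union of layers; earlier layers in the list take precedence."""

    def __init__(self, layers: List[Layer]) -> None:
        self.layers = layers

    def read(self, path: str) -> Optional[str]:
        # Return the file from the first (topmost) layer that provides it.
        for layer in self.layers:
            if path in layer:
                return layer[path]
        return None

    def listing(self) -> List[str]:
        # The visible file system is the union of all layer contents.
        paths = set()
        for layer in self.layers:
            paths.update(layer)
        return sorted(paths)


# A shared repository of layers (assumed names and contents).
repository: Dict[str, Layer] = {
    "base": {"/bin/sh": "sh-1.0", "/etc/motd": "welcome"},
    "openssl": {"/lib/libssl.so": "ssl-1.0"},
}

# Two appliances compose the same shared layers; "web" adds a private top layer.
web = LayeredView([{"/etc/motd": "web"}, repository["openssl"], repository["base"]])
db = LayeredView([repository["openssl"], repository["base"]])

# Upgrading the layer once in the repository is visible to every appliance.
repository["openssl"]["/lib/libssl.so"] = "ssl-1.1"
```

Because both views hold references to the same layer objects, the single upgrade is seen by `web` and `db` alike, while the private top layer of `web` still overrides `/etc/motd`.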
2:00 p.m.–3:30 p.m.
Sequencer: Smart Control of Hardware and Software Components in Clusters (and Beyond)
Starting/stopping a whole cluster or a part of it is a real challenge considering the different commands related to various device types and manufacturers, and the order that should be respected. This article presents a solution called the sequencer that allows the automatic shutting down and starting up of clusters, subsets of clusters, or even data centers. It provides two operation modes designed for ease of use and emergency conditions. Our product has been designed to be efficient, and it is currently used to power on and power off one of the largest clusters in the world: the Tera-100, made of more than 4000 nodes.

Automated Planning for Configuration Changes
This paper describes a prototype implementation of a configuration system which uses automated planning techniques to compute workflows between declarative states. The resulting workflows are executed using the popular combination of ControlTier and Puppet. This allows the tool to be used in unattended "autonomic" situations where manual workflow specification is not feasible. It also ensures that critical operational constraints are maintained throughout the execution of the workflow. We describe the background to the configuration and planning techniques, the architecture of the prototype, and show how the system deals with several examples of typical reconfiguration problems.

Fine-grained Access-control for the Puppet Configuration Language
System configuration tools automate the configuration and management of IT infrastructures. However, these tools fail to provide decent authorisation on configuration input. In this paper we apply fine-grained authorisation of individual changes on a complex input language of an existing tool. We developed a prototype that extracts meaningful changes from the language used in the Puppet tool. These changes are authorised using XACML. We applied this approach successfully on realistic access control scenarios and provide design patterns for developing XACML policies.
4:00 p.m.–5:30 p.m.
Tiqr: A Novel Take on Two-Factor Authentication
Authentication is of paramount importance for all modern networked applications. The username/password paradigm is ubiquitous. This paradigm suffices for many applications that require a relatively low level of assurance about the identity of the end user, but it quickly breaks down when a stronger assertion of the user's identity is required. Traditionally, this is where two- or multi-factor authentication comes in, providing a higher level of assurance. There is a multitude of two-factor authentication solutions available, but we feel that many solutions do not meet the needs of our community. They are invariably expensive, difficult to roll out in heterogeneous user groups (like student populations), often closed source and closed technology, and have usability problems that make them hard to use. In this paper we will give an overview of the two-factor authentication landscape and address the issues of closed versus open solutions. We will introduce a novel open standards-based authentication technology that we have developed and released in open source. We will then provide a classification of two-factor authentication technologies, and we will finish with an overview of future work.

Building Useful Security Infrastructure for Free
Working as a Security Engineer for a research program in the Federal government is a lot of fun, but incredibly challenging. Research, rightfully, receives the lion's share of funding, leaving very little for support services like IT and no funding for security-specific activities. However, the burden of designing, implementing, analyzing, and reporting compliance to weighty government IT Security mandates like FISMA falls squarely on the IT section. Our IT staff is fewer than 10 people. We provide Help Desk, Linux server administration, networking (switches, IDS, firewalls, NMS), SQL databases, SMB file shares, programming support, and training, and we implement in-house applications for scientific research, mostly in Perl and PHP, for our institute of 700-900 users. We are also responsible for reporting compliance with Federal, Institutional, and Divisional mandates to our oversight. In order to achieve all of this with a small staff, we've designed and implemented a lot of automation based on Open Source Software. We've learned how to leverage these tools to meet the needs of our institute and the requirements of those above us. As a pragmatic group with very little free time, we focus on building security tools that provide daily operational value. We simply do not have the resources to implement controls for the sake of the controls themselves.

Local System Security via SSHD Instrumentation
In this paper we describe a method for near real-time identification of attack behavior and local security policy violations taking place over SSH. A rationale is provided for the placement of instrumentation points within SSHD based on the analysis of data flow within the OpenSSH application as well as our overall architectural design and design principles. Sample attack and performance analysis examples are also provided.
Thursday, December 8, 2011
10:45 a.m.–12:45 p.m.
Adventures in (Small) Datacenter Migration
In May 2011, we embarked on an ambitious course: in 3 weeks, clear out a small, soon-to-be-demolished research datacenter containing 5 dozen research systems spanning 5 research groups and, along with a new faculty member's systems located off-site, move it all into another space suffering from 18 years of accumulated computer systems research history. We made it happen, but only after intensive pre-planning and after overcoming a number of challenges, both technical and non-technical, and suffering a moderate amount of bodily injury. We present an account of our adventures and examine our work in facilities, networking, and project management and the challenges we encountered along the way, many of which were not primarily technical in nature, and evaluate our approaches, methods, and results to extract useful lessons so that others may learn from our reckless ambition.

Bringing Up Cielo: Experiences with a Cray XE6 System, or, Getting Started with Your New 140k Processor System
High Performance Computing systems are complex to stand up and integrate into a wider environment, involving large amounts of hardware and software work to be completed in a fixed timeframe. It is easy for unforeseen challenges to arise during the process, especially with respect to the integration work: sites have dramatically different environments, making it impossible for a vendor to deliver a product that exactly fits everybody's needs. In this paper we will look at the standup of Cielo, a 96-rack Cray XE6 system located at Los Alamos National Laboratory. We will examine many of the challenges we experienced while installing and integrating the system, as well as the solutions we found and lessons we learned from the process.

Capacity Forecasting in a Backup Storage Environment
Managing storage growth is painful [1]. When a system exhausts available storage, it is not only an operational inconvenience but also a budgeting nightmare. Many system administrators already have historical data for their systems and thus can predict full capacity events in advance. EMC has developed a capacity forecasting tool for Data Domain systems which has been in production since January 2011. This tool analyses historical data from over 10,000 backup systems daily, forecasts the future date for full capacity, and sends proactive notifications. This paper describes the architecture of the tool, the predictive model it employs, and the results of the implementation.

Content-aware Load Balancing for Distributed Backup
When backing up a large number of computer systems to many different storage devices, an administrator has to balance the workload to ensure the successful completion of all backups within a particular period of time. When these devices were magnetic tapes, this assignment was trivial: find an idle tape drive, write what fits on a tape, and replace tapes as needed. Backing up data onto deduplicating disk storage adds both complexity and opportunity. Since one cannot swap out a filled disk-based file system the way one switches tapes, each separate backup appliance needs an appropriate workload that fits into both the available storage capacity and the throughput available during the backup window. Repeating a given client's backups on the same appliance not only reduces capacity requirements but can also improve performance by eliminating duplicates from network traffic. Conversely, any reconfiguration of the mappings of backup clients to appliances suffers the overhead of repopulating the new appliance with a full copy of a client's data. Reassigning clients to new servers should only be done when the need for load balancing exceeds the overhead of the move. In addition, deduplication offers the opportunity for content-aware load balancing that groups clients together for improved deduplication that can further improve both capacity and performance; we have seen a system with as much as 75% of its data overlapping other systems, though overlap around 10% is more common. We describe an approach for clustering backup clients based on content, assigning them to backup appliances, and adapting future configurations based on changing requirements while minimizing client migration. We define a cost function and compare several algorithms for minimizing this cost. This assignment tool resides in a tier between backup software such as EMC NetWorker and deduplicating storage systems such as EMC Data Domain.
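To make the trade-off in the load-balancing abstract concrete, here is a minimal sketch of one possible cost function and a greedy minimizer. It is not the cost function or any of the algorithms from the paper: the pairwise overlap model, the penalty weights, and the client and appliance names are assumptions made purely for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet

Assignment = Dict[str, str]  # client -> appliance


def assignment_cost(
    assignment: Assignment,
    sizes: Dict[str, float],
    overlap: Dict[FrozenSet[str], float],
    capacity: Dict[str, float],
    previous: Assignment,
    move_penalty: float = 1.0,       # assumed weight per unit of migrated data
    overflow_penalty: float = 10.0,  # assumed weight per unit over capacity
) -> float:
    """Storage consumed, plus penalties for overflow and for moving clients."""
    used = {a: 0.0 for a in capacity}
    for client, appliance in assignment.items():
        used[appliance] += sizes[client]
    # Simplification: co-located clients store their pairwise shared data once.
    for c1, c2 in combinations(sorted(assignment), 2):
        if assignment[c1] == assignment[c2]:
            used[assignment[c1]] -= overlap.get(frozenset((c1, c2)), 0.0)
    overflow = sum(max(0.0, used[a] - capacity[a]) for a in capacity)
    # A client that changes appliance must be repopulated in full.
    moved = sum(
        sizes[c] for c, a in assignment.items() if previous.get(c) not in (None, a)
    )
    return sum(used.values()) + overflow_penalty * overflow + move_penalty * moved


def greedy_assign(sizes, overlap, capacity, previous) -> Assignment:
    """Place the largest clients first, each where the total cost rises least."""
    assignment: Assignment = {}
    for client in sorted(sizes, key=sizes.get, reverse=True):
        assignment[client] = min(
            capacity,
            key=lambda a: (
                assignment_cost(
                    {**assignment, client: a}, sizes, overlap, capacity, previous
                ),
                a,  # break ties by appliance name for determinism
            ),
        )
    return assignment


# Hypothetical inputs: sizes and overlap in the same unit (e.g. GB).
sizes = {"mail": 60, "web": 50, "db": 40}
overlap = {frozenset(("web", "db")): 20}
capacity = {"dd1": 100, "dd2": 100}
previous = {"mail": "dd1", "web": "dd1", "db": "dd2"}

result = greedy_assign(sizes, overlap, capacity, previous)
```

With these inputs the greedy pass keeps `mail` where it was, moves `web` off the overfull appliance, and co-locates it with `db`, with which it shares data. A real deployment would also need to account for throughput during the backup window, which this sketch omits.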
2:00 p.m.–3:00 p.m.
Getting to Elastic: Adapting a Legacy Vertical Application Environment for Scalability
During my time in the field prior to joining Puppet Labs, I experienced several scenarios where I was asked to be prepared for so-called "elastic" operations, which would dynamically scale according to end-user demand. This demand only intensified as the notion of moving to IaaS became realistic. There's no button you hit marked "make elastic" to turn your infrastructure into an elastic cloud; rather, to get there you need to come to an understanding of the technologies your organization uses, its tolerances for latency and downtime, and your platform. This paper discusses the key areas that must be addressed: organizational culture, technical policy development, and infrastructure readiness.

Scaling on EC2 in a Fast-Paced Environment
Managing a server infrastructure in a fast-paced environment like a start-up is challenging. You have little time for provisioning, testing, and planning, but you still need to prepare for scaling when your product reaches the tipping point. Amazon EC2 is one of the cloud providers that we experimented with while growing our infrastructure from 20 servers to 500 servers. In this paper we will go over the pros and cons of managing EC2 instances with a mix of BIND, LDAP, SimpleDB, and Python scripts; how we kept a smooth working process by using NFS, auto-mount, and shell scripting; why we switched from managing our instances based on tailor-made AMIs and shell scripting to the official Ubuntu AMI, Cloud-init, and Puppet; and finally, some rules we had to follow carefully to be able to handle billions of daily non-static HTTP requests across multiple Amazon EC2 regions.
3:30 p.m.–5:30 p.m.
DarkNOC: Dashboard for Honeypot Management
Protecting computer and information systems from security attacks is becoming an increasingly important task for system administrators. Honeypots are a technology often used to detect attacks and collect information about techniques and targets (e.g., services, ports, operating systems) of attacks. However, managing a large and complex network of honeypots becomes a challenge given the amount of data collected as well as the risk that the honeypots may become infected and start attacking other machines. In this paper, we present DarkNOC, a management and monitoring tool for complex honeynets consisting of different types of honeypots as well as other data collection devices. DarkNOC has been actively used to manage a honeynet consisting of multiple subnets and hundreds of IP addresses. This paper describes the architecture and a number of case studies demonstrating the use of DarkNOC.

A Cuckoo's Egg in the Malware Nest: On-the-fly Signature-less Malware Analysis, Detection, and Containment for Large Networks
Avatar is a new architecture devised to perform on-the-fly malware analysis and containment on ordinary hosts; that is, on hosts with no special setup. The idea behind Avatar is to inject the suspected malware with a specially crafted piece of software at the moment that it tries to download an executable. The special software can cooperate with a remote analysis engine to determine the main characteristics of the suspected malware and choose an appropriate containment strategy, which may be to terminate the process if it turns out to be malicious, or to let it continue otherwise. Augmented with additional detection heuristics we present in the paper, Avatar can also perform signature-less malware detection and containment.
Auto-learning of SMTP TCP Transport-Layer Features for Spam and Abusive Message Detection
Botnets are a significant source of abusive messaging (spam, phishing, etc.) and other types of malicious traffic. A promising approach to help mitigate botnet-generated traffic is signal analysis of transport-layer (i.e., TCP/IP) characteristics, e.g., timing, packet reordering, congestion, and flow-control. Prior work [4] shows that machine learning analysis of such traffic features on an SMTP MTA can accurately differentiate between botnet and legitimate sources. We make two contributions toward the real-world deployment of such techniques: i) an architecture for real-time on-line operation; and ii) auto-learning of the unsupervised model across different environments without human labeling (i.e., training). We present a "SpamFlow" SpamAssassin plugin and the requisite auxiliary daemons to integrate transport-layer signal analysis with a popular open-source spam filter. Using our system, we detail results from a production deployment where our auto-learning technique achieves better than 95 percent accuracy, precision, and recall after reception of ≈ 1,000 emails.

Using Active Intrusion Detection to Recover Network Trust
Most existing intrusion detection systems take a passive approach to observing attacks or noticing exploits. We suggest that active intrusion detection (AID) techniques provide value, particularly in scenarios where an administrator attempts to recover a network infrastructure from a compromise. In such cases, an attacker may have corrupted fundamental services (e.g., ARP, DHCP, DNS, NTP), and existing IDS or auditing tools may lack the precision or pervasive deployment to observe symptoms of this corruption. We prototype a specific instance of the active intrusion detection approach: how we can use an AID mechanism based on packet injection to help detect rogue services.
Friday, December 9, 2011
9:00 a.m.–10:30 a.m.
Community-based Analysis of Netflow for Early Detection of Security Incidents
Detection and remediation of security incidents (e.g., attacks, compromised machines, policy violations) is an increasingly important task of system administrators. While numerous tools and techniques are available (e.g., Snort, nmap, netflow), novel attacks and low-grade events may still be hard to detect in a timely manner. In this paper, we present a novel approach for detecting stealthy, low-grade security incidents by utilizing information across a community of organizations (e.g., banking industry, energy generation and distribution industry, governmental organizations in a specific country, etc.). The approach uses netflow, a commonly available non-intrusive data source, analyzes communication to/from the community, and alerts the community members when suspicious activity is detected. Community-based detection can detect incidents that would fall below local detection thresholds while keeping the number of daily alerts at a manageable level.

WCIS: A Prototype for Detecting Zero-Day Attacks in Web Server Requests
This work presents the Web Classifying Immune System (WCIS), a prototype system to detect zero-day attacks against web servers by examining web server requests. WCIS is intended to work in conjunction with more traditional intrusion detection systems to detect new and emerging threats that are not detected by the traditional IDS database. WCIS is at its core an artificial immune system, but WCIS expands on the concept of artificial immune systems by adding a classifier for web server requests. This gives the system administrator more information about the nature of the detected threat, which is not given by a traditional artificial immune system. This prototype system also seeks to improve the efficiency of an artificial immune system by employing back-end batch processing so that WCIS can detect threats on higher capacity networks. This work shows that WCIS is able to achieve a high rate of accuracy at detecting and classifying attacks against web servers with very few false positives.
11:00 a.m.–12:30 p.m.
Automating Network and Service Configuration Using NETCONF and YANG
Network providers are challenged by new requirements for fast and error-free service turn-up. Existing approaches to configuration management such as CLI scripting, device-specific adapters, and entrenched commercial tools are an impediment to meeting these new requirements. Up until recently, there has been no standard way of configuring network devices other than SNMP, and SNMP is not optimal for configuration management. The IETF has released NETCONF and YANG, which are standards focusing on configuration management. We have validated that NETCONF and YANG greatly simplify the configuration management of devices and services and still provide good performance. Our performance tests are run in a cloud managing 2000 devices. Our work can help existing vendors and service providers to validate a standardized way to build configuration management solutions.

Deploying IPv6 in the Google Enterprise Network: Lessons Learned
This paper describes how we deployed IPv6 in our corporate network in a relatively short time with a small core team that carried most of the work, the challenges we faced during the different implementation phases, and the network design used for IPv6 connectivity. The scope of this document is the Google enterprise network, that is, the internal corporate network that involves desktops, offices, and so on. It is not the network or machines used to provide search and other Google public services. Our enterprise network consists of heterogeneous vendors, equipment, devices, and hundreds of in-house developed applications and setups; not only different OSes like Linux, Mac OS X, and Microsoft Windows, but also different networking vendors and device models including Cisco, Juniper, Aruba, and Silverpeak. These devices are deployed globally in hundreds of offices, corporate data centers, and other locations around the world. They support tens of thousands of employees, using a variety of network topologies and access mechanisms to provide connectivity.

Experiences with BOWL: Managing an Outdoor WIFI Network (or How to Keep Both Internet Users and Researchers Happy?)
The Berlin Open Wireless Lab (BOWL) project at Technische Universität Berlin (TUB) maintains an outdoor WiFi network which is used both for Internet access and as a testbed for wireless research. From the very beginning of the BOWL project, we experienced several development and operations challenges to keep Internet users and researchers happy. Development challenges included allowing multiple researchers with very different requirements to run experiments in the network while maintaining reliable Internet access. On the operations side, one of the recent issues we faced was authentication of users from different domains, which required us to integrate with various external authentication services. In this paper, we present our experience in handling these challenges on both development and operations sides and the lessons we learned.
2:00 p.m.–3:30 p.m.
Why Do Migrations Fail and What Can We Do about It?
This paper investigates, through two case studies, the main causes that make application migration to the cloud complicated and error-prone. We first discuss the typical configuration errors in each migration case study based on our error categorization model, which classifies the configuration errors into seven categories. Then we describe the common installation errors across both case studies. By analyzing operator errors in our case studies for migrating applications to the cloud, we present the design of CloudMig, a semi-automated migration validation system with two unique characteristics. First, we develop a continual query (CQ) based configuration policy checking system, which helps operators weave important configuration constraints into CQ-based policies and periodically run these policies to monitor configuration changes and to detect and alert on possible configuration constraint violations. Second, CloudMig combines the CQ-based policy checking with template-based installation automation to help operators reduce installation errors and increase the correctness assurance of application migration. Our experiments show that CloudMig can effectively detect a majority of the configuration errors in the migration process.

Provenance for System Troubleshooting
System administrators use a variety of techniques to track down and repair (or avoid) problems that occur in the systems under their purview. Analyzing log files, cross-correlating events on different machines, establishing liveness and performance monitors, and automating configuration procedures are just a few of the approaches used to stave off entropy. These efforts are often stymied by the presence of hidden dependencies between components in a system (e.g., processes, pipes, files, etc.). In this paper we argue that system-level provenance (metadata that records the history of files, pipes, processes, and other system-level objects) can help expose these dependencies, giving system administrators a more complete picture of component interactions, thus easing the task of troubleshooting.

Debugging Makefiles with remake
This paper briefly talks about the GNU Make debugger I wrote. In addition to the debugging capabilities, I added other useful features, such as:
- The Makefile file name and the line number inside this file are reported when referring to a target.
- On error:
  - A stack of targets that were being considered is shown, again with their locations.
  - The command invocation used to run make is shown.
  - An option allows for entering the debugger on error.
- A distinction is made between shell code and shell output.
- A --tasks option prints a list of "interesting" targets.
I show the use of the debugger to solve a real problem I encountered, and show how the shell commands can be written to a file so they can be debugged using a POSIX-shell debugger.

Last changed: 29 Sept. 2011 jel