Hot-ICE '11


Tuesday, March 29, 2011
9:00 a.m.–10:30 a.m.

Dynamic Resource Allocation for Spot Markets in Clouds
Cloud computing promises on-demand provisioning of resources to applications and services. To deal with dynamically fluctuating resource demands, market-driven resource allocation has been proposed and recently implemented by commercial cloud providers like Amazon EC2. In this environment, cloud resources are offered in distinct types of virtual machines (VMs), and the cloud provider runs a continuous market-driven mechanism for each VM type with the goal of achieving maximum revenue over time. However, as the demand for each VM type can fluctuate independently at run time, dynamically allocating data center resources to each spot market so as to maximize the cloud provider's total revenue is a challenging problem. In this paper, we present a solution that consists of two parts: (1) market analysis for forecasting the demand in each spot market, and (2) a dynamic scheduling and consolidation mechanism that allocates resources to each spot market to maximize total revenue. As optimally allocating resources for revenue maximization is an NP-hard problem, we show that our algorithms can approximate the optimal solutions under both fixed and variable pricing schemes. Simulation studies confirm the effectiveness of our approach.
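The core scheduling idea — repeatedly giving the next unit of capacity to the spot market where it earns the most — can be illustrated with a toy greedy sketch. This is not the paper's algorithm; the VM types, marginal revenues, and capacity below are invented for illustration, and greedy allocation is only optimal when marginal revenues are non-increasing.

```python
def allocate_capacity(markets, capacity):
    """Greedily assign data-center capacity units to the spot market
    with the highest marginal expected revenue per unit.

    markets[t][k] is the (assumed non-increasing) expected revenue of
    the (k+1)-th capacity unit given to VM-type market t.
    """
    allocation = {t: 0 for t in markets}
    for _ in range(capacity):
        # pick the market whose next unit yields the most revenue
        best, best_gain = None, 0.0
        for t, gains in markets.items():
            k = allocation[t]
            if k < len(gains) and gains[k] > best_gain:
                best, best_gain = t, gains[k]
        if best is None:
            break  # no market gains from more capacity
        allocation[best] += 1
    return allocation

# hypothetical forecasted marginal revenues per capacity unit
markets = {"small": [5.0, 4.0, 1.0], "large": [6.0, 2.0, 0.5]}
print(allocate_capacity(markets, 4))  # → {'small': 2, 'large': 2}
```

With diminishing marginal revenues this greedy rule matches the optimal split; the paper's actual mechanism additionally handles forecasting and consolidation, which this sketch omits.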

On the Benefit of Virtualization: Strategies for Flexible Server Allocation
Network virtualization is an intriguing paradigm which loosens the ties between services and physical infrastructure. The gained flexibility promises faster innovation, enabling a more diverse Internet and ensuring the coexistence of heterogeneous virtual network (VNet) architectures on top of a shared substrate. Moreover, the dynamic and demand-driven allocation of resources may yield a "greener Internet" without sacrificing (or, given the corresponding migration technology, even improving) quality of service (QoS) and quality of experience (QoE). This paper attends to a fundamental challenge in the field of network virtualization: the flexible allocation and migration of servers. As a generic use case, we consider a network operator offering a flexible service to a set of dynamic or mobile users, and we present a model that captures the main cost factors in such a system. This allows us to shed light on the benefit of flexible allocation and the use of migration. Although our cost model is described from a network virtualization perspective, it is not limited to such architectures: similar tradeoffs exist, e.g., in classic cloud networks, in content distribution networks, in the deployment of multicast reflectors or mirrored web content, and in cache placement. Our algorithms and insights are quite general and applicable to various scenarios, ranging from business applications such as SAP services in the cloud to entertainment applications such as mobile gaming. Concretely, the algorithms presented in this paper guarantee a low access latency by adapting the resources over time while taking into account the corresponding costs: communication cost, allocation cost, migration cost (e.g., service interruption), and the cost of running the servers.
The algorithms come in two flavors, exploring the extremal perspectives: online algorithms, where allocation decisions are made without any information on future requests, and offline algorithms, where the (e.g., periodic) demand is known ahead of time. Both are applicable to various delay models (access latency, delay due to different load functions, etc.). Moreover, we also describe an optimal offline but static algorithm that allows us to quantify the cost-benefit tradeoffs of dynamic resource allocation, and thus to shed light on fundamental questions such as the benefit of migration compared to solutions using static servers. For example, our simulations show that the overall cost can be up to one hundred percent higher if resources are static, in particular when the demand dynamics are moderate.
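The online flavor of this problem admits a classic rent-or-buy style rule: keep paying access costs from the current server location, and migrate only once the accumulated access cost since the last move exceeds the migration cost. The sketch below is an illustration of that general idea, not the authors' algorithm; the sites, distance function, and costs are invented.

```python
def online_migration(requests, start, migration_cost, dist):
    """Ski-rental-style online rule: serve each request remotely,
    paying the access cost dist(server, request); once the access cost
    accumulated since the last move exceeds migration_cost, migrate
    the server to the latest request's site. Returns (total cost,
    final server location)."""
    server, total, since_move = start, 0.0, 0.0
    for r in requests:
        c = dist(server, r)
        total += c
        since_move += c
        if since_move > migration_cost:
            server = r                 # migrate toward the demand
            total += migration_cost
            since_move = 0.0
    return total, server

# demand shifts from site 0 to site 5; migrating costs 4
cost, final = online_migration([0, 5, 5, 5], 0, 4.0,
                               lambda a, b: abs(a - b))
```

The amortization argument is the point: between any two migrations the rule pays at most one migration cost's worth of extra access cost, which is the standard reasoning behind constant-competitive online server placement.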

Cost-Aware Live Migration of Services in the Cloud
Live migration of virtual machines is an important component of the emerging cloud computing paradigm. While live migration provides extreme versatility of management, it comes at the price of degraded service performance during migration. The bulk of studies devoted to live migration of virtual machines focus on the duration of the copy phase as the primary metric of migration performance. While shorter downtimes are clearly desirable, the pre-copy phase imposes an overhead on the infrastructure that may result in severe performance degradation of the migrated and collocated services, offsetting the benefits accrued through live migration. We observe that there is a non-trivial trade-off between minimizing the duration of the copy phase and maintaining an acceptable quality of service during the pre-copy phase, and we introduce a new model to quantify this trade-off. We then show that, using our model, an optimal migration schedule can be efficiently calculated. Finally, we simulate, using real traces, live migrations of a virtual machine running a web server, and compare the migration cost of our algorithm with that of commonly used live-migration methods.
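The trade-off the abstract describes can be made concrete with a toy pre-copy model (not the paper's model; all rates, costs, and the per-round overhead are invented): each round retransmits the pages dirtied during the previous round, so more rounds shrink the final downtime but prolong the degraded pre-copy period.

```python
def best_precopy_rounds(total_pages, dirty_rate, bandwidth, overhead,
                        degrade_cost, down_cost, max_rounds):
    """Toy pre-copy model: round i resends the pages dirtied during
    round i-1, plus a fixed per-round overhead; stopping after round k
    incurs downtime proportional to the pages still dirty. Returns the
    round count minimizing
        degrade_cost * pre-copy time + down_cost * downtime.
    """
    best_k, best_cost = None, float("inf")
    for k in range(1, max_rounds + 1):
        pages, precopy = total_pages, 0.0
        for _ in range(k):
            t = pages / bandwidth + overhead  # time to send the dirty set
            precopy += t
            pages = dirty_rate * t            # pages dirtied meanwhile
        cost = degrade_cost * precopy + down_cost * (pages / bandwidth)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost

k, c = best_precopy_rounds(10000, 100, 1000, 1.0, 1.0, 100.0, 10)
# with these (hypothetical) numbers a middle ground wins: too few rounds
# means long downtime, too many means prolonged degradation
```

The interior optimum only appears because of the per-round overhead; without it, this geometric model is monotone in the round count, which is one reason a richer cost model like the paper's is needed.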

Server Operational Cost Optimization for Cloud Computing Service Providers over a Time Horizon
Service providers are migrating to on-demand cloud computing services to unburden themselves of the task of managing infrastructure, while cloud computing providers expand the number of servers in their data centers to handle the increased load. With this growth, energy consumption increases significantly. Conserving energy and reducing operational cost while satisfying service level agreements (SLAs) is important for cloud computing providers, both to reduce carbon emissions and to contain costs. Moreover, the aggregated demands for different services are dynamic over a time horizon. We present a multi-time-period optimization model for reducing operational cost by combining two factors: (1) Dynamic Voltage/Frequency Scaling (DVFS), and (2) turning servers on and off over a time horizon. We show the impact of the granularity of the time slots and of the frequency options on optimal solutions. A parametric study on the varying cost of turning servers on/off and power consumption is also presented.
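A stripped-down version of the per-slot decision — how many servers to keep on, and at what frequency — can be sketched as follows. This is a greedy per-slot heuristic for illustration only, not the paper's multi-period model (which optimizes the whole horizon jointly); the frequencies, power figures, and switching cost are invented.

```python
import math

def plan_horizon(demand, freqs, cap_per_ghz, power, switch_cost):
    """For each time slot, pick the (frequency, server count) pair that
    serves the slot's demand at minimum power-plus-switching cost.
    power[f] is one server's power draw at frequency f (GHz); a server
    at f serves cap_per_ghz * f requests. switch_cost penalizes each
    server turned on or off between consecutive slots."""
    n_prev, total, plan = 0, 0.0, []
    for d in demand:
        best = None
        for f in freqs:
            n = math.ceil(d / (cap_per_ghz * f)) if d > 0 else 0
            cost = n * power[f] + switch_cost * abs(n - n_prev)
            if best is None or cost < best[0]:
                best = (cost, f, n)
        cost, f, n = best
        plan.append((f, n))
        total += cost
        n_prev = n
    return plan, total

# two slots of demand; low frequency with more servers wins here
plan, total = plan_horizon([30, 50], [1.0, 2.0], 10,
                           {1.0: 100, 2.0: 260}, 10)
```

Because each slot is decided in isolation, this heuristic can make myopic on/off choices; jointly optimizing across slots, as the abstract proposes, avoids exactly that.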

11:00 a.m.–12:15 p.m.

Automated Incident Management for a Platform-as-a-Service Cloud
Cloud-based offerings such as Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS) are being delivered by various vendors at highly competitive prices to encourage a paradigm shift to utility computing. To optimize the operational costs of managing an IBM Cloud-based PaaS offering, a two-pronged approach has been adopted: simplification of the enterprise-class data center management processes currently used in IBM's Global Services Strategic Outsourcing accounts, and automation of the simplified processes. This paper describes a framework that the authors have developed to deliver an integrated monitoring and event correlation system, and an event-driven Automated Incident Management System, for IBM's Smart Business Dev/Test Cloud offering.

QoSaaS: Quality of Service as a Service
QoSaaS is a new framework that provides QoS information portals for enterprises running real-time applications. By aggregating measurements from both end clients and application servers, QoSaaS derives quality information for individual system components, enabling system operators to discover and localize quality problems, end users to receive hints about their expected experience, and applications to make informed adaptations. Using a large-scale commercial VoIP system as a target application, we report preliminary experiences from developing such a service. We also discuss remaining challenges and outline potential directions for addressing them.

KnowOps: Towards an Embedded Knowledge Base for Network Management and Operations
The domain knowledge required to manage and operate modern communications networks is still largely captured in human-readable documents. In this paper we take the position that an embedded, machine-readable knowledge base that directly supports network management and operations systems is required. We present a framework for such an approach, called KnowOps, and illustrate how it complements and enhances state-of-the-art network management and operations systems.

1:30 p.m.–2:45 p.m.

Using Hierarchical Change Mining to Manage Network Security Policy Evolution
Managing the security of complex cloud and networked computing environments requires crafting security policy—ranging from natural-language text to highly-structured configuration rules, sometimes multi-layered—specifying correct system behavior in an adversarial environment. Since environments change and evolve, managing security requires managing evolution of policies, which adds another layer, the change log. However, evolution increases complexity, and the more complex a policy, the harder it is to manage and update, and the more prone it is to be incorrect. This paper proposes hierarchical change mining, drawing upon the tools of software engineering and data mining, to help practitioners introduce fewer errors when they update policy. We discuss our approach and initial findings based on two longitudinal real-world datasets: low-level router configurations from Dartmouth College and high-level Public Key Infrastructure (PKI) certificate policies from the International Grid Trust Federation (IGTF).

Towards Automated Identification of Security Zone Classification in Enterprise Networks
Knowledge of the security zone classification of devices in an enterprise information technology (IT) infrastructure is essential in many enterprise IT transformation and optimization activities. We describe a systematic and semi-automated approach for discovering the security zone classification of devices in an enterprise network. For reduced interference with normal operation of the IT infrastructure, our approach is structured in stages, each consisting of two phases: one phase involves collecting information about actually allowed network flows, followed by an analysis phase. As part of our approach, we describe an elimination-based inference algorithm. We also present an alternative to the algorithm based on the Constraint Satisfaction Problem, and explore trade-offs between the two. Using a case study, we demonstrate the validity of our approach.
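The flavor of an elimination-based inference can be sketched as a fixed-point computation: every device starts with all candidate zones, and each observed flow eliminates zone assignments that could not legally produce it. This is an illustrative sketch, not the paper's algorithm; the zone names, policy, and flows below are invented.

```python
def infer_zones(devices, zones, flows, allowed):
    """Elimination sketch: cand[d] holds the zones device d might be in.
    For each observed flow (src, dst), keep only src-zones that can
    talk to some remaining dst-zone under the zone-to-zone `allowed`
    policy (and symmetrically for dst). Iterate to a fixed point."""
    cand = {d: set(zones) for d in devices}
    changed = True
    while changed:
        changed = False
        for src, dst in flows:
            keep_src = {zs for zs in cand[src]
                        if any((zs, zd) in allowed for zd in cand[dst])}
            keep_dst = {zd for zd in cand[dst]
                        if any((zs, zd) in allowed for zs in cand[src])}
            if keep_src != cand[src] or keep_dst != cand[dst]:
                cand[src], cand[dst] = keep_src, keep_dst
                changed = True
    return cand

# hypothetical policy: web may reach app, app may reach db
allowed = {("web", "app"), ("app", "db")}
cand = infer_zones(["d1", "d2", "d3"], ["web", "app", "db"],
                   [("d1", "d2"), ("d2", "d3")], allowed)
```

Two observed flows suffice here to pin each device to a single zone; with sparser observations the sets stay larger, which is where the paper's staged collection and CSP-based alternative come in.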

Simplifying Manageability, Scalability and Host Mobility in Large-Scale Enterprise Networks using VEIL-click
The explosive growth in network-driven services and devices is forcing existing networks to expand continually to accommodate the demands they impose. However, the underlying network architecture cannot sustain this continual expansion. As a result, several ad-hoc mechanisms are used as workarounds, which make networks increasingly complicated and difficult to manage. In this paper, we present veil-click, which aims to simplify the management of large-scale enterprise networks by requiring minimal manual configuration overhead. It makes it easy to plug a new routing node or host device into the network without any manual configuration. It builds on the highly scalable and robust routing substrate provided by VIRO, and supports advanced features such as seamless mobility, built-in multi-path routing, and fast re-routing in case of link/node failures. Our current prototype of veil-click is built using the Click Modular Router framework and is being deployed in our lab for evaluation under real traffic conditions.

3:15 p.m.–4:45 p.m.

Enabling Flow-level Latency Measurements across Routers in Data Centers
Detecting and localizing latency-related problems at the router and switch level is an important task for network operators, as latency-critical applications in data center networks become popular. This, however, requires that measurement instances be deployed at every router/switch in the network. In this paper, we study a partial deployment method called Reference Latency Interpolation across Routers (RLIR) that supports network operators' requirements, such as incremental deployment and low deployment complexity, without significantly sacrificing localization granularity or estimation accuracy.

OpenFlow-Based Server Load Balancing Gone Wild
Today's data centers host online services on multiple servers, with a front-end load balancer directing each client request to a particular replica. Dedicated load balancers are expensive and quickly become a single point of failure and congestion. The OpenFlow standard enables an alternative approach where the commodity network switches divide traffic over the server replicas, based on packet-handling rules installed by a separate controller. However, the simple approach of installing a separate rule for each client connection (or "microflow") leads to a huge number of rules in the switches and a heavy load on the controller. We argue that the controller should exploit switch support for wildcard rules for a more scalable solution that directs large aggregates of client traffic to server replicas. We present algorithms that compute concise wildcard rules that achieve a target distribution of the traffic, and automatically adjust to changes in load-balancing policies without disrupting existing connections. We implement these algorithms on top of the NOX OpenFlow controller, evaluate their effectiveness, and propose several avenues for further research.
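The idea of covering a target traffic split with a few wildcard rules can be illustrated by recursively halving the client source-address bit space. This is a simplified sketch, not the paper's algorithm: it assumes the replica weights are dyadic (sums of 1/2^k summing to 1) and that client addresses are uniformly spread over the bit space; the replica names and weights are invented.

```python
def wildcard_rules(weights, prefix=""):
    """Split the source-IP bit space in half until a block's share of
    traffic (1/2^len(prefix)) equals one replica's target weight, then
    emit a single wildcard rule for that block.
    Returns (rules, weights not yet assigned)."""
    block = 2.0 ** (-len(prefix))
    for replica, w in weights.items():
        if abs(w - block) < 1e-12:
            # this replica's weight exactly covers the whole block
            rest = {k: v for k, v in weights.items() if k != replica}
            return [(prefix + "*", replica)], rest
    # otherwise split the block and recurse into both halves
    rules0, weights = wildcard_rules(weights, prefix + "0")
    rules1, weights = wildcard_rules(weights, prefix + "1")
    return rules0 + rules1, weights

# 50/25/25 split over three hypothetical replicas
rules, _ = wildcard_rules({"A": 0.5, "B": 0.25, "C": 0.25})
# rules: [('0*', 'A'), ('10*', 'B'), ('11*', 'C')]
```

Three wildcard rules replace what per-connection microflow rules would need for every client, which is the scalability argument the abstract makes; handling non-dyadic weights and policy changes without disrupting connections is where the paper's algorithms go beyond this sketch.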

Online Measurement of Large Traffic Aggregates on Commodity Switches
Traffic measurement plays an important role in many network-management tasks, such as anomaly detection and traffic engineering. However, existing solutions either rely on custom hardware designed for a specific task, or introduce a high overhead for data collection and analysis. Instead, we argue that a practical traffic-measurement solution should run on commodity network elements, support a range of measurement tasks, and provide accurate results with low overhead. Inspired by the capabilities of OpenFlow switches, we explore a measurement framework where switches match packets against a small collection of rules and update traffic counters for the highest-priority match. A separate controller can read the counters and dynamically tune the rules to quickly "drill down" to identify large traffic aggregates. As the first step towards designing measurement algorithms for this framework, we design and evaluate a hierarchical heavy hitters algorithm that identifies large traffic aggregates, while striking a good balance between measurement accuracy and switch overhead.
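The "drill down" loop the abstract sketches can be illustrated over a binary source-prefix trie: read a prefix's counter, and if it is heavy, recurse into its two children. This is a toy sketch, not the paper's hierarchical heavy hitters algorithm; `read_counter` stands in for reading a switch's per-rule counter, and in a real deployment each level of the recursion would be a new set of rules installed in the next measurement interval. The addresses and threshold are invented.

```python
def drill_down(read_counter, threshold, width, prefix=""):
    """Recursively expand prefixes whose traffic counter meets the
    threshold, down to exact-width prefixes; return the heavy ones
    as (prefix, count) pairs."""
    count = read_counter(prefix)
    if count < threshold:
        return []                 # light prefix: stop drilling here
    if len(prefix) == width:
        return [(prefix, count)]  # fully expanded heavy hitter
    heavy = []
    for bit in "01":
        heavy += drill_down(read_counter, threshold, width, prefix + bit)
    return heavy

# hypothetical per-address packet counts over a 4-bit address space
flows = {"0000": 60, "0001": 5, "1010": 50, "1111": 10}
counter = lambda p: sum(v for a, v in flows.items() if a.startswith(p))
print(drill_down(counter, 40, 4))  # → [('0000', 60), ('1010', 50)]
```

The accuracy/overhead balance the abstract mentions shows up here as the rule budget: drilling one level deeper doubles the candidate rules for that subtree, so the controller must choose which heavy prefixes are worth expanding.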

Topology Switching for Data Center Networks
Emerging data-center network designs seek to provide physical topologies with high bandwidth, large bisection capacities, and many alternative data paths. Yet, existing protocols present a one-size-fits-all approach for forwarding packets. Traditionally, the routing process chooses one "best" route for each end-point pair. While some modern protocols support multiple paths through techniques like ECMP, each path continues to be selected using the same optimization metric. However, today's data centers host applications with a diverse universe of networking needs; a single-minded forwarding approach is likely to either let paths go unused, sacrificing reliability and performance, or make the entire network available to all applications, sacrificing needs such as isolation. This paper introduces topology switching to return control to individual applications for deciding how best to route data among their nodes. Topology switching formalizes the simultaneous use of multiple routing mechanisms in a data center, allowing applications to define multiple routing systems and deploy individualized routing tasks at small time scales. We introduce the topology switching abstraction and illustrate how it can provide both network efficiency and individual application performance, while admitting flexible network management strategies.


Last changed: 8 March 2011 jel