HotOS X Paper
Operating Systems Should Support Business Change
Jeffrey C. Mogul
Existing enterprise information technology (IT) systems often inhibit business flexibility, sometimes with dire consequences. In this position paper, I argue that operating system research should be measured, among other things, against our ability to improve the speed at which businesses can change. I describe some of the ways in which businesses need to change rapidly, speculate about why existing IT infrastructures inhibit useful change, and suggest some relevant OS research problems.
Businesses change. They merge; they split apart; they reorganize. They launch new products and services, retire old ones, and modify existing ones to meet changes in demand or competition or regulations. Agile businesses are more likely to thrive than businesses that cannot change quickly.
A business can lack agility for many reasons, but one common problem (and one that concerns us as computer scientists) is the inflexibility of its IT systems. ``Every business decision generates an IT event.'' For example, a decision to restrict a Web site with product documentation to customers with paid-up warranties requires a linkage between that Web site and the warranty database. If the IT infrastructure deals with such ``events'' slowly, the business as a whole will respond slowly; worse, business-level decisions will stall due to uncertainty about IT consequences.
What does this have to do with operating systems? Surely the bulk of business-change problems must be resolved at or above the application level, but many aspects of operating system research are directly relevant to the significant problems of business change. (I assume a broad definition of ``operating system'' research that encompasses the entire, distributed operating environment.)
Of course, support for change is just one of many problems faced by IT organizations (ITOs), but this paper focusses on business change because it seems underappreciated by the systems software research community. We are much better at problems of performance, scale, reliability, availability, and (perhaps) security.
Inflexible IT systems inhibit necessary business changes. The failure to rapidly complete an IT upgrade can effectively destroy the value of a major corporation. There is speculation that the Sept. 11, 2001 attacks might have been prevented if the FBI had had more flexible IT systems [17, page 77]. Even when IT inflexibility does not contribute to major disasters, it frequently imposes costs of hundreds of millions of dollars (e.g., [13,14]).
The problem is not limited to for-profit businesses; other large organizations have similar linkages between IT and their needs for change. For example, the military is a major IT consumer with rapidly evolving roles; hospitals are subject to new requirements (e.g., HIPAA; infection tracking); universities innovate with IT (e.g., MIT's OpenCourseWare); even charities must evolve their IT (e.g., for tracking requirements imposed by the USA PATRIOT Act). The common factor is a large organization that thinks in terms of buying ``enterprise IT'' systems and services, not just desktops and servers.
IT organizations often spend considerably more money on ``software lifecycle'' costs than on hardware purchases. These costs include software development, testing, deployment, and maintenance. In 2004, 8.1% of worldwide IT spending went to server and storage hardware combined, 20.7% went to packaged software, but 41.6% went to ``services,'' including 15.4% for ``implementation.'' Even after purchasing packaged software, IT departments spend substantial additional sums actually making it work.
Testing and deployment also impose direct hardware costs; for example, roughly a third of HP's internal servers are dedicated to these functions, and the fraction is larger at some other companies. These costs are high because these functions take far too long. For example, it can take anywhere from about a month to almost half a year for an ITO to certify that a new server model is acceptable for use across a large corporation's data centers. (This happens before significant application-level testing!)
It would be useful to know why the process takes so long, but I have been unable to discover any careful categorization of the time spent. (This itself would be a good research project.) In informal conversations, I learned that a major cause of the problem is the huge range of operating system versions that must be supported; although ITOs try to discourage the use of obsolete or modified operating systems, they must often support applications not yet certified to use the most up-to-date, vanilla release. The large number of operating system versions multiplies the amount of testing required.
Virtual machine technology can reduce the multiplication effect, since VMs impose regularity above the variability of hardware platforms. Once a set of operating systems has been tested on top of a given VM release, and that release has been tested on the desired hardware, the ITO can have some faith that any of these operating systems will probably work on that hardware with that VM layered in between. However, this still leaves open the problem of multiple versions of the VM software, and VMs are not always desirable (e.g., for performance reasons).
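A back-of-the-envelope calculation illustrates the multiplication effect and why a common VM layer reduces it; the counts below are hypothetical, not taken from the paper:

```python
# Hypothetical testing-matrix sizes. Without a VM layer, every OS version
# must be certified against every hardware platform; with a common VM layer,
# each OS is certified once against the VM, and the VM once per platform.
os_versions = 12   # assumed number of OS versions the ITO must support
hw_platforms = 8   # assumed number of server models in the data centers

tests_direct = os_versions * hw_platforms    # full cross product: 96 runs
tests_with_vm = os_versions + hw_platforms   # two independent sets: 20 runs

print(tests_direct, tests_with_vm)  # 96 20
```

The additive rather than multiplicative cost is the whole benefit; it disappears, of course, as soon as multiple VM versions must themselves be supported.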
The long lead time for application deployment and upgrades contributes directly to business rigidity. A few companies (e.g., Amazon, Yahoo, Google) are considered ``agile'' because their IT systems are unusually flexible, but most large organizations cannot seem to solve this problem.
At this point, the reader mutters ``But, but, but ... we operating system researchers are all about `flexibility!'.'' Unfortunately, it has often been the wrong kind of flexibility.
To oversimplify a bit, the two major research initiatives to provide operating system flexibility have been microkernels (mix & match services outside the kernel) and extensible operating systems (mix & match services inside the kernel). These initiatives focussed on increasing the flexibility of system-level services available to applications, and on flexibility of operating system implementation. They did not really focus on increasing application-level flexibility (perhaps because we have no good way to measure that; see Section 6).
Outside of a few niche markets, neither microkernels nor extensible operating systems have been successful in the enterprise IT market. The kinds of flexibility offered by either technology seem to create more problems than they solve.
In contrast, VM research has led to market success. The term ``virtual machine'' is applied both to systems that create novel abstract execution environments (e.g., Java bytecodes) and those that expose a slightly abstract view of a real hardware environment (e.g., VMware or Xen). The former model is widely seen as encouraging application portability through the provision of a standardized foundation; the latter model has primarily been viewed by researchers as supporting better resource allocation, availability, and manageability. But the latter model can also be used to standardize execution environments (as exemplified by PlanetLab or Xenoservers); VMs do aid overall IT flexibility.
In this section I suggest a few of the many operating system research problems that might directly or indirectly improve support for business change.
If uncontrolled or unexpected variation in the operating environment is the problem, can we stamp it out? That is, without abolishing all future changes and configuration options, can we prevent OS-level flexibility from inhibiting business-level flexibility?
One way to phrase this problem is: can we prove that two operating environments are, in their aspects that affect application correctness, 100.00000000% identical? That is, in situations where we do not want change, can we formally prove that we have ``sameness''?
Of course, I do not mean that operating systems or middleware should never be changed at all. Clearly, we want to allow changes that fix security holes or other bugs, improvements to performance scalability, and other useful changes that are irrelevant to the stability of the application. I will use the term ``operationally identical'' to imply a notion of useful sameness that is not too rigid.
If we could prove that one host is operationally identical to another, then we could have more confidence that an application, once tested on the first host, would run correctly on the second. More generally, either could be a cluster rather than an individual host.
Similarly, if we could prove that one operating environment is operationally identical to another, an application tested only on the first might be safe to deploy on the second.
It seems likely that this would have to be a formal proof, or else an ITO probably would not trust it (and would have to fall back on time-consuming traditional testing methods). However, formal proof technology typically has not been accessible to non-experts. Perhaps by restricting an automated proof system to a sufficiently narrow domain, it could be made accessible to typical IT staff.
On the other hand, if an automated proof system fails to prove that two environments are identical, the failure should reveal a specific aspect (albeit perhaps one of many) in which they differ. That could allow an ITO either to resolve this difference (e.g., by adding another configuration item to an installation checklist) or to declare it irrelevant for a specific set of applications. The proof could then be reattempted with an updated ``stop list'' of irrelevant features.
It is vital that a sameness-proof mechanism cover the entire operating environment, not just the kernel's API. (Techniques for sameness-by-construction might be an alternative to formal proof of sameness, but it is hard to see how this could be applied to entire environments rather than individual operating systems.) Environmental features can often affect application behavior (e.g., the presence and configuration of LDAP services, authentication services, firewalls, etc.). However, this raises the question of how to define ``the entire environment'' without including irrelevant details, such as specific host IP addresses, and yet without excluding the relevant ones, such as the correct CIDR configuration.
The traditional IT practice of insisting that only a few configuration variants are allowed can ameliorate the sameness problem at time of initial application deployment. However, environments cannot remain static; frequent mandatory patches are the norm. But it is hard to ensure that every host has been properly patched, especially since patching often affects availability and so must often be done in phases. For this and similar reasons, sameness can deteriorate over time, which suggests that a sameness-proof mechanism would have to be reinvoked at certain points.
Business customers are increasingly demanding that system vendors pre-configure complex systems, including software installation, before shipping them. This can help establish a baseline for sameness, but vendor processes sometimes change during a product lifetime. A sameness-proof mechanism could ensure that vendor process changes do not lead to environmental differences that would affect application correctness.
A business cannot effectively manage an IT system when it does not know how much business value that system generates. Most businesses can only estimate this value, for lack of any formal way to measure it. Similarly, a business that cannot quantify the value of its IT systems might not know when it is in need of IT-level change.
ITOs typically have budgets separate from the profit-and-loss accountability of customer-facing divisions, and thus have much clearer measures of their costs than of their benefits to the entire business. An ITO is usually driven by its local metrics (cost, availability, number of help-desk calls handled per hour). ITOs have a much harder time measuring what value their users gain from specific practices and investments, and what costs are absorbed by those users. As a result, large organizations tend to lack global rationality with respect to their IT investments. This can lead to either excessive or inadequate caution in initiating business changes. (It is also a serious problem for accountants and investors, because ``the inability to account for IT value means [that it is] not reflected on the firm's [financial reports]'', often creating significant distortions in these reports.)
Clearly, most business value is created by applications, rather than by infrastructure and utilities such as backup services. This suggests that most work on value-quantification must be application-specific; why should we think operating system research has anything to offer?
One key issue is that accounting for value, and especially in ascribing that value to specific IT investments, can be quite difficult in the kinds of heavily shared and multiplexed infrastructures that we have been so successful at creating. Technologies such as timesharing, replication, DHTs, packet-switched networks and virtualized CPUs, memory, and storage make value-ascription hard.
This suggests that the operating environment could track application-level ``service units'' (e.g., requests for entire Web pages) along with statistics for response time and resource usage. Measurements for each category of service unit (e.g., ``catalog search'' or ``shopping cart update'') could then be reported, along with direct measurements of QoS-related statistics and of what IT assets were employed. The Resource Containers abstraction provides a similar feature, but would have to be augmented to include tracking information and to span distributed environments. Magpie also takes some steps in this direction.
Accounting for value in multiplexed environments is not an easy problem, and it might be impossible to get accurate answers. We might be limited to quantifying only certain aspects of IT value, or we might have to settle for measuring ``negative value,'' such as the opportunity cost of unavailability or delay. (An IT change that reduces a delay that imposes a clear opportunity cost has a fairly obvious value.)
Another value-related problem facing ITOs is the cost of software licenses. License fees for many major software products are based on the number of CPUs used, or on total CPU capacity. It is now widely understood that this simple model can discourage the use of technologies that researchers consider ``obviously'' good, including multi-core and multi-threaded CPUs, virtualized hardware, grid computing, and capacity-on-demand infrastructure. Until software vendors have a satisfactory alternative, this ``tax on technology innovation with little return'' could distort ITO behavior, and inhibit a ``business change'' directly relevant to our field (albeit a one-time change).
The solution to the software pricing crisis (assuming that Open Source software cannot immediately fill all the gaps) is to price based on value to the business that buys the software; this provides the right incentives for both buyer and seller. (Software vendors might impose a minimum price to protect themselves against incompetent customers.)
Lots of software is already priced per-seat (e.g., Microsoft Office and many CAD tools) or per-employee (e.g., Sun's Java Enterprise System), but these models do not directly relate business value to software costs, and might not extend to software for service-oriented computing.
Suppose one could instead track the number of application-level service units successfully delivered to users within prescribed delay limits; then application fees could be charged based on these service units rather than on crude proxies such as CPU capacity. Also, software vendors would have a direct incentive to improve the efficiency of their software, since that could increase the number of billable service units. Such a model would require negotiation over the price per billable service unit, but by negotiating at this level, the software buyer would have a much clearer basis for negotiation.
Presumably, basing software fees on service units would require a secure and/or auditable mechanism for reporting service units back to the software vendor. This seems likely to require infrastructural support (or else buyers might be able to conceal service units from software vendors). See Section 5.5 for more discussion of auditability.
One might also want a system of trusted third-party brokers to handle the accounting, to prevent software vendors from learning too much, too soon, about the business statistics of specific customers. A broker could anonymize the per-customer accounting, and perhaps randomly time-shift it, to provide privacy about business-level details while maintaining honest charging.
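The broker's anonymizing pass can be sketched as follows; all names and the record format are invented, and a real broker would need cryptographic protections well beyond this:

```python
# Broker-side anonymization: strip customer identity from per-customer
# accounting records and randomly time-shift them before forwarding to
# the software vendor, while preserving the billable totals.
import random

def anonymize(records, max_shift_s=3600.0, seed=None):
    """records: list of (customer_id, timestamp, service_units) tuples."""
    rng = random.Random(seed)
    out = [(ts + rng.uniform(0.0, max_shift_s), units)  # drop customer_id
           for _, ts, units in records]
    rng.shuffle(out)  # break ordering clues about which customer is which
    return out

records = [("acme", 1000.0, 5), ("acme", 1100.0, 3), ("globex", 1050.0, 7)]
forwarded = anonymize(records, seed=42)

# Billable totals are preserved even though identities and exact times are not.
assert sum(u for _, u in forwarded) == sum(u for _, _, u in records)
```

The invariant in the final assertion is the design point: honest charging requires that aggregation, not suppression, be the privacy mechanism.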
Operating systems and operating environments include many name spaces; naming is key to much of computer systems design and innovation. We name system objects (files, directories, volumes, storage servers, storage services), network entities (links, switches, interfaces, hosts, autonomous systems), and abstract principals (users, groups, mailboxes, messaging servers).
What happens to these name spaces when organizations combine or establish a new peering relationship? Often these business events lead to name space problems, either outright conflicts (e.g., two servers with the same hostname) or more abstract conflicts (e.g., different designs for name space hierarchies). Fixing these conflicts is painful, slow, error-prone, and expensive. Alan Karp has articulated the need to ``design for consistency under merge'' to avoid these conflicts.
And what happens to name spaces when an organization is split (e.g., as in a divestiture)? Some names might have to be localized to one partition or another, while other names might have to continue to resolve in all partitions. One might imagine designing a naming system that supports ``completeness after division,'' perhaps through a means to tag certain names and subspaces as ``clonable.''
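A toy version of ``completeness after division'' can be sketched as follows; the flat data model and all names are invented for illustration, and a real design would have to handle hierarchies, delegation, and concurrent updates:

```python
# Splitting a name space: names tagged clonable are replicated into every
# partition so they continue to resolve everywhere; untagged names go only
# to the partition they are assigned to.
def split_namespace(bindings, assignment, partitions):
    """bindings: {name: (value, clonable)}; assignment: {name: partition}."""
    result = {p: {} for p in partitions}
    for name, (value, clonable) in bindings.items():
        if clonable:
            for p in partitions:                  # resolves in all partitions
                result[p][name] = value
        else:
            result[assignment[name]][name] = value
    return result

bindings = {"/corp/dns": ("10.1.1.1", True),      # shared infrastructure
            "/corp/payroll": ("srv-pay", False)}  # divested to one side
parts = split_namespace(bindings, {"/corp/payroll": "newco"},
                        ["newco", "oldco"])
print(parts["oldco"])  # {'/corp/dns': '10.1.1.1'}
```

The interesting research questions are exactly the ones the toy ignores: who may set the clonable tag, and what happens when cloned subspaces later diverge.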
When systems researchers design new name spaces, we cannot focus only on traditional metrics (speed, scale, resiliency, security, etc.); we must also consider how the design supports changes in name-space scope.
IT practice increasingly tends towards outsourcing (distinct from ``offshoring'') of critical business functions. Outsourcing can increase business flexibility, by giving a business immediate access to expertise and sometimes by better multiplexing of resources, but it requires the business to trust the outsourcing provider. Outsourcing exposes the distinction between security and trust. Security is a technical problem with well-defined specifications, on which one can, in theory, do mathematical proofs. Trust is a social problem with shifting, vague requirements; it depends significantly on memory of past experiences. Just because you can prove to yourself that your systems are secure and reliable does not mean that you can get your customers to entrust their data and critical operations to you.
This is a variant of what economists call the ``principal-agent problem.'' In other settings, a principal could establish its trust in an agent using a third-party auditor, who has sufficient access to the agent's environment to check for evidence of incorrect or improper practices. The auditor has expertise in this checking process that the principal does not, and also can investigate agents who serve multiple principals without fear of information leakage.
Pervasive outsourcing might therefore benefit from infrastructural support for auditing; i.e., the operating environment would support monitoring points to provide ``sufficient access'' to third-party auditors. Given that much outsourcing will be done at the level of operating system interfaces, some of the auditing support will come from the operating system. For example, the system might need to provide evidence to prove that principal A cannot possibly see the files of principal B, and also that this has never happened in the past.
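The kind of query an auditor might run against such monitoring points can be sketched as follows; the log format and all names are invented, and note that a log query can only provide evidence about the past, while proving that cross-access *cannot* happen additionally requires configuration-level evidence:

```python
# Third-party audit query over an access log: find any entries in which a
# given principal touched files owned by another principal. Each log entry
# records who acted and who owned the object acted upon.
def cross_access_violations(access_log, principal, owner):
    """Return log entries where `principal` accessed files owned by `owner`."""
    return [e for e in access_log
            if e["principal"] == principal and e["owner"] == owner]

log = [{"principal": "A", "owner": "A", "path": "/data/a/report"},
       {"principal": "B", "owner": "B", "path": "/data/b/ledger"}]

# No evidence that principal A ever saw principal B's files.
assert cross_access_violations(log, "A", "B") == []
```

For the auditor's conclusion to mean anything, of course, the operating environment must guarantee that the log is complete and tamper-evident, which is itself a systems research problem.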
The problems of enterprise computing, and especially of improving business-level (rather than IT) metrics, are far outside the comfort zone of most operating system researchers.
Perhaps the biggest problem is that we lack quantified metrics for things like ``business flexibility.'' (Low-level flexibility metrics, such as ``time to add a new device driver to the kernel,'' are not the right concept.) Lacking the metrics, we cannot create benchmarks or evaluate our ideas.
Rob Pike has argued that ``In a misguided attempt to seem scientific, there's too much measurement: performance minutiae and bad charts. ... Systems research cannot be just science; there must be engineering, design, and art.'' But we must measure, because otherwise we cannot establish the value of IT systems and processes; however, we should not measure the wrong things (``performance minutiae'') simply because those are the easiest for us to measure.
Metrics for evaluating how well IT systems support business change will not be as simple as, for example, measuring Web server transaction rates, for at least two reasons. First, because such evaluations cannot be separated from context; successful change inevitably depends on people and their culture, as well as on IT. Second, because business change events, while frequent enough to be problematic, are much rarer and less repeatable than almost anything else computer scientists measure. We will have to learn from other fields, such as human factors research and economics, ways to evaluate how IT systems interact with large organizations.
I will speculate on a few possible metrics.
Section 6 describes some daunting problems. How can we possibly do research in this space? I think the answer is ``because we must.'' Support for CS research, both from government and industry, is declining [9,19]. If operating system research cannot help solve critical business problems, our field will shrink.
The situation is not dire. Many researchers are indeed addressing business-level problems. (Space prohibits a lengthy description of such work, and it would be unfair to pick out just a few.) But I think we must do better at defining the problems to solve, and at recognizing the value of their solution.
In preparing this paper, I had help from many colleagues at HP, including Martin Arlitt, Mary Baker, Terence Kelly, Bill Martorano, Jerry Rolia, and Mehul Shah. The HotOS reviewers also provided useful feedback.
This paper was originally published in the
Tenth Workshop on Hot Topics in Operating Systems (HotOS X),
June 12-15, 2005, Eldorado Hotel, Santa Fe, NM