Invited Talk (scribe: David Oppenheimer)
Operating System Reliability: Does Anyone Care Any More?
Andrew Hume, AT&T Labs-Research
Andrew Hume of AT&T Research gave the HotOS keynote talk, entitled
"Operating System Reliability: Does Anyone Care Any More?" He
explained that his perspective comes from having designed,
implemented, and delivered large-data applications for more than ten
years. The problems he discussed in the talk were that operating
systems have gone "from a help to a hindrance," that even users'
lowered expectations for operating systems have not been met, and that,
as a result, applications have to be designed around OS quirks. Hume
pointed out that this situation hasn't always been the case, citing
WORM-based backup systems in research versions of UNIX and a
cluster-based billing system that AT&T built using Plan 9 as
examples of systems that were highly reliable, even under load.
The first problematic system Hume described was Gecko, a large-scale
(250GB/day) billing system implemented in 1996 on Solaris 2.6. AT&T
required 1 GB/s of filesystem throughput and predictable use of
memory. Among the problems encountered were that Solaris crashed every
few days for the first six months that the system was in production, that
Solaris limited file throughput to about 600 MB/sec, that reading
large files sequentially crashed the VM, and that a "VM roller
coaster" developed when a large chunk of memory was allocated: all of
physical memory was paged out and then almost all of the same data was
paged back in, over and over again in a cycle, rather than the system
simply paging out the amount of newly allocated memory.
The second problematic system Hume described was a replacement for
Gecko that required 6 times the capacity of the original Gecko. This system
was implemented on a cluster running Linux. The architecture was a
"Swiss canton" model of loosely affiliated independent nodes with a
single locus of control, data replication among nodes, and a single
error path so that software could only halt by crashing (there was no
explicit shutdown operation). Hume described eight problems the Gecko
implementors experienced with Linux (kernel versions 2.4.18 through 2.4.20),
including Linux's forcing all I/O through a filesystem buffer cache
with highly unpredictable performance scaling (30 MB/s to write to one
filesystem at a time, 2 MB/s to write to two at a time), general I/O
flakiness (1-5% of the time corrupting data read into gzip), TCP/IP
networking that was slow and behaved poorly under overload, lack
of a good filesystem, nodes that didn't survive two reboots,
and slow operation of some I/O utilities such as df. In general, Hume
said, he has concluded that "Linux is good if you want to run Apache or
compile the kernel. Every other application is suspect."
Hume proposed the following definition of OS reliability: "[The OS]
does what you ask, or it fails within a modest bounded time." He noted
that FreeBSD has comparable functionality to Linux, better
performance, and higher reliability, and he speculated that this might
stem from BSD's (and other "clean, lean, effective systems") having
been built using "a small set of principles extensively used, and a
sense of taste of what is good practice, clearly articulated by a
small team of mature, experienced people." Hume took Linux to task for
not demonstrating these characteristics, in particular for being too
bloated in terms of features, and for having been developed by too
large a team. Further, he singled out the Carrier Grade Linux effort
for special condemnation for "addressing zero of the [type of]
problems" he has had.
During Q&A, Margo Seltzer asked if perhaps part of the problem is that
the dependable systems community has traditionally been isolated from
the operating systems community. Hume agreed, and he suggested that
the dependability community may understand some things that the OS
community does not yet, such as the end-to-endness of the reliability
issue. Armando Fox asked if perhaps the problem is not the sheer
number of developers who contribute to projects that have turned out
imperfect but what many users perceive as "good enough" systems
(e.g., Linux and HTTP), but, rather, the need for invariants to which all
the developers would program, to help make such systems more
reliable. Fox suggested Hume's rule of "It does what you ask or it fails
within a modest bounded time" as one possible invariant. Hume agreed,
saying that he is working hard on educating the Linux community about
how to make the OS more reliable, particularly in large-scale HA
configurations like that used by Gecko. Jeff Mogul suggested that
perhaps what's missing is repeatable, easy-to-run OS dependability
benchmarks based on the kinds of failure anecdotes Hume was
describing. Jay Lepreau suggested that instead of taking a large
general-purpose OS like Linux and stripping it down to a reliable
core, an alternative approach would be to start with a reliable
application-specific OS and to add just the extra functionality that
might be necessary to run Gecko. Hume disagreed, suggesting that he
had more hope for making existing general-purpose OSes more
reliable. Tal Garfinkel pointed out that Hume had talked primarily
about performance scalability rather than absolute reliability, and
Hume agreed that for his application absolute single-node reliability
is less important than the ability to get predictable performance
scalability when more nodes are added and predictable performance
degradation when nodes fail. Ethan Miller asked if product liability
lawsuits might help the OS reliability problem, but Hume disagreed,
stating that he is "not keen on litigation to direct OS research."
Finally, Val Henson asked if OS reliability, while bad, is perhaps
"good enough" for most people. Hume agreed, but suggested that we
should raise our expectations.
10:45 a.m.-12:55 p.m.: The Emperor's Clothes (Chair: Steve Gribble; scribe: Matt Welsh)
High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two
Charles Blake and Rodrigo Rodrigues, MIT Laboratory for Computer Science
Charles Blake opened the first technical session with a talk on the
overhead of "maintenance bandwidth" (network bandwidth consumed to
maintain a given level of replication or redundancy) in a peer-to-peer
storage system. The basic argument is that maintenance bandwidth across
the WAN, not
the aggregate local disk space, is the fundamental limit to scalability in these systems. Given the dynamics of nodes joining and
leaving the system, Charles presented a conservative estimate of the
maintenance bandwidth that scales with the WAN bandwidth and average
lifetime of nodes in the system. Under a typical scenario (100 million
cable modems with a certain bandwidth available for replication, one
week average lifetime, and 100 GB storage per node), only 500 MB of
space per node is usable, a mere 0.5% of the total.
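The arithmetic behind that estimate is simple and can be sketched in a few lines. This is a rough illustration, not the paper's model: if every byte a node stores must be re-copied to a replacement node once per average node lifetime, then usable space is bounded by replication bandwidth times lifetime. The 1 KB/s replication bandwidth below is an assumed figure, chosen only to land near the numbers quoted above.

```python
# Rough sketch of the maintenance-bandwidth bound described above.
# Assumption: every byte a node stores must be re-replicated once per
# average node lifetime, so usable_space <= replication_bw * lifetime.
WEEK_SECONDS = 7 * 24 * 3600

def usable_space_bytes(replication_bw_bytes_per_s, avg_lifetime_s):
    """Upper bound on per-node storage a P2P system can maintain."""
    return replication_bw_bytes_per_s * avg_lifetime_s

# Hypothetical cable-modem scenario: ~1 KB/s reserved for replication,
# one-week average node lifetime.
space = usable_space_bytes(1_000, WEEK_SECONDS)
print(space / 1e6, "MB")  # ~605 MB -- the same order as the 500 MB figure
```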
To try to address these problems, Charles looked at alternatives such as
admission control (only admitting "reliable" nodes) or incentivizing
nodes to have long lifetimes. It turns out that a small core of reliable
nodes (such as a few hundred universities with a single reliable machine
dedicated to hosting data) yields as much maintenance bandwidth reduction
as millions of home users with flaky connections. The talk concluded
with a number of open issues in organizing WAN-based storage systems,
such as whether it is appropriate to assume millions of flaky users and
whether the requirement of aggregate data availability should be relaxed.
Armando Fox raised the point that, rather than high availability,
millions of home users really bring lack of accountability. The question
is whether there is any benefit other than pirating movies, and Charles
answered that perhaps circumventing legal requirements should be a top
design goal of these systems.
Eric Brewer argued that once a system becomes popular, it will be
centralized, but that P2P systems are really useful for prototyping.
Charles responded that, while there are benefits to collaboration,
one may as well use a small number of available,
well-connected/well-motivated friends to do a prototype.
Margo Seltzer wondered whether P2P could be useful to the Grid community
and their interest in harvesting cycles rather than storage. Charles agreed.
Andrew Hume pointed out that the Grid community is talking about needing
gigabits of networking between sites.
Mike Swift wondered about not
providing a disconnect button in the user interface, as a way of creating an incentive to users to offer long lifetimes.
Timothy Roscoe asked whether the tradeoff between storage space and bandwidth
usage was inherent, or whether it was a transient condition resulting from current
economic conditions. Charles replied that telcos may not have the
motivation to increase bandwidth upstream from home users if most
applications (e.g., Web surfing) require only a small amount.
One Hop Lookups for Peer-to-Peer Overlays
Anjali Gupta, Barbara Liskov, and Rodrigo Rodrigues, MIT Laboratory for Computer Science
Anjali Gupta presented a talk on the use of one-hop lookups in
peer-to-peer systems, avoiding the high latency associated with the
typical log(N) lookup paths required by most systems. The challenge is
keeping up with membership change information on all hosts. For example,
the UW Gnutella study in 2002 showed an average node session time of 2.9
hours, implying 20 membership changes per second in a system with 100,000
hosts. Anjali presented a hierarchical scheme, in which the address
space (forming a ring) is subdivided into slices, each with a slice
leader that is the successor to the midpoint in the slice. Slices are
further subdivided into units.
The basic approach is for nodes to
exchange frequent keep-alive messages with their predecessor and
successor nodes. A change to a node's successor is an event that is
propagated by piggybacking a recent event log onto keep-alive messages.
A node change event is relayed to the slice leader, which periodically
(every 30 seconds) notifies other slices of the updates. Internally
to a slice, slice leaders periodically (every 5 seconds) notify unit
leaders of node change information. Given some reasonable assumptions on
the size of the system, all nodes can be updated within 45 seconds of a
node leaving or joining the system, which permits a 99% "hit rate" for
an address lookup. In this scheme, it is important to choose good slice
leaders that are well-provisioned. Anjali concluded with a summary of
ongoing implementation and experimentation work, noting that systems
larger than a million nodes will require two-hop lookups.
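The slice structure can be sketched in a few lines. This is a toy illustration: the ring size, slice count, and node IDs are invented (real deployments use 128-bit identifiers), but it shows the rule described above, where a slice leader is the node whose ID succeeds the slice midpoint.

```python
# Toy sketch: divide a circular ID space into slices and pick each
# slice's leader as the successor of the slice midpoint, per the talk.
import bisect

RING_BITS = 16               # toy ring of 2^16 IDs (illustrative)
RING = 2 ** RING_BITS
N_SLICES = 8

def slice_of(node_id):
    """Which slice of the ring a node ID falls into."""
    return node_id * N_SLICES // RING

def slice_leader(slice_idx, sorted_node_ids):
    """Successor of the slice midpoint, wrapping around the ring."""
    midpoint = (slice_idx * RING + RING // 2) // N_SLICES
    i = bisect.bisect_left(sorted_node_ids, midpoint)
    return sorted_node_ids[i % len(sorted_node_ids)]

nodes = sorted([1000, 9000, 20000, 33000, 41000, 52000, 60000])
print(slice_of(9000), slice_leader(slice_of(9000), nodes))  # 1 20000
```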
Mike Jones noted that there would most likely be a burst of traffic to and
from a new node as it comes up, and he asked how a node just joining
could communicate when it would take the 45-second update propagation
delay before other nodes would know how to reach the new node.
Anjali responded that a node starts participating only after it
completely downloads a routing table from a unit or slice leader.
Dan Wallach asked about correlated failures, where a node and its
successor fail at the same time. Anjali responded that the odds of this
occurring are low and that this only introduces a bit more delay for the
update. Dan asked about malicious nodes, and Anjali responded that this
is not addressed by this work.
Ethan Miller asked about how the system scales up to a million-node
system; Anjali said that this system reaches its upper limit at that
size and does not scale beyond it.
Timothy Roscoe asked whether this system assumes that every node can
talk to every other node in the network directly, and whether this was
realistic. The response was that they were indeed making the
assumption that every node can talk to its neighbors and unit and
slice leaders. There are cases when this assumption may be invalid,
and this question will be addressed in future work.
Eric Brewer pointed out that there is a lot of work in industry on
hierarchical caching, allowing a big win since updates need not be
propagated unless an object is hot. Anjali said that they were
looking at designs where information was propagated for all objects,
and had not looked at schemes that would do bookkeeping for object
popularity. Nevertheless, it was a good idea and worth looking at.
An Analysis of Compare-by-hash
Val Henson, Sun Microsystems
Val Henson presented one of the most controversial papers of the
conference, criticizing systems that compare data by comparing
cryptographic hashes of the data. Many systems (such as rsync,
Venti, Pastiche, LBFS, and OpenCM) use this technique, but it is not yet
widely accepted by OS researchers, due to little characterization of the
technique and many unanswered questions. The risk of collision using (say)
a 160-bit SHA-1 hash is just 2^-160, which is far lower than the
probability of a hardware failure or of an undetected TCP error. So why the controversy?
First, these techniques assume that data is random, but real data is not
random and has a fair amount of commonality (think about ELF headers
and English text). Second, cryptographic hashes were designed for
authentication and care about "meaningful" collisions, such as two
contracts with the same text but different dollar amounts that happen to
collide in the hash space. Third, hash algorithms are short-lived and
obsolescence is inevitable, so systems need an upgrade strategy. Finally,
collisions are deterministic (two blocks that collide always collide) rather than a transient error such as a hardware fault. Hash collision
is therefore a silent error in those systems that rely on compare-by-hash
techniques. Val claims that we should be striving for correctness in
systems software, not introducing "known bugs." It is OK to rely on
compare-by-hash when the address space is not shared by untrusted parties,
and when the user knows and expects the possibility of incorrect behavior; she cited rsync as an example. Note that "double hashing" is not an
acceptable solution, as this results in just another hash function,
albeit one with a lower collision probability.
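For readers unfamiliar with the technique, a minimal sketch of compare-by-hash (SHA-1 as named in the talk; everything else is illustrative). The point of contention is visible in the code: equality of digests is silently taken as equality of data.

```python
# Sketch of compare-by-hash: two blocks are treated as identical when
# their cryptographic digests match. A deterministic hash collision
# would be silently accepted as equality -- Val's central objection.
import hashlib

def same_block(a: bytes, b: bytes) -> bool:
    return hashlib.sha1(a).digest() == hashlib.sha1(b).digest()

print(same_block(b"hello", b"hello"))  # True
print(same_block(b"hello", b"world"))  # False
```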
Some alternatives to compare-by-hash were discussed, such as
content-based addressing that checks for collisions; using compression;
maintaining state so that identical blocks are sent or stored only once (as in LBFS);
sending diffs instead of entire blocks; or using universal IDs for blocks.
Peter Druschel argued that Val was taking a purist view of correctness,
and that optimizations in general lead to complexity which reduces
reliability in systems. Val responded that this has to do with
tradeoffs, and that the tradeoffs made in compare-by-hash were not the right ones.
Tal Garfinkel claimed that Val's arguments were based on "math FUD" and
that paranoia is unwarranted; after all, all of cryptography is
based on this technique. Val responded that we should not assume that
hash functions designed for cryptography are appropriate for these
applications. Tal further explained that similar data does
not produce similar hashes, contrary to her assumption.
Andrew Hume pointed out that checksums work fine as long as there is a
uniform distribution across the hash space. He also commented that
while he is skeptical of Val's birthday-paradox argument that with
2^80 documents there is a 50% chance of a collision with a 160-bit hash,
if it is true, it is a very scary probability, and a lot of
people would be much happier with a much smaller threshold, like one in
a million or one in a billion. Val answered that this is a difficult
number to calculate (due to the infinitesimal quantities involved);
indeed, teams of people and weeks of time could not produce
the standard "M = sqrt(2 * p * N)" formula for the address-space
pressure needed to produce a collision with probability p, but
they believe it.
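The formula itself is easy to check numerically. This sketch plugs in the figures from the exchange above: SHA-1's 160-bit space, p = 0.5 for Hume's 2^80 figure, and the one-in-a-billion threshold he suggested people would prefer.

```python
# Birthday-bound check: M = sqrt(2 * p * N) documents give a collision
# with probability roughly p in a hash space of N values.
import math

N = 2.0 ** 160               # SHA-1 output space
M = math.sqrt(2 * 0.5 * N)   # documents for a 50% collision chance
print(math.log2(M))          # 80.0 -- i.e., 2^80 documents, as stated

# For a "comfortable" threshold like one in a billion:
M_safe = math.sqrt(2 * 1e-9 * N)
print(math.log2(M_safe))     # ~65.6 -- still an enormous document count
```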
Dirk Grunwald asked at what point the hash collision probability
becomes infinitesimal compared to everything else, such as TCP or ECC
bit errors. Val again answered that the seriousness of the undetected
collision made it important to consider the possibility.
Why Events Are a Bad Idea (for High-Concurrency Servers)
Rob von Behren, Jeremy Condit, and Eric Brewer, UC Berkeley
Rob von Behren raised the argument of thread-based versus event-driven
concurrency in high-concurrency servers, claiming that thread-based
approaches are far better due to their ease of programming. To counter
the arguments that threaded systems have inherently higher overhead than
event-driven ones, Rob presented early results from a lightweight user-level thread
system that performed as well as an event-driven system on a Web server
benchmark. Furthermore, threads have better programming and debugging
tools, leading to increased productivity. To address the problem of high
overhead for per-thread stacks, Rob proposed the use of compiler support
to automatically compress stacks, for example, by moving "dead" elements
off the stack across a blocking operation. Using cooperative
scheduling avoids the overhead of generic thread synchronization;
however, there are some issues to address here such as fairness, the use
of multiprocessors, and how to handle spontaneous blocking events such
as page faults.
Rob pointed out that events have the advantage of permitting very complex
control flow structures; however, very few programmers use these
structures, and threads can capture the vast majority of scenarios.
Another problem with thread schedulers is that they are "generic" and
have little knowledge of application structure. To permit greater
cache locality, Rob proposed "2D" batch scheduling, in which the compiler
annotates the application code to indicate to the scheduler system where
the various stages of the thread's execution are located.
Rob presented some measurements of a simple Web server benchmark based
on his user-level threads package, capable of supporting over 100,000
threads, implemented in about 5000 lines of C code. The server
outperforms a Java-based event-driven Web server, probably due to the
large number of context switches in the event-driven system.
Rob concluded that it may be possible to achieve higher performance
using threads than events, in part because events require dynamic
dispatch through function pointers, which makes it difficult to perform
inlining and branch prediction.
Timothy Roscoe shared a vignette from his upbringing: "When I was young,
I was surrounded by big tall men with beards, who talked about something
called continuations." He asked whether this is what Rob was proposing
and whether we shouldn't be looking at what the functional languages
community is doing in this area. Rob was brief: "Yes."
Margo Seltzer asked about the background of this work, to which Rob
explained that the UC Berkeley Ninja project had first gone down the
path of using threads to build a large-scale system, and then switched
to events to address some of the scalability problems. Both approaches
"kind of sucked," and when Rob started writing tools to fix this problem
he realized that a lean, user-level threading approach was the right one.
Jeff Chase asked about writing tools to make it easier to write in an
explicit event-based model, rather than a thread-based model. His
motivating example was distributed systems using callbacks, for which an
event-driven style of programming seems inevitable. Rob responded that
one does not need to use events to handle callbackswith cheap enough
threads, one can spawn a new thread to wait for the callback.
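Rob's point can be illustrated with a toy sketch (not from the talk; the names are invented): with cheap threads, waiting for a callback is just a blocking call, so control flow stays linear instead of being inverted into event-handler style.

```python
# Sketch of Rob's point: with cheap threads, a "callback" becomes code
# that runs after a blocking wait, keeping control flow top-to-bottom.
from concurrent.futures import ThreadPoolExecutor

def remote_call():
    return 42  # stand-in for a blocking RPC

results = []

def on_reply(result):
    # In an event-driven system this would be a registered callback.
    results.append(result)

with ThreadPoolExecutor() as pool:
    future = pool.submit(remote_call)  # spawn a thread for the call
    on_reply(future.result())          # waiting is just a blocking call

print(results)  # [42]
```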
Charles Blake opened the general discussion by asking why Val's birthday paradox
probability was so hard to compute. She responded that essentially it
comes down to the infinitesimal numbers involved. Charles then
contended that this has nothing to do with the actual state of affairs,
since the next-order term in M = sqrt(2pN) is proportional to
O((M/N)^2), which is negligible at any of the address-space pressures
in question.
Eric Brewer pointed out that systems should use CRC, not MD5; since CRC
is no good for preventing malicious collisions, there should be no illusion
that it is. One should also use a random salt with the checksum, which
should help with the non-randomness of real data. Val responded that if
one has to recompute the checksum across the actual data, then one
loses the benefits of the technique.
George Candea raised the point that, although a P2P client that
prevents a user from disconnecting appears less desirable at first, it
would lead to higher availability for the service as a whole. This
makes the service more valuable and hence provides a greater incentive
to use it (i.e., download the client).
Ethan Miller asked whether people are really comfortable with the
concept of probabilistic storage. Val agreed that the notion of dynamic,
unreliable storage systems makes her uncomfortable.
Ranjita Bhagwan pointed out that Charles's calculations don't push P2P
out of the picture, asking whether there may be a cost benefit to a
peer-to-peer approach versus a centralized approach. Charles said that
fundamentally his argument was economic, concerning the bandwidth versus
storage requirement for these systems. Andrew Hume said that the best
nodes are professionally managed and that high-bandwidth connections and
support are expensive, so the economics of the two approaches are more
similar than they are different.
Mohan Rajagopalan said that compiler optimizations actually perform very
well for event-based systems, and that implementation is what really
matters. Event-based systems permit a decoupling between the caller and
callee, so it is easier to write an event-based "adaptive" program than one
in threads. Isn't this a fundamental benefit? Rob responded that events
do make it easier to perform composition and interpositioning, but that
this can also be done in the thread model. Eric Brewer mentioned that Click
is very configurable and runs as a single large thread.
Peter Druschel was skeptical that we can do P2P storage based on home
desktops connected to the Internet, but argued that the alternative is not centralized systems.
For example, one can reap the benefits of unused desktop systems
within a large organization. Charles did not disagree with that.
This was followed by an exchange between Peter and Rodrigo Rodrigues
about using so-called "scalable lookup" (Pastry/Chord/CAN/...)
vs. some other organization for P2P file storage. Basically Rodrigo
pointed out that in a scenario where the individual nodes are very
available/reliable and the network isn't giant, there is no need for
scalable lookup and other considerations should take higher priority.
Peter responded that having a large number of nodes and security
implied the need for small lookup-state optimizations.
2:30 p.m.-4:15 p.m.: Popping & Pushing the Stack (Chair: Geoff Voelker; scribe: Ranjita Bhagwan)
TCP Offload Is a Dumb Idea Whose Time Has Come
Jeffrey C. Mogul, Hewlett Packard Laboratories
TCP offload in the traditional sense violates performance
requirements, has practical deployment issues, and targets the
wrong applications. TCP offload engines (TOEs) impose complex interfaces
and cause suboptimal buffer management, and with many small connections,
connection-management costs overwhelm any savings. Event management is a
problem, and many virtual resources need to be managed. Also, one of the
main motivations for TOE has been the belief that the TCP implementation in the OS is bad.
However, it is no longer a dumb idea, because now we are offloading higher-level protocols onto hardware. The justification for offloading TCP
is simply that you can't offload the higher-level protocols without
also offloading TCP. The sweet spot for TCP offload is when
the application uses very high bandwidth and has relatively low end-to-end
latency, long connection durations, and relatively few connections;
storage server access and graphics are examples. Also, several economic
trends are favorable to TCP offload. One would like to replace
special-purpose hardware with cheap commodity parts, such as 1 gig or
10 gig Ethernet. With these in place, operators will have
only one kind of fabric to provision, connect, and manage. Still, many
challenges remain. Data copy costs still dominate, and buses are too
slow. Zero copy and single copy seem too hard to adopt in commercial OSes.
However, with the advent of RDMA, vendors want to ship RNICs in volume,
allowing one kind of chip for all applications. It would mean cheaper
hardware. Also, there are several upper-level protocols available,
such as NFSv4 and DAFS. Still, many problems
exist. There are security concerns, and so far the benefits have been
elusive. The new networking model may require changes to traditional
OS APIs. Systems people need to give this due consideration.
Margo Seltzer observed that by coupling RDMA with TCP in hardware, some of
these issues might be resolved, since now you would get a generic version of
RDMA. Jeff said that offloading even higher-level protocols is possible.
Rob von Behren asked what the write interface should be like. Jeff
responded that standardization of APIs is not a priority. Currently,
the concentration is on standardization of primitives. However, he does
not know if that is sufficient.
Val Henson said that Linux has a commercially viable implementation of
zero-copy that uses TCP checksum implementation in the NIC.
TCP Meets Mobile Code
Parveen Patel, University of Utah, David Wetherall, University of Washington, Jay Lepreau, University of Utah, Andrew Whitaker, University of Washington
The authors address the problem of deployment of transport protocols
by proposing an extensible transport layer called XTCP. The main
argument is that transport protocols such as TCP need a self-upgrade
mechanism, and untrusted mobile code can help build such a mechanism.
Several modifications to TCP, as well as alternative transport
protocols, have been proposed. However, as with any new protocol,
deployment is an issue. Currently, it takes many years before a new
protocol or an extension can be used by applications: a new protocol or
extension has to be approved by standards committees, implemented by
OS vendors, and finally enabled by default at both ends of a connection.
In the proposed solution, untrusted peers can upgrade each other with
new transport protocols using mobile code. A typical usage scenario is
that of a Web server. A Web server can download a high-performance
version of TCP, after which it tells every client to download the same
version from it. Then the client and the server can speak the upgraded
version of TCP. This solution avoids all the steps of the deployment
process that need approval and support from third parties, such as
standards committees and OS vendors.
There are several challenges to building such an extensible layer,
especially of host and network safety. The presenter contrasted XTCP
with "active networking" and argued that the domain of transport
protocols is restricted enough that host and network safety
challenges can be met without degrading performance.
Host safety is assured by providing memory protection and resource
control. Memory protection is achieved by using Cyclone, a typesafe
C-like language. The stylized memory-usage pattern of TCP
extensions (no shared state between extensions, and predictable
ownership of data buffers) makes resource control possible using
traditional runtime methods. XTCP uses the well-understood notion of
TCP-friendliness as a measure of network safety. All extensions
written using the XTCP framework are forced to conform to
TCP-friendliness using the ECN nonce mechanism. In contrast, active
networking had no such well-defined notion of network safety, and host
safety in the face of arbitrary code was costly.
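For context, the usual quantitative statement of TCP-friendliness is the simplified TCP throughput equation from the modeling literature. XTCP enforces friendliness via the ECN nonce mechanism rather than this formula, so the sketch below is purely illustrative.

```python
# The standard "TCP-friendly" rate bound (simplified Mathis/Padhye
# form): a conformant flow should not exceed roughly this throughput.
# Illustrative only -- XTCP itself enforces friendliness differently.
import math

def tcp_friendly_rate(mss_bytes, rtt_s, loss_rate):
    """Approximate max throughput (bytes/s) of a TCP-friendly flow."""
    return mss_bytes / (rtt_s * math.sqrt(2 * loss_rate / 3))

# 1460-byte segments, 100 ms RTT, 1% loss:
rate = tcp_friendly_rate(1460, 0.1, 0.01)
print(rate / 1e3, "KB/s")  # roughly 180 KB/s
```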
XTCP has been implemented in FreeBSD 4.7. Support for user-level
transports is being developed currently.
Constantine Sapuntzakis asked how many extensions could be implemented on the
XTCP API. Parveen said that 25 of the extensions they had considered
could be; the other two were not TCP-friendly.
Peter Steenkiste asked who builds the headers. Parveen responded that
XTCP builds the IP header, while the extension builds the transport
header. Peter then asked, if the definition of TCP-friendliness
changes, how would XTCP work? The speaker said that, currently, changing
the definition of TCP-friendliness would require a regular upgrade of
the operating system.
Val Henson said that she was not convinced that XTCP could be secure.
She thought extensions could be malicious or non-performing
in more ways than stated, although she lacked specific examples.
Parveen responded that non-performance of
extensions is not a threat to XTCP. This is because extensions can hurt or benefit only the code provider. For example, if Yahoo accepts
code from a client, then that extension will be used only in sessions
with that client, so the extension provider has an incentive to
provide good code.
Geoff Voelker asked what would happen in the case of worm spread. One
can compromise a server such as yahoo.com and then use this mechanism to
spread worms. Each worm would behave in a TCP-friendly manner, but many
worms could collaborate to attack a remote host. Parveen responded that
such a worm could not be implemented using XTCP alone, because the IP
header is built by the XTCP layer, and the extensions cannot affect
it. Therefore extensions cannot send data to an arbitrary host unless
an application explicitly opens a connection to it and sends
data. Furthermore, an extension received from yahoo.com would be used
for communication with yahoo.com alone. That makes XTCP as harmless, or
as harmful, as the currently deployed trusted versions of TCP.
Exploiting the Synergy between Peer-to-Peer and Mobile Ad Hoc Networks
Y. Charlie Hu, Saumitra M. Das, and Himabindu Pucha, Purdue University
There appear to be a number of similarities in the problems that
research addresses in peer-to-peer and mobile ad-hoc networking. One
such area is that of routing. The speaker showed the similarity
to the problems solved by Pastry and how it can be used in ad-hoc
networking. He described a new protocol, DPSR, which stores
routing state in a manner similar to Pastry's. This reduces routing
state per node from O(N) to O(log N). DPSR handles node ID assignment,
node state, routing, node joins, and node failure or unreachability
in much the same manner as Pastry; it inherits all DSR
optimizations on source routes and contains a number of additional
optimizations related to Pastry's routing structures and operations.
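The O(log N) figure follows from the shape of a Pastry-style routing table: roughly log base 2^b of N rows, each with 2^b - 1 entries. A quick illustrative calculation (b = 4, a common Pastry default, is assumed here):

```python
# Why Pastry-style state is O(log N): a routing table has about
# ceil(log_{2^b} N) rows of (2^b - 1) entries each, versus DSR-style
# state that can grow with N. Parameters are illustrative.
import math

def pastry_table_entries(n_nodes, b=4):
    rows = math.ceil(math.log(n_nodes, 2 ** b))
    return rows * (2 ** b - 1)

print(pastry_table_entries(50))      # 30 entries for a 50-node MANET
print(pastry_table_entries(10_000))  # 60 entries at 10,000 nodes
```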
Simulations of DPSR for a 50-node system show that the routing overhead
of DPSR scales better than that of DSR. In short, DPSR outperforms DSR
when the number of connections per source is greater than one, and ties
it otherwise.
Timothy Roscoe asked how many routes were going through the Pastry
routing for more than one hop. Charlie said that though he did not
specifically measure this number, the average number of overlay hops
for the 50-node simulation is around 1.5.
Srini Seshan asked why not add a low-TTL flood optimization
to DSR. Charlie said this does not help in general.
Bogdan Popescu asked Parveen (XTCP) if you could use a signing mechanism
to detect unresponsive connections. Parveen said that the nice thing
about XTCP is that it works well without it. Bogdan said that then you
could have DoS attacks.
Rob Von Behren said that it would be very easy to do DoS
on XTCP, such as malloc'ing large amounts of memory, using a lot of CPU
time, etc. Parveen said that each malloc call is accounted for. Rob
responded that there is the problem of DDoS. With a considerable
number of nodes using a little too much memory, one could perform a DDoS
attack. Parveen said that this is possible; the only way to avoid it is
strict admission control.
Jay Lepreau brought up Geoff
Voelker's question again, and said that the code is identified by a
hash. One could develop a history-based code-rating scheme, so that if
a specifically "bad" user tried to use an XTCP extension, the server
would disallow it. But then that user could keep changing the code, and
thus its hash, to evade the rating.
Jeff Mogul said that you have to make sure that XTCP itself is not
subvertible, because if it is, it is a very rich environment for attackers.
Peter Steenkiste said that in the early '90s, after 6 months of effort,
he had decided that TCP offloading is no good. In general, enthusiasm for
TCP offload seems lukewarm. He asked Jeff if it would take off. Jeff
responded that he does believe that it will take off, mainly for
commercial reasons. Having only one fabric to manage for datacenters
seems good. Peter said that there appears to be a contradiction somewhere
here. Earlier on, we wanted to move things to the software level, and
now attempts are being made to move them to hardware. Jeff said that
switches are clearly a larger investment than NICs, so commoditizing
the NICs would be good.
Geoff Voelker asked Charlie about the benefits of his approach depending
on the number of shared source routes. Did he have a sense of the minimum
shared source routes needed for DPSR to work? Charlie answered
that so far, the sharing was small, but even in this scenario, DPSR does
not do worse than DSR, so it seems overall to be a gain over DSR.
4:45 p.m.–5:35 p.m.: Distributed Systems (Chair: Jeff Chase; scribe: Amit Purohit)
* Scheduling and Simulation: How to Upgrade Distributed Systems
Sameer Ajmani and Barbara Liskov, MIT Laboratory for Computer Science; Liuba Shrira, Brandeis University
In the first talk of the session Sameer presented a solution for
upgrading distributed software automatically with minimal service
disruption. He presented a technique which uses a combination of
centralized and distributed components. The infrastructure consists
of three main components. Scheduling functions tell the node when
to upgrade. Simulation objects enable communication among nodes
running different versions. Transform functions change a node's
persistent state from one version to a higher one.
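The three components can be sketched as follows. This is a hypothetical illustration; the names, signatures, and upgrade policy are invented and are not the paper's actual API.

```python
# Hypothetical sketch of the three upgrade-infrastructure components;
# names and signatures are illustrative, not the paper's actual API.

def scheduling_function(node_id: int, slot: int) -> bool:
    """Tell a node whether it is its turn to upgrade (staggered by id)."""
    return node_id % 4 == slot

class SimulationObject:
    """Let an upgraded (v2) node answer calls from v1 nodes while the
    system runs mixed versions."""
    def __init__(self, v2_state):
        self.v2_state = v2_state
    def handle_v1_call(self, key):
        value, _timestamp = self.v2_state[key]  # v1 expects a bare value
        return value

def transform_function(v1_state: dict) -> dict:
    """Convert a node's persistent state from v1 to v2 (v2 pairs each
    value with a timestamp in this toy example)."""
    return {k: (v, 0) for k, v in v1_state.items()}
```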
Timothy Roscoe pointed out that this method is expensive. He
claimed that instead of upgrading each node and preserving the state,
it is possible to ignore the state. Sameer agreed that one can ignore
the state and allow the system to recover it, but said that
this reduces the rate at which upgrades can happen and may require
additional state transfer over the network. A. Veitch asked Sameer to
compare his work with the existing work on on-line upgrades. Sameer
said they are mainly targeting distributed systems.
* Development Tools for Distributed Applications
Mukesh Agrawal and Srinivasan Seshan, Carnegie Mellon University
In the second talk of the session Mukesh explained the motivation
for his current research. He claimed that there are few distributed
applications, not because they are not useful, but
because they are very hard to implement. He cited
routing table upgrading for distributed applications as one of the
hard problems. He mentioned the ns-2 simulator as a tool that helps
compare design choices, and DHTs as building blocks being developed to
help in implementing distributed systems. Mukesh then pointed out
some inherent flaws in the current approaches: existing research
concentrates on the initial stages of the application life-cycle,
whereas his work mainly addresses the later life-cycle stages.
George Candea asked whether there is any good reason to build a distributed
application instead of a centralized one. Mukesh answered that control over
the application is the key. George argued it is not safe to give up
control, to which Mukesh replied that a single trusted entity is not
always available.
* Virtual Appliances in the Collective: A Road to Hassle-free Computing
Constantine Sapuntzakis and Monica S. Lam, Stanford University
In this talk Constantine envisioned a computing utility that runs
not only Internet services but highly interactive applications that
are commonly run on desktop computers. Patches arrive at a high rate.
Multiple applications share a lot of elements, such as the OS and shared
libraries. Hence, an upgrade to one application can break another.
He argued that it is possible to borrow ideas from appliances to
improve the manageability and usability of computers. Then he
described the architecture of their framework. Appliances are
maintained by makers without user involvement. Cheaper hardware
made virtualization an attractive option.
Jay Lepreau was concerned about the deployment of this idea in wide
use. He was skeptical about whether people would give up their
autonomy. He said it may be advantageous to separate the
application into sets but even that is not problem-free.
Constantine replied that the appliance could be fairly small, making
its deployment easier and more reliable. David Oppenheimer asked how
this would differ from application service providers. Constantine said
it is more flexible than centralized techniques. Andrew Hume joined
the discussion at the end, saying it is very important that the
interface be extremely simple; this is a real challenge, since
appliances are dumb. Constantine replied that they offer a simple user
interface rather than a complex mechanism, so it is hoped that it will
be easy and fun to use.
* POST: A Secure, Resilient, Cooperative Messaging System
Alan Mislove, Ansley Post, Charles Reis, Paul Willmann, Peter Druschel, and Dan S. Wallach, Rice University; Xavier Bonnaire, Pierre Sens, Jean-Michel Busca, and Luciana Arantes-Bezerra, Université Paris VI
The author presented a p2p system that interoperates seamlessly with a
wide range of collaborative services by providing one serverless
platform. It provides three basic services to applications:
secure single-copy message storage; an event notification
service, which alerts users to events such as the availability of
a message; and single-writer logs, which allow
applications to maintain metadata. The authors claim that these
features are sufficient to support a variety of collaborative
applications.
Mike Jones argued that application-level multicast will achieve the
same goals. The author said the main motivation is to prevent spam.
Armando Fox asked what the motivation was behind choosing email as
the primary application. The author replied that email has great locality
properties. If such a demanding application can be changed, then POST
can be extended to more complex applications.
In the panel session, M. Swift asked Constantine about the cost of
complexity. He also noted that device drivers talk to hardware, so they
couldn't be virtualized; moreover, if they are buggy, they can
crash the machine. Constantine said that if, in the future, device
drivers are written in user land, the problem could be solved; but if
the application crashes, not much can be done.
Eric Brewer stated the view that the hard part is sharing information:
having separate virtual appliances for everything only works if they
don't share any information, which means that the user must replicate
all "shared" information by hand (as we do now with real appliances,
e.g., setting the clock). The path of safe sharing leads you to shared
segments as in Multics, including layers (rings) and call gates for
protected calls into shared resources. The author replied that Multics
has some problems, which they are planning to address as well.
8:30 p.m.–10:30 p.m.: Outrageous Opinions Session (Chair: Jeff Mogul; scribes: David Oppenheimer and Matt Welsh)
Scribe Notes by David Oppenheimer
The Outrageous Opinions session featured 13 speakers.
Val Henson kicked off the session by suggesting that end-to-end
checksums replace component-to-component checksums. Andrew Hume
disagreed with this philosophy, pointing out that
component-to-component checks are vital for pointing the finger at the
component that has caused the problem.
Matt Welsh presented "a brief history of computing systems research,"
in which he urged computer scientists to think about how their
research can help to address social problems. He pointed out that
computer scientists have always worked on improving life for computer
scientists, focusing on improving their own day-to-day tasks. He
suggested that computer scientists should think more about social
problems rather than "How can I download porn and pirate music more
efficiently?" In particular, he cited education and health care,
particularly outside the United States and Europe, as social
problems that computer scientists could help tackle, for example by
empowering local government and remote communities. Specific
technologies he cited were peer-to-peer wireless networks for emergency
and disaster response systems; censorship-proof electronic
publishing of dissenting political opinions; sensor networks for
environmental monitoring and precision agriculture; inexpensive
medical diagnostics; and highly reliable power environments.
Mendel Rosenblum talked about the similarities between microkernels
and virtual machine monitors (VMMs) for running multiple untrusted
environments on top of an operating system. Both provide isolation,
resource management, and a small operating environment with simple
interfaces, leading to high assurance. The key difference is what runs
on top of each: for a VMM you don't need to port applications to run
on it, whereas for a microkernel you do. He pointed out that despite
this advantage for VMMs, academics and industry researchers are often
interested in microkernels, because by not leveraging existing
software, microkernels provide academics an opportunity to write lots
of papers and industry an opportunity to define APIs, leading to an
industry control point.
Mike Chen presented "two easy techniques to improve MTTR": redirect
users' attention to what's still available when something becomes
unavailable, and blame failures on the user (e.g., "Did you forget
your password? Please try again."). He pointed out that by tricking
the user, perceived
availability can easily be increased.
Timothy Roscoe presented a short talk entitled "Scalability Considered
Harmful." The core of his argument was a rejection of the traditional
dogma that maximum scalability is always a good thing. Instead, he
argued that systems should scale as far as required, but no
further. He argued that extra scalability is bad because it
facilitates abuse and concentrates power. Moreover, he claimed that
non-scalable systems are always more efficient than scalable
systems. As an example, he pointed out that email scales globally, but
it doesn't need to; the real social communication graph is cliquey,
but because email scales beyond the group of people with whom a user
really wants to communicate, we have the problem of spam. A second
example he cited was one he credited to Chris Overton: an unscalable
mechanism that some societies use to solve trust problems is reputation
and referral, which works, while a scalable mechanism that some
societies use to solve trust problems is lawyers, which doesn't work. A
third example Roscoe gave was that of email whitelists versus
blacklists: whitelists don't scale but are highly effective at preventing spam,
while blacklists do scale but are much less effective. A final example
he cited was Google, a highly scalable, indispensable system. He
argued that the downsides of Google's success are that they have a lot
of power, it's not possible for someone else to deploy a competing
search engine anymore, and paper-based publishing and libraries have
declined. The non-scalable alternative he argued for instead of Google
was traditional libraries. Roscoe concluded by summarizing that
scalable systems benefit the system owners to the detriment of the
system users. He argued that by designing scalable systems, we as
researchers collude with the concentration of power in the hands of
the few. He therefore suggested that researchers stop designing
scalable systems.
Dan Wallach argued that the logical extension of the virtual machine
is the process.
Eric Brewer issued a call for IT research for developing regions. He
made 5 claims: (1) there is a need for a direct attack by developing
new devices rather than giving developing regions hand-me-down
machines, which are a bad fit on cost, power, user knowledge,
administration, and user literacy; (2) there is a need for
infrastructure to support thousands of projects which are currently
not sharing any infrastructure; (3) building IT for developing regions
is economically viable, in that there is a market of 4 billion people,
but IT must create income for its users (e.g., by offering a franchise
model akin to that used to provide cell phone service to rural
Bangladesh) because users do not have disposable income; (4) the time
is right with the availability of $5 802.11 chipsets, systems on a
chip, low-power designs, and solar panels; and (5) this work can have
a big impact by reducing poverty and disease and improving the
environment, by providing developing regions with a source of income
that they can in turn use to improve their standard of living,
stability, and security. In terms of concrete systems research areas,
Brewer suggested wireless networks, spoken user interfaces for regions
with low literacy rates, screens, proxies, and sensors. He indicated
that he is starting to work on the problem of IT research for
developing countries, and he urged the systems community to consider
it to be a part of their research agenda.
George Candea discussed why he believes that wide-area decentralized
systems such as peer-to-peer networks are "a good idea whose time has
passed." He argued that such systems are hard to build, test, debug,
deploy, and manage, and that they have little economic incentive
beyond "lack-of-accountability" applications. He suggested that the
principles learned from building wide-area distributed systems,
e.g., strong componentization, using open protocols, loose coupling,
reducing correlated faults through diversity, and writing components
while keeping emergent behaviors in mind should be used to build
highly dependable "distributed" systems within the datacenter. He
summarized by saying, "Don't distribute centralizable apps into the
wide-area network (WAN); take the good ideas from distributed systems
and apply them in the system-area network (SAN)."
Geoff Voelker parodied the evolution of data prediction schemes from
value prediction (cache previously loaded value and speculatively
execute on next load) to "result prediction" (don't bother running the
program, just guess the results). He presented several possible
applications of result prediction: 5% of the time have gzip just exit
with "not in gzip format"; occasionally have the Web browser return
random data from its cache, forcing the user to hit reload;
occasionally tell the user that he or she has new mail with the
"African bank spam"; andthe killer application for result
predictionhave SETI@HOME predict no ETs.
Emmett Witchel argued for a "thin malloc" that does not spread data
around the address space, but instead keeps it compacted into the
smallest memory region possible.
Ethan Miller proposed SCUBA: Scalable Computing for Underwater Biota
Assessment, in which 802.11 networks would be deployed on coral reefs.
Sameer Ajmani suggested that systems researchers consider work in
computational biology: "We help biologists, then they help thousands
of people through pharmaceuticals, genetics, etc. It's easier than
sending computers to Africa." As specific examples of computational
biology problems that can directly apply well-known computer science
algorithms, he cited string alignment (dynamic programming) and
database searches for genes (hash table with 2-tuples and
3-tuples). Andrew Hume added that the National Institutes of Health
also has more money than does the National Science Foundation.
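The string-alignment example Sameer cited is the classic dynamic-programming table. The sketch below is a generic unit-cost alignment (edit distance), not code from the talk; sequence-alignment scoring schemes in practice use weighted match/mismatch/gap costs.

```python
def alignment_distance(a: str, b: str) -> int:
    """Edit distance via dynamic programming -- the same table-filling
    pattern used for biological sequence alignment."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match/substitute
        prev = curr
    return prev[-1]

# alignment_distance("GATTACA", "GCATGCU") -> 4
```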
Armando Fox called for a bet as to whether, before the next HotOS, a
peer-to-peer application would exist that would make more sense
(economically and technically) to deploy as a peer-to-peer system than
as a centralized service. Seven people in the audience said yes, 17 said
no. Armando offered to bet someone (Armando taking the "no" side) for
a case of alcohol valued less than the conference registration fee in
2005, but there were no takers.
Scribe Notes by Matt Welsh
In classic HotOS tradition, the Outrageous Opinions session consisted of
a stream of short presentations, some serious, some mundane, some
hilarious. Jeff Mogul tempted potential participants with a "local
delicacy" with a "value of up to $1000" concealed in a paper bag.
The level of disappointment in the room was palpable when it was
revealed to be a can of Spam.
Val Henson argued against the use of checksums at all levels in a
storage system versus end-to-end checksums at the application. Andrew
Hume countered that it's good to have accountability at each level when
something goes wrong in the system.
Matt Welsh presented a brief but accurate history of systems research
goals, culminating in the current trend towards supporting faster
downloads of porn and pirating of MP3 files. He argued that systems
researchers should be focusing on social problems, such as education
and healthcare in developing countries.
Mendel Rosenblum contrasted two approaches to running two untrusted
environments on the same hardware: microkernels and VMMs. He wondered
aloud who would want to build a microkernel that doesn't
leverage existing software, then answered his own question with two
possibilities: academics who want to publish papers, or corporate
vendors who want to exercise a "control point." Mike Jones asked who
these vendors were that were pushing new microkernels, and Mendel
responded that Microsoft's Palladium system would be built this way.
Much more was said on Palladium in Tuesday's evening session.
Mike Chen argued the opposite of Val Henson's talk, claiming that one
should always use compare by hash, since in compare-by-data the error
rate is proportional to the data size and is much larger than the hash
collision rate. His real talk, though, was about the Berkeley/Stanford
Recovery Oriented Computing project, which aims to improve MTTR in
systems. He raised two ways to do this: first, when something fails,
redirect the user's attention to something else ("read our NEW privacy
policy"), or simply blame it on the user. This buys you tens of seconds
to several minutes of wait time.
Timothy Roscoe railed against the construction of scalable systems,
claiming that systems should only scale as far as needed and no further.
For example, email does not need to scale globally; after all, who
needs to send an email message to everybody? And whitelisting your email
to reduce spam doesn't scale, but it works. Mothy used Google as an
example of a system where extra scalability only concentrates power.
He cited libraries as a non-scalable alternative to Google.
Dan Wallach talked about all of the recent work on virtualizing
resources and getting multiple virtual machines to share resources, such
as shared libraries or the OS kernel. He proposed an alternative to
these approaches, a radical concept called a "process."
Eric Brewer brought the proceedings back to a more serious note with a
call for IT research for developing regions, claiming that we need a
direct approach rather than hand-me-downs from previous technology.
For these applications we need systems with very low cost and power that
make few assumptions about user knowledge or literacy. He pointed out that
there are thousands of ongoing IT projects in developing countries but
with little sharing. Developing an infrastructure that all these
projects could use would be very helpful. Lots of research
directions here, including very low-power and low-cost wireless
communications, new speech-based user interfaces, and network proxies.
George Candea argued that WAN P2P distributed systems are a good idea
whose time has passed. They are hard to build, test, debug, deploy,
and manage, and they offer little economic incentive. He argued that the good
distributed systems ideas should be used within the data center instead.
Geoff Voelker presented a novel idea based on the notion of value
prediction from hardware architecture: "result prediction."
Rather than running the program, we can simply guess the results,
leading to excellent speedup potential! Examples: 5% of the time,
gunzip could spontaneously report, "Not in gzip format." Web browsers could
simply return random stuff from the browser cache. When retrieving new
mail, your inbox could simply predict that it's an African Bank spam.
The killer app for this technique is clear: SETI@Home.
Emmett Witchel asked whether there is a use for anti-optimization.
One application he suggested was to allow users to specify in advance the
amount of resources a computation takes, possibly eliminating covert
channels. This would be accomplished by intentionally adding delay
loops to code to use all available CPU and by spreading allocated
memory all over the address space.
More important, Emmett suggested that the last session of the
conference be moved forward to replace the breakout session, so that
those who had early flights on Wednesday could attend. Most agreed.
Ethan Miller expressed concern about the environment, particularly the
disappearance of many
species of fish, since we're eating them all. He proposed a sensor-based
approach to tracking fish populations, called Scalable Computing for
Underwater Biota Assessment. This project can't use 802.11, as
it would boil the fish. The main value of this project would be the
ability to check the box on the grant proposal form explaining that SCUBA diving
is being used in the project.
Sameer Ajmani upheld Matt and Eric's calls for socially relevant
applications, pointing out that computational biology has the potential
to help thousands of people, and there are real systems problems here
(such as managing enormous databases).
And the SPAM award goes to Geoff Voelker!
Tuesday, May 20, 2003
9:00 a.m.–11:15 a.m.: When Things Go Wrong (Chair: Dan Wallach; scribe: Amit Purohit)
* Crash-Only Software
George Candea and Armando Fox, Stanford University
The session began with a talk from George Candea in which he explained
how to build Internet services that can be safely and cheaply
recovered by crash-rebooting minimal subsets of components. He
pointed out that most downtime-causing bugs are transient and
intermittent, and it is not feasible to guarantee that an application
can never crash. Even with recovery-safe applications, recovery can take too
long. Crash-only software achieves crash safety and fast recovery by
putting all important nonvolatile state outside the application
components, into crash-only state stores. For
components to be crash-only, they must be decoupled from
each other, from the resources they use, and from the requests they
process. He conceded that steady-state performance of crash-only
systems may suffer, but argued that (a) the overall goal is to
maximize the number of requests successfully served, not to serve some
quickly and then be unavailable for a long time, and (b) that techniques
will evolve that will improve performance of crash-only systems, the
way compilers improved the performance of programs written in high-level languages.
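The crash-only pattern described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: all durable state lives in an external state store, so "recovery" is simply restarting the component.

```python
# Hypothetical sketch of the crash-only pattern: all important state
# lives in an external store, so recovery is just a restart.

class StateStore:
    """Stand-in for a crash-only state store (e.g., a database or DHT)."""
    def __init__(self):
        self._data = {}
    def put(self, k, v):
        self._data[k] = v
    def get(self, k):
        return self._data.get(k)

class Component:
    def __init__(self, store: StateStore):
        self.store = store   # no durable state kept in the component
        self.scratch = {}    # volatile working data, safe to lose

    def handle(self, key, value):
        self.scratch[key] = value   # transient
        self.store.put(key, value)  # durable state goes outside

def crash_restart(store: StateStore) -> Component:
    """Recovery == reboot: a fresh component sees all durable state."""
    return Component(store)
```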
Andrew Hume asked how to set timeouts for crashing the systems
deliberately. George said that timeouts are set in
application-specific ways, at the entrance to the system (e.g., in the
Web tier, based on URL). Certain known thresholds can be exploited
(e.g., respond within the TCP timeout, or before the HCI-established
8-second distraction threshold for humans). Margo Seltzer asked George
to compare their work with that of Stonebraker. George replied that
Stonebraker's pioneering ideas were geared mostly toward data
management, while their work extends this to recovery management of
Internet services. Somebody from Colorado was concerned about
the possibility of DoS attacks by maliciously exploiting this technique.
George answered that a stall proxy holds up new requests during
recovery; this form of admission control allows the system to quiesce.
* The Phoenix Recovery System: Rebuilding from the Ashes of an Internet Catastrophe
Flavio Junqueira, Ranjita Bhagwan, Keith Marzullo, Stefan Savage, and Geoffrey M. Voelker, University of California, San Diego
This presentation explained the design of a cooperative, distributed,
remote backup system called Phoenix. Operating systems and user
applications have vulnerabilities. Large numbers of hosts may share
vulnerabilities, resulting in major outbreaks. Phoenix uses a strategy
based on attributes and cores. By replicating data on a set of hosts
with different values for each attribute, it is possible to reduce
the probability of error to a negligible number. In the Phoenix system
there is no single point of failure; coping with a large number of
requests is achieved by exponential backoff.
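The attribute-and-core idea can be sketched as a greedy selection of replica hosts with differing attribute values, so that one attribute-specific exploit cannot destroy every copy. The attribute set and policy below are invented for illustration; Phoenix's actual core-selection algorithm is more involved.

```python
def choose_core(hosts, k=3):
    """Greedily pick up to k replica hosts that all differ in the 'os'
    attribute, so a single OS-specific exploit cannot reach every copy.
    Illustrative only; not Phoenix's actual algorithm."""
    core, seen_os = [], set()
    for name, attrs in hosts.items():
        if attrs["os"] not in seen_os:
            core.append(name)
            seen_os.add(attrs["os"])
        if len(core) == k:
            break
    return core

hosts = {
    "h1": {"os": "linux"}, "h2": {"os": "windows"},
    "h3": {"os": "linux"}, "h4": {"os": "bsd"},
}
```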
Somebody from MIT was concerned about Phoenix's flexibility. The answer
was that the system offers sufficient flexibility to
allow different failures. Popescu from Vrije University was
concerned about the backup storage, as a Linux machine will always
store 10 times more than a Windows machine.
* Using Runtime Paths for Macroanalysis
Mike Chen, University of California, Berkeley; Emre Kiciman, Stanford University; Anthony Accardi, Tellme Networks; Armando Fox, Stanford University; Eric Brewer, University of California, Berkeley
The authors emphasized the
benefits of macroanalysis, namely latency profiling, failure
handling, and failure detection and diagnosis. They introduced the concept of runtime
path analysis, in which paths are traced through software
components and then aggregated to understand global system behavior
via statistical inference. Runtime paths are also used for failure
handling and for diagnosing problems, all in an application-generic
fashion. The group explained that their work could be extended to
p2p message paths, event-driven systems, forks, and joins.
To Margo's query about the future of this approach, the authors
answered that similar techniques are already being used by designers
of big projects, which promises stability.
* Magpie: Online Modelling and Performance-aware Systems
Paul Barham, Rebecca Isaacs, Richard Mortier, and Dushyanth Narayanan, Microsoft Research Ltd, Cambridge, UK
In this work, the author explained a modeling service that collates
traces from multiple machines, extracts request-specific audit
trails, and constructs probabilistic models of request behavior.
He mentioned that a workload description and a hardware model could be
used to predict performance. It is possible to augment the
system by adding feedback from past models. To understand the
behavior under all scenarios and use the knowledge,
behavioral clustering is employed.
A researcher from MIT asked whether the Markov model could be useful. Paul
said that some of the Markov solutions could prove useful. Somebody from
Washington asked Paul how to tell that an unusual activity is a
fault and not a genuine activity. Paul noted that detecting unusual
activity is the basis of intrusion detection systems.
* Using Computers to Diagnose Computer Problems
Joshua A. Redstone, Michael M. Swift, and Brian N. Bershad, University of Washington
Joshua talked about building a global-scale automated problem
diagnosis system that captures the workflow of
system diagnosis and repair. The computer generates search terms and
locates problem reports. It stores problem reports in a canonical
global database. When a problem occurs, the computer detects symptoms
and searches the problem database. Joshua argues that more
effort building the database could be a good trade-off, resulting in cheaper diagnoses.
He said the main challenge lies in creating a database
structure so that searching meets user expectations.
Mike Jones commented that Windows XP has an automated bug-reporting
facility. Joshua said that the XP facility only reports application
crashes. Timothy Roscoe said that the main challenge lies in the
development of a formal language. Joshua addressed that
concern by saying that the structure is simple. He agreed that it
would be difficult to design the system if problems were disjoint,
but argued that that would not always be the case.
Val Henson said Google works perfectly fine for identifying error
messages. Joshua argued that his system can give more information,
such as the version, which will be more helpful in solving the problem.
In the general discussion Armando Fox asked a question regarding the
deployment-of-diversity problem. Another question was how this
system impacts the time needed to recover. The author said he didn't have the
final answer, but it should be feasible without much trouble. Jeff
Mogul remained concerned about time to recovery. The author said there
aren't critical recovery-time requirements, as it is more important
to get the data back. He also pointed out that it is possible
to make it faster by adding redundancy. But it is a trade-off
between storage and time.
Somebody from MIT asked about the deployment of Phoenix. The author said it
seems reasonable if there are enough hosts running diversified OSes.
Rive Peter compared the Phoenix approach with the randomized replica
approach, and questioned whether, with a small number of attribute values,
Phoenix really performs better than replicas. The author said that in the randomized
replica approach, it is very hard to implement the recovery solution.
Ranjita from UCSD asked Joshua what
would be the motivation for people to fill the database. Joshua
said organizations tend to use external support, so people would find
the increased ease of resolving their problem worth the effort. Mike Jones
proposed an alternative scheme which asks the user for a request
and if it is solved, then posts the solution. Joshua was not sure how
to maintain the database. Mike said you could save the entire request
and response sequence and then process it offline.
Jay Lepreau asked George
(Crash-only software) about his project's impact on throughput. George
said it's likely to be lower, because of the inherently distributed
approach (loose coupling, explicit communication, etc.) to building
the system, but with time we will learn how to improve performance, citing as
an example going from assembly language to high-level
languages. High-level languages enabled a qualitative jump in the
types of software written, even if the actual writing of code was slower
than in assembly. He also argued that "goodput" (throughput of
successfully handled requests) is more important than absolute
throughput. There was a follow-up question regarding the
throughput of applications that do not fail. George said
that it's true nobody needs to recover applications that never fail,
but that such apps do not exist.
11:45 a.m.–12:30 p.m.: Performance Optimization (Chair: Jeff Mogul; scribe: Ranjita Bhagwan)
* Using Performance Reflection in Systems Software
Robert Fowler and Alan Cox, Rice University; Sameh Elnikety and Willy Zwaenepoel, EPFL
The main idea of this work is to use application-independent measures
such as hardware instrumentation mechanisms and general system
statistics to adapt system behavior. Performance indicators such as
TLB misses and cache misses can be used to measure overhead, while measures of productivity include
bytes sent to a network card, as well as flop rate. Productivity and overhead are used to determine whether the
system needs to be tuned. Sameh Elnikety showed how server throttling of MySQL under the TPC-W workload succeeded
in keeping throughput at the maximum level while load increased,
whereas, without reflection, throughput dropped at higher loads.
There was no Q&A.
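The reflection loop can be sketched as a simple policy that compares a productivity measure against an overhead measure and throttles admission when overhead dominates. The indicators and threshold below are invented for illustration; the paper's actual metrics (TLB misses, flop rate) and policy may differ.

```python
# Illustrative sketch of performance reflection; indicators and
# threshold are assumptions, not the paper's actual policy.

def should_throttle(bytes_sent: int, cache_misses: int,
                    min_ratio: float = 100.0) -> bool:
    """Throttle admission of new work when productivity (bytes sent to
    the NIC) per unit of overhead (cache misses) drops below a threshold."""
    if cache_misses == 0:
        return False  # no measured overhead: don't throttle
    return bytes_sent / cache_misses < min_ratio
```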
* Cassyopia: Compiler Assisted System Optimization
Mohan Rajagopalan and Saumya K. Debray, University of Arizona; Matti A. Hiltunen and Richard D. Schlichting, AT&T Labs–Research
Cassyopia combines program analysis and OS
design to do performance optimizations. Whereas the compiler offers a local
perspective on optimizations, the OS provides a more general view. Cassyopia tries to bring about a symbiosis between the OS
and the compiler. An example of this symbiosis is system call optimization. Cassyopia
profiles system call sequences, then clusters them using compiler
techniques. The clustered system calls are called multi-calls.
Mohan described preliminary results that showed that
significant performance improvement can be obtained using multi-calls.
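The multi-call idea can be illustrated with a toy sketch (the names
and interfaces below are invented, not Cassyopia's API): a profiled
sequence of system calls is executed as one batch, paying a single
user/kernel boundary crossing instead of one per call.

```python
# Illustrative multi-call sketch: cluster a sequence of system calls so
# a repeated pattern costs one simulated boundary crossing, not several.

def multi_call(calls):
    """Execute a list of (function, args) pairs in one simulated kernel entry."""
    crossings = 1                      # one boundary crossing for the batch
    results = [fn(*args) for fn, args in calls]
    return crossings, results

def fake_read(n):   return b"x" * n    # stand-ins for real system calls
def fake_write(b):  return len(b)

# Unbatched, each call would cross the boundary once (3 crossings here).
crossings, results = multi_call(
    [(fake_read, (4,)), (fake_write, (b"ab",)), (fake_read, (2,))])
print(crossings, results)  # 1 [b'xxxx', 2, b'xx']
```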
Margo Seltzer asked how one would do error-checking with clustered system
calls. How would you know which system call generated the error?
Mohan said that they have a means of identifying which call in a
cluster produced an error.
Constantine Sapuntzakis asked for an example of a case in which multi-calls
would be useful. Mohan said that Web servers
employ a repetitive system call behavior. Another example would
be database systems.
* Cosy: Develop in User-Land, Run in Kernel-Mode
Amit Purohit, Charles P. Wright, Joseph Spadavecchia, and Erez Zadok, Stony Brook University
User applications are allowed only restricted access to kernel
resources, which causes many crossings of the user-kernel boundary. To
reduce these crossings, Amit Purohit proposed a compound system call
mechanism, named Cosy, with which one can execute a user-level code
segment in the kernel. The authors
have a modified version of gcc that uses Cosy. Kernel safety is ensured
by limiting kernel execution time; x86 segmentation and sandboxing
techniques can also be used. The performance benefits of Cosy have been
reported at 20–80%.
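The safety mechanism of bounding kernel execution time can be sketched
as follows; the interpreter interface and step budget are assumptions
for illustration, not Cosy's actual design.

```python
# Loose sketch of Cosy's safety idea: run a user-supplied code segment
# "in the kernel," but bound its execution, aborting past a step budget.

def run_in_kernel(ops, max_steps=100):
    """Interpret a list of thunks, enforcing a kernel execution-time limit."""
    results = []
    for step, op in enumerate(ops):
        if step >= max_steps:
            raise RuntimeError("kernel time budget exceeded; segment aborted")
        results.append(op())
    return results

print(run_in_kernel([lambda: 1 + 1, lambda: 2 * 3], max_steps=10))  # [2, 6]
try:
    run_in_kernel([(lambda: 0)] * 20, max_steps=10)  # too much work
except RuntimeError:
    print("aborted")
```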
Matt Welsh said that there is enough stuff in the OS already that adding more would
not gain much. No performance improvement is justified by the loss of
reliability that this technique would introduce. Amit said
that, at some level, the OS has to be trusted. If it isn't trustworthy,
there are lots
of other problems that need to be dealt with.
Andrew Hume said that people from Sun Microsystems have said that code
executes best in the kernel. Matt Welsh reiterated his statement that
reliability was still more important than performance.
Constantine Sapuntzakis asked where the performance improvement
came from. Amit said that the savings came from avoiding
context switches and from read and write system calls.
Constantine asked if the authors had compared their solution to
zero-copy solutions. Amit said that zero-copy solutions are applicable
only to certain applications; the Cosy solution is more general.
Ethan Miller asked what operations this is useful for, other
than zero-copy. Amit said that zero-copy is the primary use right now, since the
project is still in its initial stages. They plan to look at I/O next. Andrew Hume said that context switches were
important too. Amit said that using compiler techniques and C,
it is possible to exploit new avenues for performance improvement.
Mike Swift said that combining the last two papers might be one way to proceed:
take the compiler techniques from Cassyopia to
use for kernel execution. Mohan
clarified that they do not plan to move any user-level
code into the kernel, since he believes that interpretation in the
kernel has high overhead. They primarily wanted to reduce the boundary
crossing cost, so the two ideologies are not the same. Amit said that
interpretation in the kernel is a bottleneck, and so they use a small
interpreter. Checking pointers would cost a lot, and so they are using
TLBs. This has some overhead, but it's a one-time check. Mohan brought
up the point that the aim of Cassyopia is to apply optimizations that
are quite obvious, but have not yet been done.
Ethan Miller said that the TLB overhead in Cosy could be
unavoidably high. Amit said that is true, but the savings they are getting
more than make up for the TLB overhead.
Margo Seltzer drove the nail home by saying that what matters
finally is the kernel-user API. All this work on extensible OSes and
moving code across the boundary is probably done because the kernel-user
API needs to be revisited. So let's fix the API. Applause.
Andrew Hume said that for the applications he has looked at, apart
from zero-copy and I/O, there is not much to be gained by putting code
into the kernel. Apart from the stated scenarios, the chances of using
a multi-call are small. Sometimes reassurance outweighs performance
benefits. Mohan said that they are also looking at smaller devices, such
as cell phones. All devices are resource-constrained. In these cases,
there is also the issue of energy savings.
Eric Brewer asked why, for small devices, you would even want a kernel boundary.
Mohan answered that cell phones and iPAQs can now have JVMs running on
them. They are not targeting reprogrammable devices.
Matt Welsh asked why one would care so much about performance on an iPAQ.
Timothy Roscoe said that since there are different kinds of devices,
there should be different OSes for them. And then the question would
be where, if anywhere, to put the kernel boundary on them. Whether there is a generic
answer to that question is still unclear.
1:45 p.m.–3:30 p.m.: Storage 1 (Chair: Armando Fox; scribe: David Oppenheimer)
Why Can't I Find My Files? New Methods for Automating Attribute Assignment
Craig A. N. Soules and Gregory R. Ganger, Carnegie Mellon University
Craig Soules described new approaches to automating attribute
assignment to files, thereby enabling search and organization tools
that leverage attribute-based names. He advocates context analysis to
augment existing schemes based on user input or content analysis. Context
analysis uses information about system state when the user creates and
accesses files, using that state to assign attributes to the
files. This is useful because context may be related to the
content of the file and may be what a user remembers when searching
for a file. The Google search engine has proven the usefulness of
context analysis; it chooses attributes for a linked site by examining the
text associated with the link, and it considers user actions after a search
to decide the user's original intent in the search. However, the kind
of information Google's context analysis relies on cannot be applied
directly to filesystems. In particular, information such as links
between pages does not exist in traditional filesystems, and
individual filesystems do not have enough users or enough "hot
documents" to make Google-like context statistics useful.
Soules described access-based context analysis, which exploits
information about system state when a user accesses
a file, and
inter-file context analysis, which propagates attributes among related
files. The former relies on application assistance or existing user
input (e.g., filenames), and the latter relies on observing temporal
user access patterns and content similarities and differences between
potentially related files and versions of the same file. Based on a
trace analysis of usage of a single graduate student's home directory
tree over a one-month period, Soules concluded that a combination of
the techniques he proposes could be useful for automatically assigning
attributes. For example, a Web browser can relate search terms to the
document the user ultimately downloads as a result of the search,
files created and accessed in a single text-editor session can be
considered related, and attributes about documents used as input to a
distiller such as LaTeX or an image manipulator program can be
distilled for attachment as attributes of the output file. Soules also
found that examining temporal relationships between file accesses in
the trace successfully grouped many related files.
Soules stated that he will be investigating larger user
studies, mechanisms for storing attribute mappings, appropriate user
interfaces, and how to identify and take advantage of user context
switches, i.e., a user's moving from using one program to another.
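Inter-file context analysis via temporal access patterns might be
sketched as below; the time window and the rule for propagating
attributes are assumptions, not the paper's exact algorithm.

```python
# Hedged sketch of inter-file context analysis: files accessed close
# together in time are treated as related and share attributes.

def propagate_attributes(accesses, attrs, window=60.0):
    """accesses: list of (timestamp, filename).
    attrs: filename -> set of attribute tags.
    Returns enriched attrs where temporally adjacent files share tags."""
    enriched = {f: set(tags) for f, tags in attrs.items()}
    ordered = sorted(accesses)
    for (t1, f1), (t2, f2) in zip(ordered, ordered[1:]):
        if t2 - t1 <= window and f1 != f2:
            enriched.setdefault(f1, set()).update(attrs.get(f2, set()))
            enriched.setdefault(f2, set()).update(attrs.get(f1, set()))
    return enriched

attrs = {"paper.tex": {"hotos", "draft"}, "fig1.eps": set()}
out = propagate_attributes([(0.0, "paper.tex"), (10.0, "fig1.eps")], attrs)
print(sorted(out["fig1.eps"]))  # ['draft', 'hotos']
```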
During the question and answer period, Timothy Roscoe observed that tools such as the old AltaVista Personal Edition can
index filesystems reasonably effectively. Andrew Hume pointed out that
Soules' system does not address situations in which there are multiple
access hierarchies for the same underlying files: the original
hierarchy is effectively locked in, making it hard to handle multiple
filesystem views simultaneously. Soules responded that the focus of
this paper was how to get the keywords, and that he has not yet
considered how to present the keywords or present multiple
views. Hume also suggested looking into software that produces
high-quality indexes for printed books.
Secure Data Replication over Untrusted Hosts
B.C. Popescu, B. Crispo, and A.S. Tanenbaum, Vrije Universiteit, Amsterdam, The Netherlands
B. C. Popescu described a system architecture that allows arbitrary
queries on data content that is securely replicated on untrusted
hosts. This system represents an improvement over systems based on
state signing, which can support only semi-static data and predefined
queries, and systems based on state machine replication, which require
operations to be replicated across multiple machines. In the authors'
system, every data item is associated with a public/private key pair.
The private key is known only to the content owner, and the public key
is known by every client that uses the data. There are four types of
servers: master servers, which hold copies of content and are run by the
content owner; slave servers, which hold copies of data content but are
not controlled by a content owner and thus are not completely trusted;
clients, which perform read/write operations on content; and an
auditor server, described later. The master servers handle client
write requests and lazily propagate updates to slave servers. Master
servers also elect one of themselves to serve as an auditor, which
performs background checking of computations performed by slaves,
taking corrective action when a slave is found to be acting
maliciously. Slave servers handle client read requests; they may use
stale data to handle requests, but clients are guaranteed that once the
time parameter maxLatency has passed since a write was committed by a
master, no other client will accept a read that is not dependent on
the write. All content in the system is versioned; the content version
of a piece of data is initialized to zero when it is created and is
incremented each time the data item is updated.
The key challenge in building this system is to enable clients to feel
safe having their queries handled by untrusted slave hosts. This is
accomplished probabilistically, by allowing clients at any time to send the same
request to a (trusted) master and (untrusted) slave
and to compare the results. When a slave returns the result of a read,
it attaches a signed "pledge" packet containing a copy of the request,
the content version timestamped by the master, and the secure hash of
the result computed by the slave. If the slave returns an incorrect
answer, the "pledge" packet can be used as proof of the slave's
malfeasance. This probabilistic checking mechanism is augmented by an
auditing mechanism in which, after a client accepts a result from a
slave, it forwards the slave's "pledge" packet to a special auditor
server. The auditor server is a trusted server that does not have a
slave set but just checks the validity of "pledge" packets by
reexecuting the requests and verifying that the secure hash of the
result matches the secure hash in the packet. The auditor is expected
to lag behind when executing write requests, executing writes only
after having audited all the read requests for the content version
preceding the write.
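The pledge-and-audit check can be sketched as follows; the packet
layout, hash choice, and function names are assumptions made for
illustration (signing is omitted).

```python
# Sketch of the "pledge" audit: the auditor re-executes the request
# against trusted state and compares the result's secure hash with the
# hash the slave committed to in its pledge packet.

import hashlib

def make_pledge(request, content_version, result):
    return {"request": request,
            "version": content_version,
            "result_hash": hashlib.sha256(result).hexdigest()}

def audit(pledge, reexecute):
    """Re-run the request; True iff the slave answered honestly."""
    honest_result = reexecute(pledge["request"], pledge["version"])
    return hashlib.sha256(honest_result).hexdigest() == pledge["result_hash"]

store = {("q1", 3): b"answer"}                 # trusted master state
reexec = lambda req, ver: store[(req, ver)]
good = make_pledge("q1", 3, b"answer")         # honest slave
bad = make_pledge("q1", 3, b"forged")          # malicious slave
print(audit(good, reexec), audit(bad, reexec))  # True False
```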
During the question and answer period, Peter Druschel asked what
happens to a slave that has been detected to have cheated. Popescu
replied that the action is application-specific; once the slave has
been taken out of the system, out-of-band means such as a lawsuit can
be employed. Rob von Behren asked why you wouldn't need as many
auditors as masters. Popescu responded that auditors should be more
efficient than masters, because they don't have to communicate back to
clients and they can take advantage of techniques such as query
optimization, batching, and result reuse. Rob von Behren suggested
that the system might be designed to let auditors be untrusted, with
numbers of them used in a Byzantine agreement-type
configuration. Popescu agreed that this would be a stronger
guarantee, but he pointed out that the only weakness in the system as
it stands is that an auditor could collude with a slave.
Palimpsest: Soft-Capacity Storage for Planetary-Scale Services
Timothy Roscoe, Intel Research at Berkeley; Steven Hand, University of Cambridge Computer Laboratory
Timothy Roscoe described Palimpsest, a "soft-capacity storage service
for planetary-scale applications." Palimpsest is designed to serve as
a storage service for ephemeral data from planetary-scale applications
running on a shared hosting platform such as PlanetLab, XenoServers, or
Denali. Examples of the type of data to be stored are static code and
data for services, application logs of various sorts, and ephemeral
system state such as checkpoints. Despite the temporary nature of the
data, it must be highly available during its desired lifetime, thus
making single-node local disk storage unsuitable. Traditional
filesystems such as NFS and CIFS provide facilities unnecessary for
planetary-scale applications and don't meet service provider
requirements such as space management, billing, and security
mechanisms that allow users to store their data on a shared
infrastructure without having to trust the infrastructure
provider. Palimpsest aims to provide high data availability for
limited periods of time, data protection and resistance to Denial of
Service attacks, flexible cost/reliability/performance tradeoffs,
charging mechanisms that make sense for service providers, capacity
planning, and simplicity of operation and billing. To achieve these
goals it uses soft capacity, congestion-based pricing, and automatic
expiration of stored data.
To write a file, a Palimpsest client erasure-codes the file, encrypts
each resulting fragment, and sends each encrypted fragment to a
fixed-length FIFO queue at the Distributed Hash Table node
corresponding to the hash of the concatenation of the filename and the
fragment identifier. To retrieve a file, a client generates a
sufficient number of fragment IDs, requests them from the block
stores, waits until a sufficient number of the requested fragments have been
returned, decrypts and verifies them, and recreates the original
file. Because the queues are fixed-length, all files stored in
Palimpsest are guaranteed to disappear eventually. To keep a file alive
the client periodically refreshes it, and to delete the file the
client simply abandons it.
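A much-simplified model of this write/read path is sketched below:
there is no real erasure coding or encryption, and the hashing and
queue sizes are illustrative only, but it shows how fixed-length FIFO
queues make unrefreshed files disappear.

```python
# Simplified Palimpsest sketch: fragments land in fixed-length FIFO
# queues chosen by hashing filename+fragment id, so old, unrefreshed
# fragments eventually fall off the end of each queue.

import hashlib
from collections import deque

NODES = 4
QUEUE_LEN = 3
queues = [deque(maxlen=QUEUE_LEN) for _ in range(NODES)]  # oldest drop off

def node_for(filename, frag_id):
    h = hashlib.sha256(f"{filename}:{frag_id}".encode()).digest()
    return h[0] % NODES

def write_file(filename, fragments):
    for i, frag in enumerate(fragments):
        queues[node_for(filename, i)].append((filename, i, frag))

def read_file(filename, nfrags):
    found = {}
    for i in range(nfrags):
        for name, fid, frag in queues[node_for(filename, i)]:
            if name == filename and fid == i:
                found[i] = frag
    return b"".join(found[i] for i in sorted(found))

write_file("log", [b"ab", b"cd"])
print(read_file("log", 2))  # b'abcd'
```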
The key to predictable file lifetimes is
the time constant (T) associated with each block store. T measures
how long it takes a fragment to reach the end of the queue and
disappear. Clients piggyback requests for information about T on read
and write requests, and block stores provide a recent estimate of T by
piggybacking on responses. Palimpsest providers charge per
(fixed-length) write transaction. Clients can use information about
each block store's T value and fee per write transaction to flexibly
trade off data longevity, availability, retrieval latency, and
robustness. Clients pay providers using anonymous cash
transactions. DoS attacks are discouraged by charging
for writes. Providers can perform traffic engineering by advertising
values of T that deviate from the true value. Congestion pricing is
used to encourage users to attain an efficient write rate.
During the question and answer period, David Wetherall asked about
the possibility of providing a callback to clients when a block they
have stored is about to disappear. Roscoe responded that this would add the
complexity of having to know about the blocks that are being
stored. Sameer Ajmani observed that a load spike might change T
significantly. Roscoe agreed; it is up to clients to factor in their
belief as to how much the time constant will change when they store
their data. Ethan Miller observed that for enough money, a malicious
hacker could make people's data disappear. Roscoe replied that one way
around that situation would be to dampen the rate at which the time constant
could change, by limiting the write rate. Benjamin Ling asked what would
happen if a provider lied, advertising a larger T than was true.
Roscoe responded that third-party auditors can monitor the time
constant and provide a Consumer Reports-like service to evaluate
providers' claims. Finally, Jay Lepreau asked whether nonmalicious
traffic might cause temporary load spikes, for example, an intrusion
detection service using Palimpsest at the time a distributed worm such
as SQL Slammer was unleashed. Roscoe replied that this would be a
problem, but that hopefully the Palimpsest pricing scheme could help
mitigate it.
During the general question and answer period following the session's
talks, Andrew Hume asked Timothy Roscoe whether a simpler scheme than
Palimpsest could be used to store ephemeral files just like regular
files and cycle them using a generational scheme, eventually deleting
the files that graduated from the oldest generation. Roscoe responded
that Palimpsest provides an easy charging and pricing mechanism, while
standard network filesystems such as NFS do not. Val Henson asked Popescu
what he thought of the "High Availability, Scalable Storage, Dynamic
Peer Networks: Pick Two" talk, in light of the fact that his system is
targeted toward storing data on untrusted hosts. Popescu responded
that they're more interested in environments in which hosts are less
transient than in standard peer-to-peer networks. Jeff Chase observed
that Palimpsest clients have no control over the placement of their
data, and he asked Roscoe whether he thought that was
significant. Roscoe responded that the ability to select specific nodes on which
to store data could be provided by an orthogonal mechanism. Benjamin
Ling asked whether widespread adoption of Palimpsest would be hindered
by the lack of hard guarantees about data longevity. Roscoe responded
that legal contracts akin to service level agreements could be layered
on top of Palimpsest to ease users' concerns about the inherent risk
of data loss. Furthermore, this is really a futures market:
third parties can charge premiums for providing guarantees and taking
on the risk of data loss themselves.
4:00 p.m.–5:45 p.m.: Trusting Hardware (Chair: Dan Wallach; scribe: Matt Welsh)
This session turned out to be the most controversial of the conference,
as two of the three talks discussed the use of secure hardware and
systems such as Microsoft's Palladium architecture.
Certifying Program Execution with Secure Processors
Benjie Chen and Robert Morris, MIT Laboratory for Computer Science
Benjie Chen is motivated by potential uses for trusted computing
hardware other than digital rights management. Since all PCs may
include this hardware in the future, he is interested in exploring the
hardware and software design for such systems. His running example was
secure remote login, such as from a public terminal at an Internet cafe,
where the client machine can attest to the server that it is running
unadulterated software (OS, ssh client, etc.). Of course, this does not
preclude low-tech attacks such as a keyboard dongle that captures
keystrokes, but that is outside the immediate problem domain.
Benjie presented an overview of the Microsoft Palladium, or Next
Generation Secure Computing Base, architecture, which uses a secure
"Nexus" kernel and a secure chip that maintains the fingerprints of the
BIOS, bootloader, and Nexus kernel. A remote login application would
send an attestation certificate, generated by Nexus and the secure chip,
to the server. The issues here are how to keep the Nexus kernel small
and how to verify OS services such as memory paging or the network
stack. Some ways to improve Palladium's security and verifiability were
discussed, such as using a small microkernel that allows attestation of
all OS modules above it, as well as a flexible security boundary (where
some, not all, of the OS modules are verified). There is a connection
with the XOM secure processor work, which prevents physical attacks on
DRAM by storing data in encrypted form in memory and only decrypting it
into a physically secure cache. Borrowing some of these ideas, one could
run the microkernel within the secure processor, which would authenticate all
data transfers to DRAM, and the application, hardware drivers, network
stack, etc., could all be encrypted in DRAM.
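The fingerprint chain behind attestation can be sketched roughly as
follows; the register layout and hash choice are assumptions, but the
principle is that each boot stage extends a running measurement, so
the final digest identifies the exact software stack that booted.

```python
# Sketch of a boot-time measurement chain: each component (BIOS,
# bootloader, Nexus kernel, ...) is hashed into a running register, and
# the final digest serves as the stack's attestable fingerprint.

import hashlib

def extend(measurement, component):
    return hashlib.sha256(
        measurement + hashlib.sha256(component).digest()).digest()

def attest(components):
    m = b"\x00" * 32                    # assumed initial register value
    for c in components:
        m = extend(m, c)
    return m.hex()

stack = [b"bios-v1", b"bootloader-v2", b"nexus-v3"]
fingerprint = attest(stack)
tampered = attest([b"bios-v1", b"evil-bootloader", b"nexus-v3"])
print(fingerprint != tampered)  # True: any modified stage changes the digest
```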
Constantine Sapuntzakis asked whether Nexus was not a microkernel.
Benjie answered that Nexus implements many drivers inside, which raises
the problem of how to verify them all and still keep the security boundary
small. Constantine noted that one difficulty with XOM is that replay
attacks are possible, and Benjie agreed.
Mendel Rosenblum said that secure remote login is not a good example
application, since it doesn't solve the overall security problem (for
example, a camera looking over the user's shoulder can capture
keystrokes). Benjie agreed that not everyone believes in this
application, but there are other applications that drive their design.
Jay Lepreau argued that the real application for this work is
DRM. Benjie said they will continue to support DRM, but they are
interested in exploring other applications as well.
Constantine pointed out that XBOX is supposedly secure hardware, but a
buffer overrun exploit was recently discovered. Benjie replied that
they haven't tried to address this problem; using attestation just
tells you that the code you thought was running is indeed running. It
also turns out that attestation can't deal with the code being
modified at runtime.
Hardware Works, Software Doesn't: Enforcing Modularity with Mondriaan Memory Protection
Emmett Witchel and Krste Asanovic, MIT Laboratory for Computer Science
Emmett proposed efficient, word-level memory protection
to replace page- or segment-based protection mechanisms.
This is motivated by the use of fine-grained modules in software, with
narrow interfaces both in terms of APIs and memory sharing. He argued
that safe languages are not the answer, in part because it is difficult
to verify the compiler and runtime system. Rather, allowing hardware
protection to operate at the word level is much simpler and permits a
wide range of sharing scenarios. For example, when loading a new device
driver into the kernel, the MMP hardware would be configured to permit
the module to access its own code and data, as well as to make calls to
other modules and share very fine-grained memory (e.g., part of a larger
data structure in the kernel). Some of the challenges involve
cross-domain calls through call gates; dealing with the stack (since no
single protection domain "owns" the stack); and automatically
determining module boundaries through information already present in
code, such as symbol import/export information in kernel modules.
Other potential uses include elimination of memory copies on system
calls, specialized kernel entry points, and optimistic compiler optimizations
(e.g., write-protecting an object and running cleanup code if a write
fault occurs).
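A toy model of word-granularity protection is sketched below; the
table representation and fault handling are invented, not MMP's
hardware design, but they show how a module can be granted access to
individual words inside a larger kernel structure.

```python
# Toy model of word-level memory protection: each protection domain has
# per-word permissions, checked on every access.

PERMS = {}   # (domain, word_address) -> "r", "rw"; absent means no access

def grant(domain, addr, perm):
    PERMS[(domain, addr)] = perm

def access(domain, addr, write=False):
    p = PERMS.get((domain, addr))
    if p is None or (write and p != "rw"):
        raise PermissionError(f"fault: domain {domain} addr {addr:#x}")
    return True

grant("driver", 0x1000, "rw")   # the driver's own data word
grant("driver", 0x2004, "r")    # one shared word of a kernel structure
print(access("driver", 0x2004))           # True: read allowed
try:
    access("driver", 0x2004, write=True)  # write to read-only word faults
except PermissionError:
    print("write fault")
```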
Mike Swift asked for clarification of the situation where a cross-module call
might have multiple targets, as with dynamic dispatch. Emmett replied
that the call gate is on the first instruction of the target routine,
not the call site itself.
George Candea asked whether the difference between this technique and
software fault isolation is simply performance. Emmett answered that SFI
is great if you have a very stylized problem, but it has high overhead and
has issues with protecting multiple discontiguous regions
of data. Also, SFI is just not fast: for Emmett's tests, Purify yields
a slowdown of 10–50x, but with MMP it's less than 10%.
Timothy Roscoe asked about the relationship to the SOSP '99 paper on the
EROS system, which had a fast capability system running on normal PCs
with revocation. Emmett replied that EROS had a nice design, and that
the big difference is the level of indirection and how to make it
really fast.
Eric Brewer asked how this approach relates to Multics. Emmett responded
that essentially this is a way to do Multics for multi-GHz processors in
a modern context. Mendel Rosenblum noted that the people who designed x86
segmentation paid attention to the Multics design, but what they built
was the unusable-to-most-software x86 segmentation hardware.
Furthermore, he believed that if they looked at Mondriaan, they would
pervert it into something bad. Emmett didn't find this a substantive
criticism, believing that Intel engineers are more careful these days.
Jay Lepreau argued that MMP does not solve the larger protection domain
issue. For example, a thread may call into another module which ends up
spinning in an infinite loop or deadlocking. Emmett responded that this
is a not a problem that he was attempting to address; he was focusing
on the memory protection issue alone.
Flexible OS Support and Applications for Trusted Computing
Tal Garfinkel, Mendel Rosenblum, and Dan Boneh, Stanford University
Tal's talk returned to the question of using secure hardware for
applications other than DRM, and shared much of the motivation and
background of Benjie's talk. The core problem with open platforms (as
opposed to closed platforms, such as ATMs and cell phones), is that
applications can be deployed across a wide range of existing hardware
but it is difficult to manage trust. Tal proposed the use of
virtual machine monitors to provide a trusted
OS environment. The VMM can run either an "open box" VM
(such as a standard OS) or a "closed box" VM (a trusted system).
For example, one closed box VM might be a virtual Playstation game
console, which uses attestation to prevent cheating in a multiplayer game.
Another potential application is a customized distributed firewall
pushed into the VMM to protect the
network from the host, e.g., by preventing port scanning or IP
spoofing or by enforcing connection rate limits. Tal also discussed
applications to reputation systems and third-party computing (a la SETI@Home).
He concluded the talk with a review of current efforts in this area,
including TCPA, Palladium, and LaGrande.
Mike Swift admitted that he is not a convert to the VMM concept, claiming
that this approach requires that the OS have no bugs.
Tal's response was that attestation tells you what the software stack is,
allowing you to have more confidence in the system, but Mike countered
that this does not make it inherently more robust. Mendel raised
the point that the idea behind VMMs is not to have one secure OS
(for example, so you don't need to use Nexus necessarily), but to allow
different people to use different secure OSes. Jay Lepreau didn't buy the
argument that VMMs are simpler, and wondered how much code there is in
VMware. Mendel explained that for architectures that are virtualizable
(which, unfortunately, does not include the x86), you can get away with
about 10,000 lines of code, not including device drivers. Mendel would
expect future architectures to be more virtualization-friendly.
The political issues surrounding trusted computing platforms raised a
number of interesting (and heated) questions from the audience.
Dan Wallach started off by describing the recent XBOX hack, where that
system was supposedly trusted hardware. Tal and Andrew Hume countered that
there are no guarantees of correctness, just tradeoffs in terms of risk
assessment. Mendel suggested that cheating at Quake was not a big
concern for industry, so this was not of paramount concern in the XBOX.
Timothy Roscoe was disturbed that the three speakers seemed to be too
much in agreement, and raised the question of big protection domains
(e.g., VMs) versus tiny protection domains (e.g., Mondriaan memory
protection). They didn't take the bait, though; Tal said that these
approaches were not mutually exclusive.
Jay Lepreau jumped in at this point and raised the concern that
nobody has yet demonstrated an entire (real) operating system based on
the microkernel model. He again argued that MMP does not solve the whole
domain protection issue, since control flow protection is just as important
as memory protection. Emmett admitted that there is some complexity
involved, but by making memory protection domains both finer-grained and
more explicit, programmers have to think about them more carefully and
will document them in the code. Jay argued that chopping up a monolithic
system in a "half-assed way" makes it more complex, but Emmett countered
that most systems software is already making use of well-defined
modules, simply without strong protection between them.
Val Henson returned to the trusted computing discussion, arguing that
these platforms are more useful for restricting rights (as with DRM)
than for giving us more rights. Benjie argued that the bleak view is
that this hardware is going to get pushed out regardless, so we should
be considering how to use it for something other than DRM. Tal concurred
and said that this work was also about intellectual honesty.
Jay Lepreau argued that Tal and Benjie were really just giving cover to
the real commercial driver for this technology, which is DRM.
Mike Swift wanted to know where the line between research and the
corporate world should be drawn, and whether our community should
support this work at all. Tal mentioned again that their work is just
bringing more information to the table, but Mike drew an analogy with
nuclear weapons research, claiming that expanding knowledge is not the
only side effect of this work.
Dirk Grunwald wondered whether calling this "trusted computing" was like
the rebranding of "out of order execution" as "dynamic execution":
trusted computing cannot deal with hardware attacks, so consumers may be
misled into believing it's safer. Tal pointed out that this is no
different from calling a protocol "secure."
Jeff Mogul made the point that technology tends to reinforce existing
power structures, and that the issue with trusted computing is not
security but, rather, whether the technology reinforces existing power
relationships or diffuses them. He summarized, "Are you
advocating a system that helps the powerful or that helps the weak?"
Eric Brewer claimed that he didn't want "trusted" software on his PC,
since it only enforces a particular kind of trust relationship between
two parties. He would rather have control over his trust relationships,
but trusted computing platforms remove that control. Tal admitted that
there is a real concern about using Palladium as an industry control point.
Timothy Roscoe wrapped up the session by pointing out that this discussion
had been rather "sterile" and had not touched on any of the issues of
politics, lawmaking, or market forces surrounding the DRM and trusted
computing debates.
Wednesday, May 21, 2003
9:00 a.m.–10:25 a.m.: Pervasive Computing (Chair: Geoff Voelker; scribe: Ranjita Bhagwan)
Sensing User Intention and Context for Energy Management
Angela B. Dalton and Carla S. Ellis, Duke University
Angela Dalton talked about employing low-power
sensors to monitor user behavior and then using this information to reduce system energy
consumption. She described a case study called faceoff,
which uses a camera to perform image capture. This is followed by
face detection, and the information is fed into a control loop that
does the system-level energy management by turning off the display.
The authors have built a prototype which uses image capture
and skin detection. A feasibility study to
determine the best-case energy savings faceoff can provide
showed that significant savings are possible. Also,
an important issue is responsiveness in the system. The paper discusses
various ways of increasing responsiveness and describes
several optimizations with that objective.
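The faceoff control loop might be sketched as below; the sensor,
detector, and absence threshold are stubs and assumptions, not the
authors' prototype.

```python
# Hedged sketch of the faceoff control loop: sample frames, run face
# detection, and power the display down once the user has been absent
# for several consecutive samples.

def control_loop(frames, detect_face, absent_limit=2):
    """Yield the display state (True = on) after each sampled frame."""
    absent = 0
    for frame in frames:
        absent = 0 if detect_face(frame) else absent + 1
        yield absent < absent_limit   # off once absence persists

detect = lambda frame: frame == "face"   # stub detector
states = list(control_loop(["face", "empty", "empty", "face"], detect))
print(states)  # [True, True, False, True]
```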
Geoff Voelker asked if the authors had considered use of the same
techniques for security, such as face recognition rather than face
detection. Angela agreed that the system offers other benefits,
such as privacy, but they have not yet explored them.
Mike Swift asked how robust skin detection is, and whether the detector
can be confused. Angela said that the system should detect different skin
colors, but it is not a good way of doing face detection. Ideally, you
would use something more than skin detection for this purpose. However,
it sufficed for the case study.
Access Control to Information in Pervasive Computing Environments
Urs Hengartner and Peter Steenkiste, Carnegie Mellon University
Locating people in a pervasive computing environment can be done at
many levels and many granularities. An access control
mechanism is required. However, access control mechanisms for
conventional information cannot be used as is for such environments.
Urs Hengartner described a people locator service that uses
digital certificates for defining location policies. The authors follow
three design principles: identify information early, define policies
at the information level, and exploit information relationships. Their
approach uses a step-by-step model, in which validation of a client
node is done at every server node. They also use a policy description
language and an information description language in their people
locator service.
Sameer Ajmani brought up the issue of certificate systems being an
administrative nightmare. He said that this is going to create lots
of privacy issues, and that lots of trust issues are assumed. Most
individuals are not going to even look at the policies. In response,
Urs said that policy generation is done behind the scenes. The front end
can be designed depending on the user. It could be a Web browser for an
individual, while it could be something else for service developers.
Andrew Hume said that these systems break the notion of
practical obscurity. Urs said that you could filter data so
that a single node could not do serious damage.
Margo Seltzer asked if the speaker could compare this access control
mechanism to that of the mobile people architecture at Stanford. Urs
said he was not aware of the Stanford system.
Tal Garfinkel said that the information in the system would be regulated
by many different entities. The speaker responded that there are not going
to be that many entities. For example, a university runs the service,
and you have to trust them anyway. Tal asked if there is any way to
shield information in this scenario. The speaker said this could be
done only in a legal way.
* Privacy-Aware Location Sensor Networks
Marco Gruteser, Graham Schelle, Ashish Jain, Rick Han, and Dirk Grunwald, University of Colorado at Boulder
Marco Gruteser gave a brief description of a system that
uses sensor networks to gather information, while making sure that
some degree of anonymity is maintained. Describing the problem, Marco
said that sensor networks could be used to identify the precise location
of certain entities, which could be undesirable. With anonymous
data collection, individuals are harder to identify, and the risk of
accidental data disclosures is reduced, since databases
no longer store the information. Marco described
k-anonymity in database systems and said that the concept could be applied
to location information.
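A minimal sketch of one way to apply k-anonymity to location reports, in the spirit of what Marco described (the specific cloaking algorithm below is illustrative, not the authors'): coarsen the reported region until it covers at least k users.

```python
def cloak(user_xy, all_xy, k, cell=1.0, max_cell=1024.0):
    """Return a grid cell guaranteed to contain >= k users, or None.

    The user's exact position is replaced by the cell, so any of the
    k users inside it could have been the reporter (k-anonymity).
    """
    x, y = user_xy
    while cell <= max_cell:
        # Cell boundaries at the current granularity.
        x0, y0 = (x // cell) * cell, (y // cell) * cell
        inside = sum(1 for (px, py) in all_xy
                     if x0 <= px < x0 + cell and y0 <= py < y0 + cell)
        if inside >= k:
            return (x0, y0, cell)   # report the cell, not the point
        cell *= 2                   # too few users: coarsen the region
    return None
```

This makes concrete the trade-off raised in the Q&A: larger k forces coarser regions, which is why Marco quoted accuracy on the order of 100 meters for k=10.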
Ethan Miller asked how this interacts with finer-grained data
than location information. Would k-anonymity
still be maintained? Marco said that the current work focuses
on location information at this granularity. If you look at different
pieces of information, the problem becomes more complex.
Margo Seltzer asked whether k-anonymity can be preserved when you
aggregate location information obtained from different sensor
networks. Marco said that with multiple sensor networks, maintaining
k-anonymity could be a problem.
Andrew Hume said that k-anonymity is not sufficient
to preserve anonymity. He suggested that Marco consult
statisticians about a better means for doing this. Andrew also asked, for
reasonable values of k, what applications would this have? Marco
responded that for k=10, accuracy is on the order of 100 meters. For many
location-based services, that would be sufficient, for instance,
detecting whether a meeting room is busy.
Ethan Miller asked how one decides a good value of k. Marco said that
it depends on user preferences and on circumstances.
Sameer Ajmani brought up a problem: where one trusts sensor
networks, for example in an office, one usually does not care about
anonymity; on the other hand, outside the office, one would not
trust a sensor network. Marco suggested a certification
process, in which you trust the company building the sensor network or
the government using it.
Essentially, you need to trust the people who deploy the sensor
network.
Bogdan Popescu asked whether the sensors need to talk to each other.
Marco said yes; they assume a standard multi-hop wireless
sensor network, such as Berkeley motes.
Urs asked Angela whether, when multiple windows are open and only one
is active, she had thought about turning off only parts of the screen.
Angela said that there is related work that deals with energy adaptive
displays. Presently this cannot be done. But you could use some form
of gaze tracking to do this. Andrew Hume asked whether the partial
screen turn-off works at the hardware level. Angela responded that
currently there is no hardware that does that. New kinds of displays
are assumed. But you would still need to control it through the system.
Andrew Hume commented that this could be used for covert operations. For example, if a laptop screen shuts down, it means that no one is close to
it. Is there a security aspect to that, because you can detect location
to some extent? Angela said that larger networks could do this, and, yes,
there are lots of security implications.
Val Henson asked what the trend is for built-in sensors. Angela replied
that most devices can now have built-in cameras. In general, these
sensors are low-power, cheap, and becoming pervasive.
Andrew Hume asked what camera resolution is needed to
make this work well. Angela said that detection currently
is skin-color-based and works even with a low-resolution camera.
Geoff Voelker asked Urs if he had thought of foreign users coming into
an administrative domain. Urs responded that since they use SPKI/SDSI
digital certificates, they do not require a PKI and
could give access to anyone.
Val Henson asked Marco how you decide the shape of the location. Marco said that the current assumption is that the sensors are
preconfigured in a room with a preset room ID; the same applies to
a building, and so on.
Jay Lepreau asked Angela whether she had data on how people use laptops,
so that she could evaluate how well her scheme would work with
user behavior patterns. Angela said that more and more people are
using laptops as their primary system. Moreover, energy awareness is
important generically. However, this question is valid for cell phones,
PDAs, etc., which could have widely varying usage patterns. Jay said
that the power management on his laptop is frustrating, because it
is stupid. It would be good to have a diagnostic tool, with the user
being able to guide the system to some extent. Has Angela considered
providing the user with a diagnostic tool? Angela said that you could imagine
a user interface, which could be used to measure the annoyance factor
of the power management, and she agreed that things kicking
in at the wrong time can cause problems.
10:30 a.m.–11:15 a.m.: Storage 2 (Chair: Margo Seltzer; scribe: David Oppenheimer)
* FAB: Enterprise Storage Systems on a Shoestring
Svend Frølund, Arif Merchant, Yasushi Saito, Susan Spence, and Alistair Veitch, Hewlett Packard Laboratories
Alistair Veitch of HP Labs presented "FAB: Federated Array of Bricks." He
described a project that is aimed at making a set of low-cost storage
bricks behave, in terms of reliability and performance, like an
enterprise disk array, but at lower cost and with greater scalability.
The FAB array is built from bricks each of which consists of a CPU,
memory, NVRAM, a RAID-5 array of 12 disks, and network cards. Clients
connect to the array using a standard protocol such as iSCSI, Fibre
Channel, or SCSI, and the storage bricks communicate among
themselves using a special FAB protocol running on Gigabit
Ethernet. The requirements of the array are zero data loss, continuous
availability, competitive performance, scalable performance and
capacity, management as a single entity, online upgrade and
replacement of all components, and higher-level features such as
efficient snapshots and cloning. The principal research challenges are
failure tolerance without losing data or delaying clients,
asynchronous coordination that does not rely on timely responses from
disks or the operating system, and the ability to maximize performance
and availability in the face of heterogeneous hardware. The techniques
FAB incorporates to achieve these goals are a quorum-based replication
scheme, dynamic load balancing, and online reconfiguration.
After outlining FAB's goals and the high-level techniques used to
achieve those goals, Veitch described the quorum-based replication scheme
used to achieve reliability. It uses at least three replicas for each
piece of data, and reads and writes a majority of replicas. It
survives arbitrary sequences of failures, achieves fast recovery, and
can be lazy about failure detection and recovery, as opposed to needing
to do explicit failover. The array configuration that the designers
envision achieves a mean time to data loss of about a million years,
which Veitch described as "at the bottom end of what's acceptable."
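The flavor of the quorum scheme can be illustrated with a toy timestamped majority register (a sketch under simplifying assumptions; FAB's actual protocol must also handle asynchronous coordination, recovery, and lazy failure detection, as described above).

```python
class QuorumStore:
    """Toy majority-quorum register over in-memory replicas."""

    def __init__(self, n_replicas=3):
        assert n_replicas >= 3
        self.replicas = [{"ts": 0, "val": None} for _ in range(n_replicas)]
        self.clock = 0
        self.majority = n_replicas // 2 + 1

    def write(self, val, up=None):
        """A write succeeds only if a majority of replicas accept it."""
        up = set(range(len(self.replicas))) if up is None else up
        self.clock += 1
        acks = 0
        for i in up:
            self.replicas[i] = {"ts": self.clock, "val": val}
            acks += 1
        return acks >= self.majority

    def read(self, up=None):
        """Read a majority and return the value with the newest timestamp."""
        up = set(range(len(self.replicas))) if up is None else up
        if len(up) < self.majority:
            raise RuntimeError("quorum unavailable")
        best = max((self.replicas[i] for i in up), key=lambda r: r["ts"])
        return best["val"]
```

Because any read majority intersects any write majority, the newest timestamp always surfaces, which is what lets the system tolerate failures lazily instead of performing explicit failover.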
George Candea asked whether Veitch could estimate the cost savings of FAB
over traditional enterprise disk arrays. Veitch estimated a 20–30%
cost savings but emphasized that it depends on the
development time and cost. Dan Wallach asked whether today's enterprise
disk arrays are truly expensive to develop and produce, or whether the
prices of arrays are really just a huge markup over the prices of
disks. Veitch replied that disks really are expensive, especially the
kinds used in enterprise arrays, such as dual-ported Fibre Channel
disks. He added that customized hardware ASICs and integration with
firmware contribute to the six-year development cycle between
product inception and delivery, requiring work by hundreds of
people. FAB is trying to save money by doing everything possible in software.
* The Case for a Session State Storage Layer
Benjamin C. Ling and Armando Fox, Stanford University
Benjamin Ling presented "SSM, a
recovery-friendly, self-managing session state store." SSM is specialized for storing session state typically
associated with user interactions with e-commerce systems. The system
assumes a single user making serial access to semi-persistent data. Ling
explained that existing solutions such as filesystems, databases, and
in-memory replication exhibit poor failure behavior or recovery
performance, or both, and that they are difficult to administer and
tune. SSM is designed to be recovery-friendly, that is, it can recover
instantly without data loss, and it is self-managing in terms of handling
overload and performance heterogeneity of components.
SSM is based on a redundant, in-memory hash table distributed across
nodes, called bricks, and stateless client stubs, linked with
application servers, that communicate with the bricks. Bricks consist
of RAM, CPU, and a network interface (no disk). SSM uses a redundancy
scheme similar to, but slightly different from, quorums. On writes, a
stub writes to some N random nodes (4, in Ling's example) and waits for
the first M of the writes to complete (2, in Ling's example); this
scheme is used to avoid performance coupling. The remaining writes are
ignored. Reads are issued to a single brick. SSM is recovery-friendly in that no data is lost so long as no more than M-1 bricks in
a write quorum fail. Moreover, state is available for reading and
writing during a brick failure. SSM is crash-only, using no
special-case recovery code: when a brick is added to the system or
returns to service after a failure, it is simply started up without
any need to run a recovery procedure. SSM is self-managing in that
it dynamically discovers the performance capabilities of each brick,
as follows. Each stub maintains a count of the maximum allowable
inflight requests to each brick, using additive increase to grow this
window upon each successful request and multiplicative decrease to
shrink the window when a brick operation times out. If an insufficient
number of bricks are available for writing, either due to failures or
due to a full window, the stub refuses the request issued by the
client of the stub, thereby propagating backpressure from bricks to
application servers.
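A toy sketch of that write path and window adaptation, with invented names and constants (the real stubs issue requests asynchronously and time out slow bricks; here the puts are sequential for simplicity):

```python
import random

class Brick:
    """Toy diskless brick: an in-memory hash table that may be down."""
    def __init__(self, alive=True):
        self.store, self.alive = {}, alive
    def put(self, key, value):
        if self.alive:
            self.store[key] = value
        return self.alive

class Stub:
    """Toy SSM-style client stub: write to N random bricks, stop after the
    first M successes, and adapt a per-brick in-flight window with AIMD."""
    def __init__(self, bricks, n=4, m=2):
        self.bricks, self.n, self.m = bricks, n, m
        self.window = {b: 4.0 for b in bricks}  # max allowable in-flight requests

    def write(self, key, value):
        candidates = [b for b in self.bricks if self.window[b] >= 1.0]
        if len(candidates) < self.n:
            return False                      # too few usable bricks: refuse (backpressure)
        acks = 0
        for brick in random.sample(candidates, self.n):
            if brick.put(key, value):         # in the real system these go out in parallel
                acks += 1
                self.window[brick] += 1.0     # additive increase on success
            else:
                self.window[brick] /= 2.0     # multiplicative decrease on timeout
            if acks >= self.m:
                return True                   # first M acks win; the rest are ignored
        return False
```

Waiting for only the first M of N writes is what decouples the stub's latency from the slowest brick, the "performance coupling" Ling was avoiding.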
Matt Welsh asked about the capacity of the system, and in particular
the expected lifetime of the data to be stored in the system,
especially in light of the absence of disks. Ling replied that session
state generally lives a few hours to a few days. It is usually about
3KB–200KB per user, and Ling estimated that 10–15 bricks would be able
to support 400,000 users. Andrew Hume asked what is done with the N-M
requests that are not among the first M writes to complete. Ling
explained that these requests are just forgotten, and that they are
handled using a garbage-collection-like mechanism.
* Towards a Semantic-Aware File Store
Zhichen Xu and Magnus Karlsson, HP Laboratories; Chunqiang Tang, University of Rochester; Christos Karamanolis, HP Laboratories
Zhichen Xu described pStore.
He explained that storage systems are in some ways an extension to
human memory, but that computer-based storage systems have been "dumb"
because, unlike humans, they do not associate meanings (semantics) with
the stored data. For example, when humans search, they
consider abstract properties and relationships among objects, and when
they store data, they may group objects into categories or record
only the differences between objects. Moreover, humans discover the
meanings of objects incrementally. These observations motivate the
incorporation of versions, deltas, and dependencies among objects
stored in the semantic-aware file store.
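The versions, deltas, and dependencies Xu mentioned map naturally onto RDF-style subject-predicate-object triples; a minimal illustration, with resource and predicate names invented for the example:

```python
# Subject-predicate-object triples in the spirit of the RDF data model.
# All resource and predicate names below are invented for illustration.
triples = {
    ("paper-v2.tex", "isVersionOf", "paper-v1.tex"),
    ("paper-v2.tex", "storedAsDeltaFrom", "paper-v1.tex"),
    ("paper-v1.tex", "belongsToCategory", "research-papers"),
    ("figure1.eps", "isReferencedBy", "paper-v2.tex"),
}

def related(subject, predicate):
    """All objects linked to `subject` by `predicate`."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}
```

Because the metadata is a set of triples rather than rows in a fixed schema, new predicates can be added as the meaning of objects is discovered incrementally, which is the evolution-of-semantics point Xu made in response to George Candea.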
The semantic-aware file store is intended to create a framework
that captures and uses semantic information for fast searching and
retrieval of data; stores data efficiently by using data compression
based on semantic relationships of data; provides high performance
through semantic data hoarding, data placement, replication, and
caching; and enables highly available data sharing by balancing
consistency and availability according to the data semantics. The
semantic-aware file store uses a generic data model based on RDF to
capture semi-structured semantic metadata. The challenges Xu described
were how to identify the common semantic relations of interest, how to
capture semantic information, how to handle dynamic
evolution of semantics, and how to define the tools and APIs that users and
applications need.
George Candea asked Xu to compare his system to existing products that
store and track metadata about files. Xu said the main difference is
that those systems store metadata in a database, which has a fixed
schema, whereas his system uses semi-structured data and therefore
allows evolution of semantics. Val Henson asked whether the
semantic-aware file store can do better than Citeseer: if she had a
number of research papers in her home directory, would the
semantic-aware file store be able to relate them to each other as well
as Citeseer does? Xu said the answer depends on the tools
used to extract the information from the papers; pStore is a framework
into which you could plug any number of data-extraction tools.
During the discussion that followed the talks in this session, Margo
Seltzer asked Veitch whether in ten years HotOS will have a paper
explaining why FABs are just as expensive as today's storage
arrays. Veitch answered, "I hope not, but who knows?" Andrew Hume
asked Veitch whether the FAB design makes it difficult to predict
performance, as compared to a disk connected to a single machine,
especially in the face of failures due to the large number of
potential interactions among bricks in the FAB. Veitch responded that
the performance can in fact be modeled as a function of factors such
as the overhead of doing a disk I/O, the number of nodes, the desired mean
time to data loss, and so on. He added that even today no storage
system gives you hard-and-fast performance guarantees and that the
more difficult question is how to model overall system reliability and
the optimal tradeoffs among design parameters. Rob von Behren asked
Veitch whether he had considered using non-RAID disks instead of RAID
arrays inside each brick. Veitch responded that by using RAID, the
mean time to data loss is controlled by the failure rate of each brick
rather than the failure rate of individual disks, thereby increasing
reliability. He added that it's very difficult to get good numbers
on actual failure rates of system components, so estimating overall
reliability is difficult.
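Veitch's point about putting RAID inside each brick can be made concrete with the standard back-of-the-envelope MTTDL estimate for a RAID-5 group; the numbers below are illustrative assumptions, not figures from the talk.

```python
def mttdl_raid5(n_disks, mttf_hours, mttr_hours):
    """Back-of-the-envelope mean time to data loss for one RAID-5 group:
    data is lost when a second disk fails while the first is rebuilding."""
    return mttf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)

# Illustrative (assumed) numbers: a 12-disk brick, 500,000-hour disk MTTF,
# 24-hour rebuild window.
brick_mttdl_years = mttdl_raid5(12, 500_000, 24) / (24 * 365)
```

With RAID-5 inside the brick, single-disk failures are absorbed locally, so the inter-brick replication only has to cover whole-brick failures, which is why the brick failure rate, not the disk failure rate, dominates the system's mean time to data loss. As Veitch noted, the hard part is trusting the component failure-rate inputs to such a model.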