

USENIX Windows NT Workshop

SEATTLE, WA

August 11-13, 1997


KEYNOTE ADDRESS

Windows NT to the Max: Just How Far Can It Scale Up?

Jim Gray, Microsoft Bay Area Research Center


Summary by Brian Dewey

Jim Gray of Microsoft's Bay Area Research Center delivered the first keynote address of the workshop. As one of my colleagues remarked, his talk made him sound like he was from the Bay Area Marketing Center: it was woefully short on technical content. However, Gray was able to put together a persuasive argument that NT is a solid choice for large-scale computing.

Interestingly, early in his talk Gray conceded the conventional wisdom that UNIX scales and NT doesn't. He even provided his own "incontrovertible" evidence: a graph displaying throughput of the TPC-C benchmark on a variety of platforms over the past 18 months. Using this metric, UNIX servers have not only consistently performed better than NT servers; they have also been increasing their throughput at a faster rate.

This seems to be pretty damning evidence. However, Gray argued that the performance difference stemmed almost entirely from hardware. UNIX servers run on better hardware than NT. To prove this point, Gray showed the performance of UNIX vs. NT on a six-processor Intel platform. In this chart, NT had comparable absolute performance and scaled better than the UNIX software. Thus, when comparing just the software and not the hardware, NT shines.

Although this argument sounds like an attempt to put the blame on the hardware manufacturers, as much of the blame must go to Microsoft's marketing strategy: NT is targeted at the commodity hardware market. There's a plus side to this: NT is a cheaper solution. Gray pointed this out as well. In a chart as damning to the UNIX advocates as his first one was to Microsoft, he illustrated that an NT solution possesses economy of scale; that is, as you spend more to get higher throughput, your cost per transaction drops. The UNIX solutions, however, exhibit a diseconomy of scale: as you spend more money to approach the impressive throughput numbers cited in the benchmarks, your cost per transaction keeps increasing. This fact makes NT a more practical solution for most problems.

How large a handicap is it for NT platforms not to run on the latest, coolest hardware? Not much, Gray argued. The bulk of his talk presented case studies of impressive computational feats performed by NT systems. He warmed up with the standard laundry list: NT has been used to power systems that handle 100 million Web hits per day, 10,000-user TPC-C benchmarks, a 50GB Exchange store, and 50,000 POP-3 users sending 1.8 million email messages per day; the list goes on. [Editor's note: See Neil Gunther's article on page 9.]

However, the most elaborate example he presented of NT's ability to scale up (i.e., scale to a more powerful, single computer) was the MS Terra Server, Microsoft's proof-of-concept terabyte database. Gray told a delightful tale of how difficult it was to come up with a terabyte of data in the first place, especially data that had to meet the conflicting criteria of being interesting to everybody and not offensive to anybody. Microsoft settled on storing both USGS and Russian satellite imagery in the database. Information from the Expedia world atlas allowed searching. The Web interface to the database was fast and cool; this was an impressive demo with a high "wow" quotient.

Gray also presented a detailed case study of NT's ability to scale out to multiple machines: the billion-transactions-per-day simulation developed for the Microsoft Scalability Day show in New York. The BTD demo involved 45 computers (140 CPUs) that ran the TPC-C credit/debit simulation and sustained one billion transactions over the course of one day. Why should we be impressed with one billion transactions? For comparison, Gray told us that Visa handles only 20 million transactions per day.

Although UNIX servers possess a throughput edge, the combination of economics and NT's capabilities makes it a formidable platform. However (and this is where things sound suspiciously like an infomercial), that's not all! There's an impressive array of technologies that Microsoft is building into future versions of NT. Gray briefly listed these: NT 5.0 will have 64-bit memory addressing; Wolfpack provides 2-node failover; a Coda-like filesystem will allow disconnected access to network files; and Hydra provides timesharing on an NT server.

Even though I didn't like the marketing overtones of this talk, I must admit that the reason why most people were at that workshop was precisely that NT is becoming a dominant force in the market. People cannot afford to ignore it. Gray's talk, while not what I'd been hoping for, was probably the perfect introduction to the three-day workshop.

REFEREED PAPER SESSION

Mangling Executables


Summary by Brian Dewey

This session presented four papers dealing with some aspect of binary rewriting on the NT platform. Overall, this was an excellent session; each speaker did a superb job, and the material was interesting. One project, Etch, discussed binary rewriting as a way to analyze and optimize Win32 programs. However, Etch is heavy on the analyze and short on the optimize. Two other papers dealt with optimization in greater depth. Spike is a project from DEC that performs profile-driven optimization of Alpha NT code. Brad Chen of Harvard University presented a paper on Just-in-Time Code Layout; this technique uses a new heuristic for code layout that does not require profiling information. Finally, Anton Chernoff presented DIGITAL FX!32, a Digital project that allows the execution of x86 binaries on Alpha NT platforms using a combination of emulation and binary rewriting.

Etch

Brad Chen of Harvard University presented Etch, a research project undertaken jointly by Harvard and the University of Washington. It is a tool for instrumenting Win32 binaries, similar to Atom (a tool from DEC that instruments Alpha binaries) and HiProf (a commercial product from TracePoint). By providing a simple interface, Etch tries to hide much of the complexity of the Win32 environment from programmers.

From the programmer's perspective, there are two major steps to using Etch: instrumentation and analysis. During the instrumentation phase, a simple API and series of callbacks give the programmer the opportunity to insert calls to analysis routines at the instruction, basic block, and procedure levels. The analysis routines are the code that will be called during the execution of the instrumented program.

Chen gave a semidetailed example of the steps involved in using Etch to gather an instruction profile. A seven-line instrumentation routine instructed Etch to insert calls to the analysis routine InstReference() before every instruction in the application. InstReference() was a single-line function that incremented a counter stored in an array that was indexed by the PC. While this is a slightly simplified view of a simple tool, it illustrates the effectiveness of Etch at taming program complexity.
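For concreteness, here is a minimal sketch of what such a pair of routines might look like. The Etch* types and calls below are hypothetical stand-ins for Etch's actual instrumentation API (which the paper describes); only the instrumentation/analysis split and the PC-indexed counter follow the example from the talk.

    /* Hypothetical stand-ins for Etch's instrumentation interface. */
    typedef struct EtchModule EtchModule;
    typedef struct EtchInst EtchInst;
    EtchInst *EtchFirstInstruction(EtchModule *module);
    EtchInst *EtchNextInstruction(EtchInst *inst);
    unsigned long EtchInstructionPC(EtchInst *inst);
    void EtchAddCallBefore(EtchInst *inst,
                           void (*analysis)(unsigned long), unsigned long arg);

    #define TEXT_SLOTS (1 << 20)
    static unsigned long counts[TEXT_SLOTS];       /* indexed by instruction PC */

    /* Analysis routine: runs before every instruction of the application. */
    void InstReference(unsigned long pc)
    {
        counts[pc % TEXT_SLOTS]++;
    }

    /* Instrumentation routine: tells Etch where to insert the calls. */
    void Instrument(EtchModule *module)
    {
        EtchInst *inst;
        for (inst = EtchFirstInstruction(module); inst != NULL;
             inst = EtchNextInstruction(inst))
            EtchAddCallBefore(inst, InstReference, EtchInstructionPC(inst));
    }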

Chen proceeded to give three realtime demonstrations of Etch: instrumented versions of "Maze Lord" and "Monster Truck Madness" that generated call graphs, and an animated cache simulation of Notepad. All three demos, while running a tad slow, performed usably well and gathered an impressive amount of data. The call graph tool is similar to the UNIX-based gprof utility and provides useful information; the animated cache simulation, although less useful, was certainly flashy and made for a great demo.

One thing that was not mentioned in the talk but is considered in the paper is using Etch as an optimization tool. In addition to instrumenting an application, Etch has the ability to reorder instructions to achieve better performance.

Chen fielded three questions about Etch. Can Etch handle self-modifying code? No, it can't. Etch cannot handle self-modifying code or runtime-generated code, and there are no plans to add that to the project. However, he does not feel this is a serious limitation.

Can Etch handle dynamically loaded DLLs? Yes, it can. Chen explained the Etch team was proud of how well they handled Windows DLLs. Etch will instrument as many DLLs as it can before the program runs based on its static knowledge. However, if the program calls LoadLibrary() to use a DLL that Etch did not know about beforehand, Etch will invoke itself on the fly and instrument the new DLL.

Can Etch monitor the kernel? No, it cannot. The Etch team is thinking about adding this capability to a future release; they would like to be able to monitor the entire system.

Spike

While Etch peripherally supports optimization, Spike focuses on optimizing Alpha NT code. Spike is a profile-driven optimizer, which is one of the things that make it interesting. Profile-driven optimization (attempting to make frequently executed code paths faster at the expense of less frequently executed paths) isn't new; it's just tough for the end-user to manage. First, the end-user must instrument the executable and all of the program's DLLs, then run the program to collect the profile information; next, generate an optimized executable and optimized DLLs, then remember to run the new version of the program. Spike streamlines this process. Spike's other distinguishing feature is that it optimizes the binary images; there is no need for source code.

To make things easier for the end-user, Spike presents a simple environment for determining which applications should be optimized. Once the user decides to optimize a program (Microsoft Word, for instance), Spike adds it to its database and handles the rest. The user can continue to use Microsoft Word exactly as before.

When Spike sees that a user is attempting to execute a program in its database, it will transparently substitute the instrumented version of the program for the original and gather the profile information. Transparent Application Substitution (TAS) is an interesting bit of sleight-of-hand; because programs may be sensitive to things such as pathnames of the executable, Spike needs to go to some lengths to convince the instrumented program that it is exactly the same application that the user attempted to execute.

Once Spike gathers profile information, it can optimize the executable. This portion of Spike looks just like an optimizing compiler; the input language is Alpha machine code, and Spike builds a standard compiler representation (basic blocks, etc.) of the program. Armed with this representation and the profile information, Spike may reorder basic blocks to keep frequently executed instructions in the icache and minimize branch penalties.

Spike also attempts hot-cold optimization (HCO): along frequently executed (hot) paths, the optimizer may discover that many instructions are executed only on behalf of the cold paths. In these cases, Spike removes the instructions from the hot path and inserts compensation code if necessary. Finally, Spike can make improved register allocation decisions based upon its profile information.
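Spike performs HCO on Alpha machine code, but the idea is easier to see at source level. The before/after sketch below is my own illustration (the function and helpers are invented), not code from the paper:

    struct table { int valid; int nslots; int *slot; };
    void build_error_message(char *buf, unsigned long len, int key);
    void report_error(const char *msg);

    /* Before HCO: the message setup runs on every call, even though only
     * the rarely taken (cold) error path uses it. */
    int get_slot_before(struct table *t, int key)
    {
        char msg[128];
        build_error_message(msg, sizeof msg, key);  /* cold-path work on the hot path */
        if (t->valid)                               /* hot path */
            return t->slot[key % t->nslots];
        report_error(msg);                          /* cold path */
        return -1;
    }

    /* After HCO: the setup has been sunk into the cold path as compensation
     * code, so the hot path executes fewer instructions. */
    int get_slot_after(struct table *t, int key)
    {
        if (t->valid)                               /* hot path, now shorter */
            return t->slot[key % t->nslots];

        {
            char msg[128];                          /* compensation code */
            build_error_message(msg, sizeof msg, key);
            report_error(msg);
        }
        return -1;
    }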

The optimized executable is usually bigger than the original. However, it will be noticeably faster. Their performance numbers showed a 5-25% speedup for the optimized programs, and the real-world applications (as opposed to programs taken from a benchmark suite) showed the larger speedups.

Just-in-Time Code Layout

Brad Chen presented a paper on Just-in-Time Code Layout. Like Spike, it attempts to optimize code layout to improve icache performance. However, it does not require profile information. There are two interesting aspects to the paper: first, the Activation Order (AO) heuristic for code layout; second, the mechanism for dynamic code layout.

Optimizing code layout to improve icache performance is an NP-hard problem. However, compiler writers can use heuristics to improve icache performance. The Pettis and Hansen algorithm attempts to place procedures that frequently call one another close together in memory (e.g., if procedure A() calls procedure B() 10,000 times and procedure C() 100 times, the Pettis and Hansen algorithm would place the code for A() and B() closer together in memory). The Pettis and Hansen algorithm requires profile information to operate. (Spike uses this algorithm to optimize basic block layout.)

Chen's paper proposes an alternate heuristic: Activation Order. As its name suggests, the algorithm arranges code in the order in which the code is called (i.e., the first function called is the first in memory, the second called is the second in memory, etc.). The advantage of this heuristic is that it does not require profile information, and their results show that it performs comparably to Pettis and Hansen. However, this algorithm requires the code layout to occur at runtime. To accomplish this, the link-time generated code segment consists of a series of thunks, one for each procedure.

When a thunk is executed, it performs the following tasks. First, it loads the corresponding code into memory if it isn't already present. Second, it modifies the call site to directly use the dynamically loaded code. Thus, the code copying occurs only once per execution, and the thunk will be activated once per call site.
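The mechanism operates on machine code, but a schematic C rendering of a thunk (every name here is invented for illustration; the real thunks are small stubs in the link-time code segment) conveys the two steps:

    #include <string.h>

    typedef int (*proc_t)(int);

    extern char *layout_cursor;            /* next free byte in the activation-order region */

    struct thunk_state {
        const char   *static_copy;         /* procedure body in the link-time image */
        unsigned long size;
        proc_t        placed;              /* NULL until the body has been laid out */
    };

    /* Conceptual thunk for one procedure.  The first activation copies the
     * body to the next free slot (activation order); every activation then
     * patches its own call site, so each call site pays the thunk cost once
     * and the copy happens at most once per execution. */
    int thunk(int arg, struct thunk_state *st, proc_t *call_site)
    {
        if (st->placed == NULL) {
            memcpy(layout_cursor, st->static_copy, st->size);
            st->placed = (proc_t)(void *)layout_cursor;
            layout_cursor += st->size;
        }
        *call_site = st->placed;           /* future calls bypass the thunk */
        return st->placed(arg);
    }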

In addition to improving cache locality, Just-in-Time Code Layout offers another significant advantage: it can reduce overall memory requirements for an application by up to 50%. This is because only the code that is actually executed will be loaded into memory. The Pettis and Hansen algorithm may be able to provide similar advantages, but Chen was unable to quantify this.

JITCL also avoids a problem that plagues Pettis and Hansen when used in the Win32 environment. Because Pettis and Hansen places all frequently executed code at the beginning of a module, it will actually increase cache conflicts between code in different DLLs. DLLs are an inescapable part of the Win32 environment, so the Pettis and Hansen algorithm tended to increase L1 and L2 cache miss rates for Win32-based workloads.

Someone asked if Just-in-Time Code Layout eliminates code sharing among applications. Yes, it does. If two different applications use a DLL that has been engineered to use the Activation Order heuristic, the two applications would share the thunks (which are in the code segment) and not share the dynamically loaded code.

DIGITAL FX!32

Anton Chernoff presented DIGITAL FX!32, a binary translator of x86 NT executables to Alpha/NT executables. This research was motivated by Digital's desire to get a large application base for its Alpha computers and thus had the following design goals:

  • Transparency. Users should not need to change the way they interact with their programs. In addition to allowing users to simply click on their favorite Intel binaries, DIGITAL FX!32 needs to handle all of the complexities of the Win32 environment, such as DLLs and OLE objects.
  • Performance. X86 code translated by DIGITAL FX!32 needs to perform well. Digital's goal is to get the applications to perform 70% as well as native Alpha applications.
  • Correctness. The Alpha code needs to do the same thing as the original x86 code.

The "Transparency Agent" is the component of DIGITAL FX!32 responsible for ensuring that the user never knows that DIGITAL FX!32 is there. It accomplishes this through a technique referred to as "enabling." Any process on the system must be enabled in order to start another process, and enabling consists of inserting hooks to DIGITAL FX!32 code in place of all of the load/execute Win32 APIs. Thus, whenever an enabled process creates a new process, it goes through DIGITAL FX!32 code. The child process will be enabled before it is executed, keeping this process transparent to the user.

The fact that the "enabled" state cascades from parent to child means that DIGITAL FX!32 can enable the entire system if it can enable all top-level processes. DIGITAL FX!32 accomplishes this by enabling the Service Control Manager and RPCSS at boot time.

The only problem is the shell (explorer.exe). DIGITAL FX!32 edits the registry to start fx32strt.exe as the user's shell; this program starts the Windows Explorer and enables it. All processes that the user will then run will be enabled. However, the Explorer looks at the registry to see if it was the user's shell and behaves differently if it wasn't. To get around this problem, DIGITAL FX!32 needs to temporarily change the registry to let Windows Explorer think it was the shell, then change the registry back so that the DIGITAL FX!32 start program will be run the next time the user logs in. It is, Chernoff admitted, a horrible hack, but it was the only way they could achieve the needed transparency.

The first time a user tries to run an x86 binary, DIGITAL FX!32 uses an emulator. This piece of tuned Alpha assembly gives such good performance that Chernoff claimed that, on multiple occasions, he would turn off DIGITAL FX!32's translation for debugging purposes, forget to turn it back on, and not notice for several days. The emulator does more than run the executable: it gathers profiling information that is used to drive the optimizer.

The final component of DIGITAL FX!32 is the server, which is implemented as an NT service and orchestrates the binary translation. It manages the profile information gathered by the emulator and is responsible for starting the optimizer. Like Spike, the optimizer is a binary-to-binary compiler, and it can perform the same profile-driven optimizations. The server keeps the resulting Alpha images in a database and may delete translated images to conserve disk space.

DIGITAL FX!32 has several limitations. First, it hasn't met its goal of 70% of native Alpha performance, although its performance numbers are still quite good. It cannot handle 16-bit Windows applications, nor does it translate device drivers: users will still need native Alpha drivers for all peripherals attached to their systems. Finally, DIGITAL FX!32 does not support the Win32 Debug API, so a small class of programs (such as x86 development environments) cannot run under DIGITAL FX!32.

There were several questions. Someone asked whether DIGITAL FX!32 can handle programs generated by Microsoft's VC++ 5.0. Yes; VC++ 5.0 has not presented a problem.

Another person asked if the registry hack is really necessary. Yes, it is. The developers at Digital attempted many other solutions, but the registry switch is the only one that worked.

Who handles product support for applications transformed by DIGITAL FX!32? Digital does. They view any incompatibility as a bug and will work hard to fix it. However, the very first thing they do on a product support call is retry the steps that generate the bug on an Intel machine. Chernoff said that in at least 50% of the cases, the bug was reproducible in the original program. "DIGITAL FX!32 provides a feature-for-feature and a bug-for-bug translation of the x86 software," he said.

TUTORIAL SESSION

Available Tools: A Guided Tour of the Win32 SDK, Windows NT Resource Kit, VTune, etc.


Summary by Brian Dewey

Compared to UNIX-based operating systems, NT is an infant. One concern this raises for developers is the availability of programming tools. In this session, Intel and Microsoft teamed up to convince the audience that, yes, you can be just as productive using NT.

Intel's presentation was a lengthy sales pitch for the VTune 2.4 CD. The assortment of material does sound impressive. The backbone of VTune is a tool that passively monitors your system to find code hot spots. When you see peaks in the CPU usage, you can double-click on the chart to get a closer look at what causes the bottlenecks.

Advanced chips and operating systems will offer more opportunities for passive monitoring: VTune can monitor CPU events of the Pentium Pro (cache misses, branch prediction statistics, etc.), and NT 5.0 will notify VTune whenever a module is loaded or unloaded, which increases monitoring accuracy. In addition to passive monitoring, you can use VTune to dynamically simulate sections of code; this will provide a detailed look at processor performance.

The CD also includes "code coaches" that can statically analyze your code for optimization opportunities. This offers two advantages over relying on the compiler's optimizations. First, the coaches are guaranteed to be up-to-date with the latest chip design, something you can't expect from your compiler. Second, the coach can make suggestions for optimizations that the compiler, due to its conservative assumptions, could not implement.

Finally, the CD includes the Intel compilers, the Performance Library Suite, and the Intel chip documentation. Like the code coaches, the compilers will be up-to-date with the latest processor improvements. They offer profile-guided optimization, floating-point optimization, and MMX support. The Performance Library Suite is a set of libraries that are hand-optimized for various Intel chips. Intel will keep these libraries up-to-date with processor improvements.

Louis Kahn of Microsoft told the audience of a large array of tools that will ease the development process. First, he reassured people coming from the UNIX world about the availability of beloved UNIX-based software on NT. Third-party vendors have ported just about every major tool: the GNU suite, X servers and libraries, NFS, etc.

Of course, Microsoft itself makes a large array of development tools for NT. One of the most important is Visual Studio. This bundle includes a string of "Visual"s (Visual C++, Visual Basic, Visual J++, Visual FoxPro, and Visual SourceSafe) that covers most developers' coding needs. Microsoft's Win32 SDK and DDK provide essential documentation and tools. The NT resource kits contain software developed at Microsoft that, for one reason or another, was not included in the release of the operating system. (For example, many of the programs are utilities written by testers that Microsoft decided were useful enough to make available to customers.) The kits are a potpourri of useful tools that will make a developer's life easier. However, Kahn stressed that the resource kits did not undergo the stress tests that NT did; they are "use at your own risk."

The Windows Scripting Host (WSH) will allow developers to move beyond the MS-DOS command language, previously the only scripting tool built into NT. The WSH will execute JavaScript and Visual Basic scripts, and suppliers of other scripting tools, such as Perl or Python, will be able to integrate their products into this architecture. WSH will be integrated with NT 5.0 and is available for download for NT 4.0. Finally, there is the gargantuan Microsoft Developer's Network (MSDN). A universal subscription to the MSDN CDs will buy you nearly every product that rolls out of Redmond: compilers, operating systems, Office & BackOffice, beta software, the SDKs and DDKs, and volumes of documentation. It isn't cheap, but it can be an invaluable resource for serious Windows development.

You can find more information about VTune at <https://developer.intel.com/design/perftool/vtune>, about third-party NT tools at <https://www.microsoft.com/ntserver/tools>, and about the Windows Scripting Host at <https://www.microsoft.com/management>.

REFEREED PAPER SESSION

Driver Tricks


Summary by George Candea

This session was chaired by Carl Hauser from Xerox PARC. The audio/visual difficulties challenged the speakers considerably. [Editor's note: Some PowerPoint presentations just took off on their own; the speakers would have to stop PowerPoint and recover their place.] The session covered a number of different driver hacks that were used to extend Windows NT.

The first speaker, Bill Carpenter (VenturCom), presented "The RTX Real-Time Subsystem for Windows NT." RTX is intended for kernel-mode tasks that have hard realtime performance requirements; it attempts to solve some of the problems that realtime application developers face when using NT. RTX comes in two forms: extensions to the normal Win32 objects and the Real-Time Subsystem (RTSS).

The talk emphasized RTSS, which tries to make the development of realtime applications easier by taking advantage of NT's rich set of features as well as the availability of off-the-shelf applications. The authors wanted to make the realtime extensions conform as closely as possible to the standard NT interfaces; to this extent, RTX's API supports a subset of the Win32 API as well as additional functionality specific to realtime operations.

Programs that use RTSS are linked with the RTSS library instead of the Win32 libraries and are loaded as NT drivers. However, these processes are not allowed to call NT's driver interface, or they would wreak havoc; they have to limit themselves to using RTX's API. Modifications to the NT Hardware Abstraction Layer (HAL) were necessary in order to allow RTSS to get control of the processor in response to interrupts. As a consequence, RTSS gets to make decisions before anything else in the system. (In Carpenter's words: "RTSS can guarantee response time by stopping NT dead in its tracks.") Three types of objects can be shared between Win32 and RTSS programs: semaphores, mail slots, and shared memory objects. Communication between RTSS and NT is achieved via two unidirectional message queues.

I see some important advantages in using RTX: it offers some of the performance guarantees that realtime tasks require, and RTSS maintains most of the neat NT object model, which should make programming easier (note, though, that VenturCom has not implemented resource accounting or protection). It seems that debugging can be difficult at times because RTSS-specific objects cannot be remotely accessed by processes in the other subsystems (e.g., the debugger).

The cohabitation with Win32 processes can also lead to other problems, such as priority inversion (e.g., an RTSS process could be blocked on a mutex held by a Win32 process). The use of a modified HAL as well as having all the realtime tasks running in kernel mode may not be very appealing for nondedicated systems that run other applications besides the realtime ones.

The second speaker was Jørgen Hansen from the University of Copenhagen. His talk, entitled "A Scheduling Scheme for Network Saturated NT Multiprocessors," presented a detailed analysis of a problem that occurs in interrupt-driven operating systems: when a device's interrupt rate becomes very high, little or no progress is made in the system.

The authors analyzed NT's behavior when network load becomes very high, especially on the receiving end. The network interface ends up generating a high number of interrupts, which results in the CPU spending most of its time running Interrupt Service Routines (ISRs). This leads to the ISRs literally taking over the entire system and significantly affects its performance. As a consequence, user threads are starved because ISRs have a higher absolute priority than any user code.

Furthermore, if the starved threads are the consumers of network data, network buffers can overflow and lead to packets being discarded. This leads to a vicious circle in which CPU time is used to receive data that are subsequently thrown away; Mogul and Ramakrishnan described this a few years ago as "receive livelock." The problem can become particularly acute when the host has multiple network interfaces.

Hansen also described another phenomenon, which he termed "thread pinning." This happens on multiprocessors when a processor repeatedly interrupts the network data consumer in order to process the interrupts from a network interface. Because the consumer thread is interrupted rather than preempted, NT's scheduling policy does not move it to another processor, even though other processors may be idle. This problem can become worse if the consumer thread is part of a multithreaded application and happens to hold a lock on some shared data; the other threads cannot make progress either.

The solution offered by the authors was an extension of Mogul and Ramakrishnan's elegant "polling thread" idea: during high network loads, use a separate thread to poll the network instead of interrupts. They devised a two-layered scheme in which, at low network loads, interrupts are used (in order to maintain low latency), but when the network load exceeds a certain threshold, they switch to a network polling thread (to avoid thread starvation). As a method to decide when thread starvation may occur, they considered three alternatives: monitor the length of network data queues that are emptied by the user thread, monitor the interrupt rate, and measure the percentage of processor time used by the Deferred Procedure Call (DPC) of the device driver. They chose the last one.
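In outline, the switching logic looks something like the sketch below. The helper functions, the driver context, and the threshold values are all hypothetical; only the policy (measure the DPC's share of CPU time and hand receive processing to a schedulable polling thread when it climbs too high) comes from the paper.

    /* Hypothetical driver context and helpers. */
    struct nic {
        volatile int polling;               /* nonzero while the polling thread owns receive */
    };

    int  dpc_cpu_percentage(struct nic *n); /* DPC's share of CPU over a recent window */
    int  recent_packet_rate(struct nic *n); /* packets/sec seen lately (assumed metric) */
    void disable_rx_interrupts(struct nic *n);
    void enable_rx_interrupts(struct nic *n);
    void wake_polling_thread(struct nic *n);
    void drain_rx_ring(struct nic *n);      /* hand received packets up the stack */

    #define POLL_THRESHOLD  40              /* assumed values, not the paper's */
    #define RESUME_RATE     500

    /* Receive DPC, used while in interrupt mode. */
    void rx_dpc(struct nic *n)
    {
        drain_rx_ring(n);
        if (dpc_cpu_percentage(n) > POLL_THRESHOLD) {
            disable_rx_interrupts(n);       /* stop the interrupt storm */
            n->polling = 1;
            wake_polling_thread(n);         /* an ordinary, schedulable thread */
        }
    }

    /* Body of the polling thread, used while in polling mode. */
    void poll_loop(struct nic *n)
    {
        while (n->polling) {
            drain_rx_ring(n);
            if (recent_packet_rate(n) < RESUME_RATE) {
                n->polling = 0;
                enable_rx_interrupts(n);    /* low load: back to low-latency interrupts */
            }
        }
    }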

The results showed that the two-layer system is effective in maintaining a stable throughput during high network load, and it does not incur a latency penalty during low network load. Systems using locks performed better than before, and there was no more thread pinning. However, the fixed threshold scheme does not perform well when either the network load or the amount of time applications spent per packet varies significantly. The authors also suggested that NT integrate support for both interrupt handling and polling, but that risks transforming a (still) clean system into a hodgepodge.

A member of the audience asked what the contribution of this work was, given that Mogul and Ramakrishnan had already addressed the problem. Hansen said that they had extended the idea to multiprocessor systems (and solved the thread-pinning problem) as well as gone through the experiment of modifying an already existing network driver so that it implements the two-level scheme.

The next paper, "Coordinated Thread Scheduling for Workstation Clusters Under Windows NT," was presented by Matt Buchanan (University of Illinois). He described a way to implement demand-based coscheduling for parallel jobs across the nodes of a cluster of NT machines.

Coordinated scheduling attempts to schedule threads running on different nodes in a cluster in such a way that, whenever they need to communicate with each other, they are both running. This eliminates the context switch and scheduling latencies. Given that today's high-performance networks have latencies only in the tens of microseconds, coscheduling can significantly reduce communication latency. However, coscheduling in clusters is a hard problem because it represents a compromise between the demands of the parallel tasks and the demands of the local interactive processes.

The authors used P.G. Sobalvarro's demand-based coscheduling (DCS) algorithms. The natural way to implement DCS would be to modify the operating system's scheduler, but given the goal of using DCS in clusters running off-the-shelf operating systems (such as NT), this was not feasible. Therefore, they combined a user-level messaging library with a kernel-mode device driver and a hardware-level control program running on the network card. When a message arrives, the control program decides whether to try to preempt the currently running thread and, if yes, issues an interrupt that invokes the device driver. The driver then boosts the priority of the thread that it wants to schedule, hoping that it will get scheduled. Fairness of the CPU scheduling is critical, and it is ensured by a user-level fairness monitor.
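Conceptually, the arrival-time path reduces to something like the following sketch. Every name here is hypothetical; the real implementation splits this work between the control program on the network card, the kernel-mode driver, and the user-level fairness monitor.

    struct incoming_msg { int dest_thread_id; };

    int  currently_running_thread_id(void);                /* hypothetical query */
    void boost_thread_priority(int thread_id);             /* done by the driver */
    void record_boost_for_fairness_monitor(int thread_id); /* undone later for fairness */

    /* Invoked (conceptually) when a message arrives for a local parallel thread. */
    void on_message_arrival(const struct incoming_msg *msg)
    {
        if (currently_running_thread_id() == msg->dest_thread_id)
            return;                         /* already coscheduled: nothing to do */

        /* NT's scheduler cannot be told directly what to run, so the driver
         * boosts the recipient's priority and hopes it gets scheduled. */
        boost_thread_priority(msg->dest_thread_id);
        record_boost_for_fairness_monitor(msg->dest_thread_id);
    }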

One of the lessons learned was that it is very difficult to dictate scheduling policy to the NT scheduler mainly because, in Buchanan's words, "NT scheduling is very well sealed inside the kernel." The lack of appropriate interfaces forced them to use a hack instead of a clean implementation. Unfortunately, the performance results shown were not very convincing, and Buchanan acknowledged that they need to conduct a broader array of experiments.

One of the questions from the audience addressed the problem of scalability, and Buchanan gave a brief "future work" answer. Another participant asked what happens in the case of threads that do not get scheduled as expected. The answer was that this may generate some overhead, but the threads would synchronize later and achieve coscheduling.

The last speaker in the session was Galen Hunt, currently with Microsoft Research. The work he presented, "Creating User-Mode Device Drivers with a Proxy," described a mechanism that allows user-mode drivers to act as kernel-mode drivers under certain conditions.

The architecture of his system consists of a kernel-mode proxy driver and a user-level proxy service. For each user-mode driver, the proxy driver sets up a stub entry and a host entry. The path of an I/O Request Packet (IRP) is from the application through the NT executive to the stub entry, then to the proxy driver, to the host entry, into the NT executive again, to the proxy service, and finally to the user-mode driver.

Hunt presented a number of sample drivers: a virtual memory disk (similar to a RAM disk), an Echo filesystem (not related to DEC SRC's Echo) that "mirrors" another filesystem through the driver, and an FTP filesystem that allows mounting of remote FTP servers as local filesystems. The measured performance was worse than for kernel-mode drivers, but that was no surprise, because the user-mode drivers require twice the number of user/kernel boundary crossings that kernel-mode drivers require. Hunt is currently involved in developing a toolkit that would allow the creation of user-mode filesystem drivers. He cautioned, though, that it would be crazy to replace NTFS with a user-space filesystem, due to performance considerations.

User-mode drivers definitely have some great advantages: they can use all the Win32 libraries, and they can be developed and debugged using standard development tools (they can even be written in Visual Basic or Java). Additionally, user-mode drivers can block, be single-threaded, not be reentrant, and not worry about cache or memory management, unlike kernel-mode drivers. These drivers are great for experimenting, and they can be used to emulate nonexistent devices. In addition, Hunt proposed the appealing idea of designing the drivers as Component Object Model (COM) objects that export COM interfaces for the I/O they support. This gives uniformity, and one can even aggregate these components and have drivers inherit functionality from other drivers.

Another advantage is that there were no modifications made to the kernel, which makes it easy to distribute. However, this technique is limited to drivers that do not need kernel-mode access to hardware; the performance of user-mode drivers is comparable to kernel-mode drivers only when physical I/O latency dominates computation and exceeds kernel crossing time.
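To make Hunt's COM suggestion concrete, a user-mode driver interface can be sketched as a table of function pointers that one driver implements and another delegates to. The interface below is invented for illustration, not taken from the paper.

    typedef struct IDriver IDriver;

    struct IDriverVtbl {                   /* COM-style interface: a vtable of I/O operations */
        int (*Read)(IDriver *self, void *buf, unsigned long len,
                    unsigned long long offset);
        int (*Write)(IDriver *self, const void *buf, unsigned long len,
                     unsigned long long offset);
    };

    struct IDriver {
        const struct IDriverVtbl *vtbl;
    };

    /* A filter driver that overrides Read (say, to mirror another
     * filesystem) and inherits other behavior by delegating to the
     * driver it aggregates. */
    struct MirrorDriver {
        IDriver  base;
        IDriver *inner;
    };

    static int Mirror_Read(IDriver *self, void *buf, unsigned long len,
                           unsigned long long offset)
    {
        struct MirrorDriver *m = (struct MirrorDriver *)self;
        /* ...adjust or log the request here, then delegate... */
        return m->inner->vtbl->Read(m->inner, buf, len, offset);
    }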

PANEL SESSION

Do You Need Source?


Summary by George Candea

Discussing the need for access to the NT source code was definitely one of the hottest sessions of the conference. The panel was organized by Thorsten von Eicken (Cornell) and consisted of Brian Bershad (Washington), Geoff Lowney (DEC), Todd Needham (Microsoft's evangelist), Margo Seltzer (Harvard), Nick Vasilatos (VenturCom), and Werner Vogels (Cornell). It was definitely useful and entertaining. It ended just before Bershad and Needham would have started to throw punches at each other. The motto of the panel seems to have been "If you don't have source, how can you do any work? If you have source, how do you figure your way around it?"

Todd Needham started the discussion by pointing out that Microsoft is interested in providing good source licensing in order to promote research, but it also wants to keep its competitive advantage and continue making money. Currently, Microsoft has 37 research-related source licensees (universities and national labs), of which half a dozen use the source code simply for reference. There is also a commercial license available.

It turns out that if one develops something while holding a research license and then wants to include it in a product, one needs a commercial license. Currently, there is no need for students to sign a non-disclosure agreement (NDA) as long as the university has an agreement in place with the students regarding intellectual property rights. The licensee retains all intellectual property rights, but Microsoft gets a license to any software that uses the source code. The license is really meant only for research. One can exchange source code only with institutions that hold a source code license.

Needham mentioned that there are really two ways to get source code: either by having the full source or by using the Software/Driver Development Kits (SDK/DDK) combination. It seems that one other high-demand issue is how to write and install a filesystem in NT. Needham also said that the documentation for NT source is the source itself. (Having hacked for a summer on the NT kernel, I can attest to the fact that the NT kernel and executive code are pretty well documented.)

Seltzer was the next speaker, and she set out to answer the following three questions:

  • Do we need source?
  • Why should Microsoft want us to have source?
  • What are the obstacles?

In answering the first question, she pointed out that, although NT makes it easy to collect a lot of data, it is sometimes difficult to correlate results collected from different sources. In addition, a major part of operating system research consists of making hypotheses based on measurements and then verifying them against the source (the typical UNIX approach). So, because of the lack of NT source code, most of the papers reporting research done under NT tend to be wishy-washy when it comes to proving their theories.

The second question's answer is simple: universities are a source of great ideas, interesting code, and bug fixes. Also, students using NT source code will be well versed once they graduate and can make tremendous contributions to the vendors. Besides, research results are directly relevant to vendors.

Finally, a big problem is that there is a large bureaucracy both in universities and at Microsoft. Microsoft requires the university, not the principal investigator, to sign the license (as with other source code licenses), and that is hard to do. The university should not be liable for students stealing source code. Also, there are Export Act constraints (e.g., foreign students working on projects), which is why, for instance, Harvard University does not do any top-secret work for the government. One question that clearly arises is whether approval from Microsoft will be needed before submitting a paper describing work done under NT. That would be very bad.

Lowney was next. Because of some miscommunication, he did not know exactly what the panel was supposed to be about, so he talked about how to do work without having source code. He spoke about Spike, PatchWrx, DCPI, and Atom. At the end, he mentioned that most researchers make modifications that are small and in familiar places, so one possibility would be for Microsoft to provide a "research build" of NT. In addition to that, making executable instrumentation very simple and defining interfaces to interesting modules would be salutary.

Vogels brought a healthy dose of humor to the panel. He used handwritten slides and, after describing the projects he had worked on, concluded that he would not have needed source code for any of his nine projects. He pointed out that the DDK is really for hardware vendors and that its examples are useless for researchers.

Vogels cautioned that NT consists of millions of lines of code and only six pages of documentation, which barely tells you how to get it off the CDs and how to build a kernel from scratch at the top level. You need a 5GB disk (I easily managed to fit it in less than 3GB along with the entire build environment). This amount of code makes it difficult to search for strings; findstr /s /I doesn't really cut it when you're talking millions of lines of code. The MS Index Server, although excellent for Web pages, crashes on source code; and when it works, it comes up with 1,000 irrelevant matches.

The conclusion was that Vogels's group used the source code only as documentation (there is no other documentation for NT), as a source of examples, and as a way to understand the behavior of NT. It turned out to be useful for debugging, and it led to the discovery of interesting APIs that are not documented or available in Win32.

Vasilatos wanted to convince the audience that VenturCom is not a source code licensee that creates unsuccessful products. [Editor's note: Not all of VenturCom's products have been successful, apparently.] It seems they have a license only for the HAL, and they focus on doing embedded systems installations.

Bershad's talk focused on the statement that you don't necessarily need code in order to do cool research (such as Etch and Spin x86). He then addressed the issue of why you might want to have source code: primarily to look at things (documentation, samples, measurements, analysis) and to build stuff (do derivative work that is better, cheaper, and faster as well as new stuff). However, there are reasons why you might not want source: it stifles your creativity and places golden handcuffs around your wrists (Microsoft owns everything). The question is not really whether you need source. What you really care about is "do you need information" about the system you're doing research on. And the answer is categorically yes: you need a good debugger, documentation, etc. Simply shipping the source code won't help.

At this point the panel started accepting questions from the audience and a multiperson discussion ensued. Thorsten added that the Microsoft Developer's Network (MSDN) actually has a lot of documentation on the kernel APIs. Someone suggested providing the source code on microfiche, the way VMS did. However, searching it would clearly be a problem. Another member of the audience pointed out that opening the source code would increase people's trust in NT's security. Needham's view on this was that, even though Microsoft doesn't want to achieve security by secrecy, it needs to fix some bugs before releasing the code to universities.

Another important point was that installing NT in a university could be the first large-scale system for which they don't have source code. Having the source proved very useful in the past, and not having it could be a barrier to adoption in organizations. A system administrator said that, when he recommends NT to his customers, he tells them he cannot guarantee that he will solve all their problems. With UNIX, though, it's different: he knows all the internals by heart and can quickly fix the problems that arise.

It turned out to be hard to say whether having source is better than not having it. Historically, we've seen only systems with source. Also, most people who got source licenses do not use source, because the SDK and DDK contain lots of unencumbered source code.

Needham mentioned that Custer's Inside Windows NT book is being updated by an outside writer who has access to the source code, and the goal is to make it a Magic Garden Explained type of book for NT.

Someone mentioned that there isn't a lot of educational value to having source: having students look at NT in order to learn about operating systems is like having them look at C++ to learn about programming languages. Seltzer responded that NT wouldn't be used in an undergraduate-level OS course, but further on (e.g., graduate level) students gain lots of depth when they look at a real, live OS. Vogels followed up, pointing out that Linux is very successful among students because they know they have full control over what is on their desktop and that encourages them to experiment with new things (as opposed to using a simulator). Having seen real systems lets students do incremental work without redoing everything.

Seltzer then ran a straw poll in which she presented the scenario of two students, A and B. A had worked only with toy operating systems, whereas B had worked with a real OS. The question was which one the audience would hire. The unanimous answer was B.

At this point, the conversation started heating up. Needham mentioned that in the commercial market, Microsoft seeks a 100% share; but in research, they're not after that.

Bershad said that, in order to have interest in licensing source code, there must be a stake in it. Needham replied that NT still needs to compete. So Bershad asked who NT really competes with and what would happen if Microsoft released the source code. There was no clear answer. Bershad also said that what made Microsoft successful is not the technology as much as its marketing and vision. So what do they lose if they release source? Needham said that Microsoft is new to the university licensing business and the first time they released code was two years ago, so they are still learning and are cautious.

Bershad said the big problem that makes the source license intractable is that, if you do something with NT source, you can't distribute your work. So a suggestion was to model the NT source license after the UNIX licenses. Needham said that the Solaris license doesn't give the freedom Bershad was talking about and that the GNU public license is not something that Microsoft can do. Bershad then said that what we really need is information about the internals.

Seltzer gave an example: if she comes up with a cool filesystem under NT, it becomes Microsoft's property, and Microsoft wouldn't let her put it in Solaris. Researchers don't like that. Needham said that was indeed derivative work, and the license would not allow it to be used in commercial products unless a commercial license were purchased. He added that derivative work is what one does as a student (when looking at the NT source) but not what one does based on what one has learned.

KEYNOTE ADDRESS

What a Tangled Mess! Untangling User-Visible Complexity in Window Systems


Summary by Brian Dewey

Rob Short delivered the second keynote address of the workshop. His talk previewed the ways NT 5.0 will decrease the complexity of system management; he told us of the new version's goals and both the obstacles and technologies related to those goals.

One of the most touted features of NT 5.0 is Plug-and-Play, the ability to attach or remove a hardware device to or from your computer and have it work without manual configuration. Short, in this segment of his presentation, merely tried to convince the audience that this was a tough problem. On the surface, this might not seem so bad; after all, a typical computer consists of perhaps two dozen pieces of hardware, and resource allocation is a well-defined and understood problem. In spite of this, 10% of hardware changes will break a user's system.

Two main problems harass Microsoft's Plug-and-Play developers. First is the sheer number of devices. Short told us that NT 5.0 supports nearly 5,000 base system designs, over 4,000 add-in cards, and around 1,200 printers. And these are modest numbers. Short hinted that Windows 95 supports roughly three times as many hardware devices. The immense number of combinations is an inherent obstacle to Plug-and-Play.

Compounding this is the second problem: the lack of hardware standards. When standards do exist, few vendors implement them completely. Short explained this is a by-product of the economics of hardware production: a vendor, faced with the choice between shipping a device that partially implements a spec and taking extra time to redesign the device, is almost always better off shipping. Consequently, the burden of making the devices work falls on the software developers.

It's quite a burden; it impairs the ability to implement the most fundamental resource allocation algorithms. For example, when the Plug-and-Play architecture parcels out address space to devices, it needs to deal with cards that have only a limited number of options and others that will alias different memory addresses. Although Plug-and-Play is conceptually straightforward, it seems to be one of the largest obstacles to the completion of NT 5.0.

Even if Plug-and-Play removes the hassles of hardware configuration, users would still be faced with a Byzantine application installation process. In a way, this is the other edge of NT's legacy; you can trace this problem back to the goals of DOS and early versions of Windows. These systems ­ designed for a single-user, single-computer environment ­ had an extremely poor separation between user, application, and system resources. NT, while poised to inherit the business of these systems, also inherited the lack of structure. Thus, it seems that every application installs files into the Windows directory, and many of them will have DLLs with the same name. When one version overwrites another, an application that once worked may now be broken.

The solution to this problem is to impose structure on applications. The eventual goal is to have better-behaved applications, and Microsoft has published new application guidelines to help developers. In the meantime, NT will need to force older applications to obey the new rules. To accomplish this, NT 5.0 will lock down the system directory; NT will perform a little directory sleight-of-hand to fool applications that insist on writing to system areas. Applications will be distributed in self-contained packages that will facilitate installation and removal, and Microsoft will develop a toolkit that can transform existing, structureless applications into packages.

From a technology standpoint, the most interesting aspect of NT 5.0 is the support for large installations: corporations with thousands of computers on desktops (and increasingly, on employees' laps). Microsoft is attempting to solve two large problems with its large installation technologies. First, existing Windows systems require the end-users (i.e., the thousands of corporation employees) to also be system administrators. Second, existing Windows systems don't provide the tools that the designated administrators need to watch over the thousands of users.

To address these issues, NT 5.0 will provide facilities for automatic installation and updates of the operating system and applications; additionally, NT 5.0 will make it easier to keep the software consistent across the corporation. Administrators can assign policies and applications to groups, an essential feature when managing thousands of users. NT 5.0 will also support roaming profiles (users may log on to any machine and get their customized settings) and system replacement (when a computer goes down because of hardware failure, a spare can take its place and get the proper settings automatically).

NT will rely on several new pieces of technology to meet these goals. The first is a Coda-like filesystem. Under this scheme, the master copy of every file will reside on a server; to increase the efficiency for both the client and the server, the client keeps files cached on the local hard disk. The locally cached files also allow users to continue working when their computers are disconnected from the network (say, in case of a network failure or when roaming with a laptop). Upon reconnection to the server, the filesystem will reconcile changes made to local files with the master copies stored on the server.

To increase the practicality of this solution, NT will also include a single-instance store in the server filesystem. If the server detects that two files are identical, it stores only one copy on the server. This will be a "copy-on-write" file: a user who attempts to change a unified file will get a private copy to modify. The combination of the Coda-like filesystem and the single-instance store will allow nearly the entire C:\ drive to be a network cache; because the application files will be shared by most users, the single-instance store will minimize the space impact on the server. All applications will be centralized on the server, so administrators will have a much easier job updating applications and ensuring the consistency of applications across the corporation.

The caching and single-instance store also make remote boot an attractive corporate option. By keeping all applications and data stored on a centralized server, a brand-new NT machine can be connected to the network, turned on, and get all of its data automatically. Because the data will be kept on the local disk, this new NT machine will be operable even in the case of network failure, and the common case (i.e., not a brand-new computer) will not require extensive data transfers from the remote boot server. Ideally, this technology will allow administrators to plug a machine in, turn it on, and have it work ­ true zero administration.

Microsoft is doing a lot to make NT systems easier to administer. Some in the audience thought it was odd that Microsoft was adding features to NT in an attempt to combat system complexity. However, the features being added to 5.0 are not merely a marketing wish list; they provide critical capabilities to an operating system that's outgrown the single-user/single-computer environment. The market will judge if this effort works. And given Microsoft's success in the market, I suspect NT 5.0 will be well received.

REFEREED PAPER SESSION

Performance


Summary by Brian Dewey

The papers in this session all dealt with some aspect of performance under NT. Two talked about measuring performance, one about high-performance uncompressed video, and one about adding real-time performance to Windows NT.

Measuring Windows NT: Possibilities and Limitations

Yasuhiro Endo of Harvard University presented this paper; it argued for the development of a new methodology to measure the performance of NT systems. Most benchmarks measure throughput; however, with NT, as with any graphical environment, what users want is quick interactive feedback instead of impressive throughput numbers.

On top of that, the things that tend to infuriate users are the tasks that they expect to be quick but for some reason take a long time. Benchmarks are especially ill suited to diagnose any system with this behavior, because they typically use statistical methods to smooth over any anomalies. Further, the conditions under which today's benchmarks run exhibit little resemblance to the environment in which the computers are commonly used; the test machines are usually disconnected from the network and rebooted before each test to eliminate extraneous factors in the benchmark numbers.

What NT systems need to monitor, Endo argued, is the number of times the system exhibits anomalous behavior that aggravates the user: the instances when the user expects a quick response from the computer yet stares at the hourglass cursor for several seconds or minutes. In addition to merely noting the number of times these situations arise, Endo would like to discover the reason for the slow response time.

To accomplish this, he proposes passively monitoring the entire system. When users experience anomalous behavior, they notify the monitoring tool either through a mouse click on a special icon or (this was my favorite part of the talk) a pressure-sensitive pad that they can punch. Upon receiving this notification from the user, Endo's tool would dump a detailed log of the system's state over the past several seconds (a sketch of a possible log record follows the list), including:

  • Per-event latencies. The system will keep a log of how long it takes for every event (mouse click, network packet arriving, etc.) to be processed by the computer.
  • Thread status. The system will log which threads are running and which threads are blocked.
  • Kernel profile. The tool will closely watch the OS kernel for the duration of this intense monitoring period and write profile information to the performance log.
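As a rough idea of what one entry in such a log might contain, here is a possible layout; this is my guess at a plausible record, not the authors' design.

    enum event_kind { EV_MOUSE_CLICK, EV_KEYSTROKE, EV_NET_PACKET, EV_TIMER };

    /* One entry in the rolling log that would be frozen and dumped when the
     * user signals an annoyingly slow response. */
    struct event_record {
        enum event_kind    kind;
        unsigned long long arrived_us;     /* when the event was delivered        */
        unsigned long long handled_us;     /* when its processing completed       */
        unsigned long      thread_id;      /* thread that handled the event       */
        unsigned long      blocked_mask;   /* which watched threads were blocked  */
    };

    #define LOG_ENTRIES 4096               /* covers the last few seconds of activity */
    static struct event_record recent_events[LOG_ENTRIES];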

Using this information, Endo would then attempt to discover the precise cause for the slow response time. However, Endo has not implemented this ambitious plan; the obstacle is the lack of NT source code. Although not necessary for most of the data gathering, Endo argues that the source code is indispensable when attempting to analyze the resulting data. Without source code, he claimed he would be forced to be a "natural scientist": presented with anomalous behavior, he could make a hypothesis as to its cause and devise more tests to confirm that hypothesis. That is the standard operating procedure for a natural scientist; a computer scientist, he believes, would just check the source code for confirmation of a hypothesis. Further, without source, he would never be able to validate his hypotheses, no matter how much he tested them.

Microsoft will license its source code; however, the licensing agreement is unacceptable to Endo and the Harvard lawyers. The primary sticking point is the confidentiality agreement; Endo has understandable reservations about signing something that may prevent him from publishing his results in the future. The lawyers have quibbled over the signature authority; Microsoft wants the university to sign the agreement, but the university believes it is the job of the principal investigator.

I was disappointed by this presentation, but I liked the paper. I thought the design of a new testing methodology was both interesting and useful, and I wish Endo had been able to implement his design and provide us with results. However, the balance of the presentation was very different from the balance of the material in the paper.

In the paper, Endo and Margo Seltzer go in depth into the proposed design and relegate the complaints about the source licensing agreement to a single, short paragraph. However, in the presentation, there was less detail about the methodology and proportionately more time spent discussing the flaws in the way Microsoft does business with the research community. Thus, I was left with the impression that Endo was making more of a political statement than a contribution to the research community. I'm glad to say the paper proved me wrong.

Measuring CIFS Response Time

A second paper picked up on the response time theme: "Adding Response Time Measurement of CIFS File Server Performance to NetBench," presented by Karl Swartz of Network Appliance. The impetus for the paper was the fact that NetBench, the most widely used PC-oriented benchmark, measures only fileserver throughput. (This stands in contrast to SPEC SFS, used to analyze the performance of NFS fileservers, which measures both throughput and response time.) The paper described an addition to the NetBench measurements that accounted for response time.

Swartz did not have access to NetBench source code, so it was impossible to modify the benchmark itself. He overcame this problem by putting a packet sniffer on the network. The packet trace was then analyzed offline. By matching client requests with server responses, he was able to compute response time.
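
The offline matching step is conceptually simple. The sketch below is my own illustration, not Swartz's analysis code; it assumes each trace record carries the client's address and the SMB multiplex ID that CIFS uses to pair a response with its request, and it reports the elapsed time between the two packets.

    #include <stdio.h>

    #define MAX_PENDING 4096

    struct pending {
        unsigned long  client_ip;  /* client that issued the request   */
        unsigned short mid;        /* SMB multiplex ID from the header */
        double         sent_at;    /* timestamp of the request packet  */
        int            in_use;
    };

    static struct pending table[MAX_PENDING];

    /* Record a client request seen in the packet trace. */
    void saw_request(unsigned long ip, unsigned short mid, double ts)
    {
        int i;
        for (i = 0; i < MAX_PENDING; i++) {
            if (!table[i].in_use) {
                table[i].client_ip = ip;
                table[i].mid = mid;
                table[i].sent_at = ts;
                table[i].in_use = 1;
                return;
            }
        }
    }

    /* Match a server response against the pending table and report the
     * response time for this request/response exchange. */
    void saw_response(unsigned long ip, unsigned short mid, double ts)
    {
        int i;
        for (i = 0; i < MAX_PENDING; i++) {
            if (table[i].in_use &&
                table[i].client_ip == ip && table[i].mid == mid) {
                printf("response time: %.3f ms\n",
                       (ts - table[i].sent_at) * 1000.0);
                table[i].in_use = 0;
                return;
            }
        }
    }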

Armed with the ability to measure both response time and throughput, Swartz gave some preliminary numbers comparing Network Appliance's F630 fileserver with a Compaq ProLiant 5000 running NT 4.0. (Interestingly, this is the only place that NT directly enters this paper; the NetApp F630 runs proprietary software and doesn't handle the NT extensions to the CIFS protocol.) Not surprisingly, the F630 performed well compared to the Compaq on both the throughput and the response time metrics.

What was surprising was that the Compaq's throughput, after dropping off once the benchmark exceeded 20 clients, started increasing once the working set exceeded the amount of memory on the server. This seeming paradox was explained by the newly obtained response time numbers, which showed a dramatic increase in response time over the same period. Swartz hypothesized that NT switched algorithms and sacrificed response time to improve throughput when the server was under a heavy load.

High-Quality Uncompressed Video over ATM

This was a difficult presentation to follow, in part because the sound system wasn't working well and in part because Sherali Zeadally, the presenter, spoke at an amazingly rapid clip. Luckily, I had the paper to refer to when transcribing my sketchy notes!

This research addresses the issue of sending uncompressed video over a network. Although the paper and the talk briefly touched on all of the major components of a viable system that uses uncompressed video, such as the large amount of disk storage needed (1GB for 45 seconds!) and tools for multimedia editing, the focus was on the network bandwidth requirements.

Zeadally spent a large amount of time in his presentation justifying the need for uncompressed video. His argument rests on two points. First, the delivery of uncompressed video reduces computational overhead. Second, and more important, uncompressed video is artifact-free; this is a crucial benefit for applications such as medical imaging. (Lossless image compression, which is also artifact-free, doesn't gain enough compression to be worth the computational overhead, Zeadally argued.)

The major obstacles to a high-bandwidth application such as uncompressed video have been slow networks and slow workstation busses. These obstacles are eroding, but today a designer of such an application still needs to rely on custom or proprietary hardware to deliver the required performance. Zeadally's research attempts to deliver uncompressed video using an off-the-shelf, open architecture. His prototype system uses DEC Alpha workstations running Windows NT 4.0 with a PCI bus and an OC-3 ATM network.

To test the system, Zeadally used video captured at 15 frames/sec; sending this over the network required throughput of 110 Mbits/sec (the ATM network was capable of 155.52 Mbits/sec). He measured application-to-application throughput for both TCP/IP and UDP/IP, and both protocols were able to deliver the required rate. The TCP/IP test kept CPU utilization at around 55­60%; for UDP/IP, the utilization was around 50%. Zeadally explicitly notes in the paper that these results exonerate Windows NT from the charge that it cannot deliver high performance over an ATM network; he attributed the low throughput numbers in previous researchers' results to other bottlenecks.
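
The 110 Mbits/sec figure is roughly what full frames at that rate imply. For example, assuming 640x480 frames at 24 bits per pixel (my assumption; the talk did not spell out the capture geometry):

    640 x 480 pixels/frame x 24 bits/pixel x 15 frames/sec
      = 110,592,000 bits/sec, or about 110 Mbits/sec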

As with Endo's talk, I was much more satisfied with Zeadally's paper than with his presentation. During the talk, I was too busy keeping up with the breakneck speed of delivery to process the contributions that the paper made. This research is an interesting proof of concept for the viability of network applications based on uncompressed video, and it provides an amazing example of what's possible with commodity computer technology. Unfortunately, my ears proved to lack the bandwidth required to process this talk and reach this conclusion in realtime!

Dreams in a Nutshell

Steven Sommer presented Dreams, a set of realtime extensions to NT 3.51. Traditional realtime systems and conventional operating systems work in very different environments. One of the largest jobs of a conventional operating system is to protect processes from one another. A realtime system can assume that all of the processes in the system are cooperating toward a common goal and therefore do not need the types of protection that a conventional OS provides. However, a realtime system needs to ensure that the different processes on the system are able to meet their realtime deadlines. The goal of Dreams is to combine both worlds by adding the capability for "temporal protection" to NT: processes can now specify realtime deadlines, and NT will protect those deadlines from other processes, just as it protects the address space and other resources of processes.

The building block for temporal protection is the "Transient Periodic Process." A TPP has a period, which specifies how frequently the process needs to run, a deadline, which specifies how long the process has after the start of the period to get its job done, and an expected execution time, which is how long it thinks it needs each period to accomplish its task. When an application wishes to create a TPP, it sends a request to the process manager, which in turn talks to the reservation manager.

The reservation manager performs a schedulability test for the TPP. If it passes this test, then the process is accepted. At this point, the operating system guarantees that the process will get its reserved time and that it will miss its deadline only if it uses all of its expected execution time. A TPP is said to "overrun" if it uses all of its expected execution time without completing. The Dreams scheduler may allow an overrun process to continue executing, but only if there is no other nonoverrun TPP that is ready to execute. The Dreams system has a schedule enforcer that will preempt overrun processes if necessary.
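
The summary above does not say exactly what test the reservation manager applies, but a classical utilization-based admission check conveys the idea: admit a new TPP only if the total fraction of the CPU reserved by all admitted TPPs stays under a schedulable bound. The sketch below uses the TPP parameters just described; it is illustrative only and should not be read as Dreams' actual algorithm.

    #include <stddef.h>

    struct tpp {
        double period;        /* how often the process must run (ms)     */
        double deadline;      /* time after period start to finish (ms)  */
        double expected_exec; /* reserved execution time per period (ms) */
    };

    /* Illustrative admission test: admit the candidate TPP only if the
     * sum of reserved CPU fractions (expected execution time / period)
     * of all admitted TPPs plus the newcomer stays below a utilization
     * bound.  (A bound of 1.0 is the classic limit when deadlines equal
     * periods; Dreams' real reservation manager may use a sharper test.) */
    int admit_tpp(const struct tpp *admitted, size_t n,
                  const struct tpp *candidate, double bound)
    {
        double u = candidate->expected_exec / candidate->period;
        size_t i;
        for (i = 0; i < n; i++)
            u += admitted[i].expected_exec / admitted[i].period;
        return u <= bound;
    }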

Although the realtime extensions are interesting in their own right, the project's implementation makes contributions that are relevant outside the realtime community. First, Sommer's team implemented most of Dreams as an NT subsystem; therefore, they needed to make only minor modifications to the NT Executive and the LPC mechanism. Even the Dreams scheduler (which selects the realtime thread to run next) lived in the subsystem. This led both to a clean design and to code that was easier to test and modify than if it had been placed in the kernel.

In this respect, Dreams is a persuasive case study for researchers who want to extend the capabilities of NT in similar ways. The two-tiered scheduling design that resulted from the Dreams scheduler living in the subsystem hid much of the scheduling complexity and made for a simpler model. Further, Dreams needed a system of priority inheritance to ensure that if a realtime thread was waiting for a regular system thread to complete, the system thread would inherit the realtime priority. The priority inheritance improved the performance of the system as a whole, and Sommer argued that it would make a valuable addition to NT even in the absence of realtime needs.
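
The priority-inheritance idea can be sketched in a few lines of Win32 code. This is my own illustration of the general technique, not the Dreams implementation: before a realtime client blocks on work being done by a lower-priority server thread, it boosts the server to its own priority and restores it afterward.

    #include <windows.h>

    /* Boost the server thread so it cannot be starved by medium-priority
     * threads while a realtime client waits on its completion event,
     * then restore the server's original priority. */
    DWORD wait_with_inheritance(HANDLE server_thread, HANDLE done_event)
    {
        int client_prio = GetThreadPriority(GetCurrentThread());
        int server_prio = GetThreadPriority(server_thread);
        DWORD result;

        if (server_prio < client_prio)
            SetThreadPriority(server_thread, client_prio);   /* inherit */

        result = WaitForSingleObject(done_event, INFINITE);  /* block   */

        SetThreadPriority(server_thread, server_prio);       /* restore */
        return result;
    }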

INVITED TALKS AND PANEL

Building Distributed Applications ­ CORBA and DCOM


Summary by George Candea

This session was moderated by Carl Hewitt (MIT) and consisted of Peter de Jong (HP) and Nat Brown (Microsoft).

Hewitt started by giving an overview of CORBA and COM. He said COM was what Microsoft used and CORBA was what everyone else used. He then pointed out that, because of market pressure, CORBA needs to bridge to COM and vice versa. Thus, CORBA is evolving to include features that COM has and CORBA lacks (such as unique identifiers for interfaces and unique naming of object factories). The reverse is true as well: COM is missing capabilities that exist in CORBA (e.g., unique identifiers for objects, well-systematized runtime information from repositories, and class hierarchies).

Another problem is that certain features (e.g., persistence) work differently in CORBA and COM. Hewitt noted that both CORBA and COM lack transparency and simplicity: transparency may be "heaven in this business," but both systems are highly complex and used mostly by wizards. This leads to numerous opportunities for errors, exceptions, misunderstandings, and subtleties, and it makes cross-platform development very challenging.

An interesting point Hewitt made was that there is a new kid on the block pushing both CORBA and COM to evolve even faster: Java. Java has nice features such as garbage collection (within a few years it will even have distributed garbage collection) and a lot of runtime metainformation, which COM lacks (a COM object cannot be asked at runtime what its methods are). Java components are also alive during development, before the code is ever compiled. So COM and CORBA have to accommodate Java. No more Interface Definition Languages (IDLs)! Everybody hates IDLs, and Java lets you avoid them.

de Jong essentially discussed a "top ten list" of CORBA's advantages: language heterogeneity, components, transports, application coordination, reuse of services, interoperability, interworking, computation tracking, resource tracking, and scalability. He said the heterogeneity of languages on CORBA was excellent, because it supports C, C++, Java, Ada, Smalltalk, and Cobol. (PARC's ILU also supports Common LISP and Python.)

Brown then talked about COM, emphasizing its scalable programming model: in the same process you use fast, direct function calls; on the same machine you use fast, secure IPC; across machines you use the secure, reliable, and flexible DCE-RPC-based DCOM protocol. He then introduced DCOM as COM++. DCOM adds pluggable transports (TCP, UDP, IPX, SPX, HTTP, Msg-O) between client machines and servers and pluggable network-level authentication and encryption (NT4 security, SSL/certificates, NT Kerberos, DCE security), does connection multiplexing (single port per protocol, per server process, regardless of number of objects), is scalable, and uses little bandwidth (header is 28 bytes over DCE-RPC; keep-alive messages are bundled for all connections between machines).

Further areas of research for COM seem to include high-level language integration, easier application management, and deep, robust extensibility. Brown also mentioned what he thinks is wrong with CORBA: its focus on cross-node and cross-network reuse and integration, which is impractical for horizontal reuse and integration; an incomplete specification (e.g., the marshalling format of certain data structures, and the implications of missing services such as naming, events, and lifetime management); and an architecture that was not designed for extensibility.

A number of questions followed. One of the first asked what the real long-term solution to authentication would be, especially in the context of interoperability. The answer was that this is a hard problem because of object delegation, which raises the question "who is the object talking for?" Brown said that DCOM is just taking a stab at the security and authentication problem using role-based security. Hewitt added that such systems need to include auditing because they are extremely complex.

Someone mentioned that last October Microsoft had said it would give the ActiveX technology and specifications to The Open Group for integration. Brown said The Open Group indeed has all the source code and specs but is moving slowly. The questioner countered that The Open Group claims Microsoft is the reason for the delays. The question remained up in the air.

When someone asked for a comparison of the scalability of the two models, neither de Jong nor Brown could make a convincing argument that they scaled well.

Another question asked how much traffic the keep-alive messages generated in DCOM. Brown said that every two minutes there is a 40-byte UDP packet (with security disabled) going between every pair of machines (whenever no logical connections exist). Over TCP the keep-alive traffic would be even higher (they need to fix this).

REFEREED PAPER SESSION

Distributed Systems


Summary by George Candea

The first paper in this session was "Brazos: A Third-Generation DSM System." The work was motivated by the observation that technological advances in networks and CPUs have made networks of workstations a viable replacement for bus-based multiprocessors. A "third-generation" DSM (distributed shared memory) system, which uses relaxed consistency models and multithreading, seems particularly appropriate for networks of multiprocessor PCs.

One of Brazos's core components is an NT service that must be installed on all the Brazos machines; this service is responsible for receiving and authenticating incoming DSM requests. In addition, to give system threads a way to update memory pages without changing a page's protection, Brazos uses a modified device driver that mimics the UNIX mmap() call.
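
For comparison, the standard user-level analogue of mmap() on NT is the Win32 file-mapping API; Brazos needed a driver precisely because this user-level path was not sufficient for its protection-bypassing updates. A minimal sketch (my own, not Brazos code):

    #include <windows.h>

    /* Minimal user-level analogue of mmap() on NT: create a pagefile-
     * backed section and map a read/write view of it into the process. */
    int main(void)
    {
        HANDLE mapping = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                           PAGE_READWRITE, 0, 4096, NULL);
        char *view;

        if (mapping == NULL)
            return 1;

        view = (char *)MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, 4096);
        if (view == NULL)
            return 1;

        view[0] = 'x';                 /* the mapped page is now writable */

        UnmapViewOfFile(view);
        CloseHandle(mapping);
        return 0;
    }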

There are a number of differences between UNIX and NT that are relevant to implementing a DSM system: NT has native multithreading support but no signals, it uses structured exception handling, and its TCP/IP stack is accessed through a user-level library (WinSock).

Brazos attempts to take full advantage of NT features, such as using multithreading to allow computation and communication to occur in parallel, using multicast as a means for reducing the amount of data sent over the network, and using scope consistency to alleviate false sharing of pages when threads are actually updating separate data elements. Multicast turns out to work especially well in time-multiplexed networks, such as Ethernet, because the cost of sending a multicast packet is the same as sending a regular packet. Brazos also tailors data management at runtime, as a function of the observed behavior.

One significant advantage of Brazos, and good DSM systems in general, is that the programmer can easily write programs that access memory without regard to its location. However, the programs must be linked with a static Brazos library. Brazos does increase performance, but not in a very dramatic way. The performance slide did not have the axes labelled, but if I interpreted it correctly, the biggest speedup (of 1.64) was obtained on an application that takes advantage of the scope consistency model.

The second paper was "Moving the Ensemble Communication System to NT and Wolfpack," presented by a humorous Werner Vogels (Cornell). Werner opened his talk by asking, "If you were in the emergency room, would you trust NT to run the systems that take care of you? What about if you were on a plane? What about trusting it to drive your NYSE network? Our goal is to make you feel comfortable with these situations." Ensemble attempts to add reliability, fault tolerance, high availability, security, and realtime response to clusters of computers.

The emphasis of the talk was on the issues that came up during the migration of Ensemble from UNIX to NT platforms. For example, the new Ensemble is coded in OCaml (a dialect of ML), which allows the authors to use a theorem prover to verify correctness. The OCaml runtime under Win32 had to be extended to accommodate the different interface semantics of Win32 (e.g., files and sockets are different types of objects under Win32, whereas they belong to the same type in UNIX). But most of the porting time went into developing a common build environment for UNIX and NT. Ensemble source is currently maintained under UNIX, and the necessary Win32 make and dependency files are generated at checkout time.

The NT version of Ensemble has COM interfaces, which allow for increased flexibility. Applications using a standard DCOM interface can get transparent replication from Ensemble. The authors also used Ensemble to strengthen Wolfpack by adding higher availability and scalability (through software management of the quorum resource), support for hot standby (by using state-machine replication techniques), and programming support for cluster-aware applications.

The performance slide, which showed encouraging results for Ensemble, assumed that all the Ensemble servers were local. Werner acknowledged that if the servers were distributed over a WAN, the performance results would be completely different. One of the questions asked what mechanisms made Ensemble cheaper than DCOM. Werner said that the marshalling in DCE-RPC is very generic and expensive, but Ensemble's protocol itself (as well as the marshalling) is much cheaper than DCE-RPC. Also, the way the object resolver on the server side works is less complex than the one distributed by Microsoft.

REFEREED PAPER SESSION

We're Not in Kansas Anymore


Summary by George Candea

Partha Dasgupta from Arizona State University presented the paper entitled "Parallel Processing with Windows NT Networks." He was one of the speakers who believes NT has many features that make it better than UNIX. He described (with fancy, colorful slides and animations) the techniques his group used and the problems they faced when moving their parallel processing systems to Windows NT.

When porting Calypso, they came across some important differences between UNIX and NT, most notably that NT does not support signals but instead uses structured exception handling, has native thread support, lacks a remote shell facility, and expects applications to be integrated with the windowing system. Also, the NT learning curve is very steep for UNIX hackers.

Chime is a shared-memory system that supports Compositional C++ on a network of NT workstations; its goals include structured memory sharing and nested parallelism. Built from the very beginning on NT, Chime took advantage of NT features that, according to Partha, are more elegant than their UNIX counterparts: user-level demand paging, support for manipulating thread contexts, and asynchronous notification. Across the projects, threads turned out to be an elegant solution for process migration, distributed stacks, and segregating functionality. Other advantages of NT are the good program development environment and tons of online documentation. The end results obtained with Calypso and Chime under NT were comparable to the results obtained on UNIX.

One take-home lesson Partha suggested was not simply to modify applications for NT and recompile but to redesign them in an "NT-centric" way, so they can take advantage of the operating system's features. He also cautioned that Microsoft's terminology can be confusing for UNIX people: the "Developer's Network" (MSDN) is not a network, the "Developer's Library" is not a library, the "Resource Kit" contains nothing about resources, and "Remote Access" does not let you execute anything remotely.

The following two papers generated a lot of discussion and were followed with great interest by the audience. The first, "OPENNT: UNIX Application Portability to Windows NT via an Alternative Environment Subsystem," was presented by Stephe Walli of Softway Systems. He started his talk with Walli's First Law of Application Portability: "every useful application outlives the platform on which it was developed and deployed." Walli emphasized the need to write applications to a particular model of portability (such as POSIX); porting to new platforms that support that standard is then much easier.

There are a number of ways in which UNIX applications can be ported to NT: complete rewriting, linking with a UNIX emulation library, fiddling with the NT POSIX subsystem, or using the OpenNT subsystem. Elementary programs can often simply be recompiled under NT and will work fine, but most real applications use operating system resources, which makes the port much more difficult. Using the POSIX subsystem can sometimes be a source of surprise: major aspects work as expected (e.g., signals), but many details differ from "UNIX-style" POSIX.

The goals of OpenNT were, among others, to provide a porting and runtime environment for migrating source code from UNIX to NT platforms, to ensure that any changes introduced into the source code do not make it less portable (by being NT specific), and to ensure that NT's security is not compromised. OpenNT is an NT subsystem consisting of the subsystem executable, a terminal session manager, and a dynamic link library. OpenNT currently supports POSIX 1003.1 (including terminals), the ISO C standard library, mmap, System V IPC, cron, curses, Berkeley sockets, System V and Berkeley signals, etc.

Win32 and OpenNT share a common view of the same underlying NT File System (NTFS); OpenNT does not add any UNIX-ish filesystems but layers some functionality above NTFS. POSIX permissions are mapped from a file's ACLs. There is no /etc/passwd or /etc/group, and the standard /usr and /bin do not necessarily exist, so applications that assume their existence would have to be modified. Security and auditing features available in NT are available to OpenNT applications as well. An OpenNT terminal is a Win32 console. Cut-and-paste works flawlessly between Win32 and OpenNT applications. The OpenNT X11R6 server is a Win32 application. A telnet daemon ships with OpenNT; it is a direct port of the real telnetd.

Some of the performance results are interesting: CPU-bound applications exhibit the same performance under Win32 and OpenNT, and they run faster under OpenNT than under a traditional UNIX. Disk performance under OpenNT is close to that under UNIX. OpenNT is outperformed by Win32 on small block I/O but does better than Win32 on large blocks. Also, socket throughput seems to be the same, independent of which system is being used.

The next paper, "U/WIN ­ UNIX for Windows," was presented by David Korn from AT&T Labs. He talked about the UNIX interface layer that he wrote on top of NT and Win95. Korn said there are few if any technical reasons to move from UNIX to NT and that the main motivation for his work was to serve those people who need the large collection of existing Windows software and the more familiar GUI but still want to run their favorite UNIX applications without having to make Win32-specific changes. The result of this was U/WIN ­ a set of libraries, headers, and utilities.

According to Korn, Microsoft has made the POSIX subsystem useless because there is no way to access any other functionality in addition to the 1990 POSIX 1003.1 standard. OpenNT, which tries to enhance this subsystem, still does not allow mixing of POSIX and Win32 calls.

Korn's U/WIN takes a different approach from OpenNT: he designed a UNIX interface that wraps around Win32. Thus, U/WIN consists of two DLLs (which implement the functions in sections 2 and 3 of the UNIX man pages) and a server process named UMS (which generates security tokens and keeps /etc/passwd and /etc/group consistent with the registry). UNIX applications can be linked with the libraries and (if this step works) run under NT. U/WIN supports Universal Naming Convention (UNC) paths, fork() and exec(), special file names (e.g., /dev/null), and absolute file references (e.g., /usr/bin). Signals are supported by having each process run a thread that waits on an event and is woken up whenever the process has a signal to read. The termios interface is implemented with two threads connected via pipes to the read and write file descriptors of the process. stat() information and the setuid/setgid bits are stored using the multiple-data-streams feature of NTFS. Sockets are implemented on top of WinSock.
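
The signal mechanism Korn described can be sketched with ordinary Win32 primitives. The code below is my own illustration of the technique (an event plus a dedicated waiting thread), not U/WIN source; in U/WIN the sender would go through the UMS server rather than calling a local function.

    #include <windows.h>

    static HANDLE sig_event;          /* signaled when a signal is pending */
    static volatile LONG pending_sig; /* number of the pending signal      */

    /* Each process runs one of these: it sleeps on an event and wakes up
     * whenever a signal is posted to the process. */
    static DWORD WINAPI signal_thread(LPVOID arg)
    {
        (void)arg;
        for (;;) {
            WaitForSingleObject(sig_event, INFINITE);
            /* ...look up the handler for pending_sig and invoke it... */
        }
    }

    /* Posting a signal: record its number and wake the signal thread. */
    void post_signal(int signo)
    {
        InterlockedExchange(&pending_sig, signo);
        SetEvent(sig_event);
    }

    void init_signals(void)
    {
        sig_event = CreateEvent(NULL, FALSE, FALSE, NULL); /* auto-reset */
        CreateThread(NULL, 0, signal_thread, NULL, 0, NULL);
    }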

Some of the problems encountered while writing U/WIN were NT's filesystem naming (which does not allow certain characters in file names), the presence of special files (e.g., aux and nul), line delimiters in text files, inconsistent Win32 handle interfaces, inconsistencies in the way Win32 reports errors, etc. Korn also brought up an interesting point: if a UNIX emulation layer is considerably slower than Win32, then programmers will rewrite their applications to use the native Win32 calls, thus rendering the emulation libraries useless in the long run.

There are currently 175 UNIX tools (including yacc, lex, and make) that have been ported to NT using U/WIN. There are also a number of outstanding problems that still need to be worked out (e.g., authentication, concurrency restrictions, and permissions). Performance of U/WIN at this point is not very good. Although there is no loss in I/O performance, fork() is about three times slower than on UNIX, vfork() is about 30% slower, and file deletes are about two times slower. [Editor's note: Articles by Korn and Walli appear in this issue.]

KEYNOTE ADDRESS

Operating System Security Meets the Internet

Butler Lampson, Microsoft Corporation


Summary by George Candea

This keynote speech focused on defining what security is today, how operating systems achieve security, how networks achieve security, and how the two efforts can be put together.

Lampson pointed out from the very beginning that computer systems are as secure as real-world systems ­ neither more nor less. This translates into having good enough locks to prevent "the bad guys" from breaking in too often, having a good enough legal system that punishes "bad guys" often enough, and generally witnessing little interference with daily life. Computer users face the normal dangers of vandalism, sabotage, theft (of information, money, etc.), and loss of privacy. These plagues are typically caused by bad programs and people, where "bad" can be either hostile/malicious or buggy/careless.

In spite of the seemingly acute need for security, we still don't have "real" security. This is because the danger, overall, seems small, so people prefer to pay for features rather than for security. In addition, setting up secure systems is complicated and painful.

At the level of the operating system, users assume there is a secure communication channel to/from them. The OS then authenticates users by local passwords, and access to each resource is controlled using, for example, access control lists (ACLs).

The only difference in network security, according to Lampson, is authentication. In distributed systems, security becomes hard because the systems are very big and consist of heterogeneous and autonomous parts that interact in complex ways. Such systems are also designed to be fault tolerant, meaning that they could be partly broken but still work properly; this makes authentication hard. Some systems try to circumvent these problems. For example, Web servers simplify things by establishing a secure channel via SSL, thus reducing the problem to that of OS security. Web browsers authenticate servers by SSL in conjunction with certificates (note that DNS lookup is not a secure way to authenticate servers). Browsers also authenticate programs by verifying digital signatures.

Lampson gave an overview of how OS and network security can work together, using the concept of principals ­ abstractions of people, machines, services, groups, etc. Principals can make statements, and they can speak for other principals. For secure distributed systems, we need to have network principals be on OS ACLs, allow network principals to talk for local principals, and assign secure IDs to network principals. As an example, he briefly described SDSI (Simple Distributed Security Infrastructure) and SPKI (Simple Public Key Infrastructure).
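
A toy rendering of the "speaks for" idea may help (this is my own illustration; SDSI and SPKI statements are signed certificates and are far richer than string comparisons): a request from a network principal is allowed if that principal appears on the resource's ACL directly, or if it speaks for a local principal that does.

    #include <string.h>

    /* A recorded delegation: "speaker" speaks for "subject". */
    struct speaks_for { const char *speaker; const char *subject; };

    static int on_acl(const char *p, const char **acl, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            if (strcmp(acl[i], p) == 0)
                return 1;
        return 0;
    }

    /* Allow access if the principal is on the ACL, or speaks for a
     * (local) principal that is. */
    int access_allowed(const char *principal,
                       const char **acl, int n_acl,
                       const struct speaks_for *del, int n_del)
    {
        int i;
        if (on_acl(principal, acl, n_acl))
            return 1;
        for (i = 0; i < n_del; i++)
            if (strcmp(del[i].speaker, principal) == 0 &&
                on_acl(del[i].subject, acl, n_acl))
                return 1;
        return 0;
    }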

The talk was followed by an avalanche of questions, which clearly indicated the great interest that security generates and the fact that few people really know what security is all about. Lampson's first answers indicated he believes that firewalls are "the right thing." He also pointed out that people running <www.microsoft.com> should worry a lot about denial of service attacks, but not as much about the content, which can be easily regenerated. However, this situation could change in the future. He also expressed frustration with the fact that distributed systems research doesn't get deployed due to security concerns (the Web is not a real distributed system).

Lampson was asked what he thinks about the security of incoming email attachments, and he answered that, essentially, any form of executable code should carry a digital signature that is verified before the code can gain access to your system. In the context of auditing, someone asked what would happen if an attacker got root access by some obscure means and then destroyed the audit trails. Lampson said that root needs to be fully audited as well, and if the system allows the audit trails to be tampered with, "you're in soup."

The last set of questions had to do with distributed security systems. A member of the audience asked what the universal solution to revoking certificates would be. The answer was that one will always need to rely on some sort of timeout after which certificates are revoked.

Someone asked whether Lampson thinks worldwide authentication systems will become popular and whether they are the right model. He answered that, as long as the communicating parties agree on an encryption key, the system doesn't necessarily need to be worldwide. Another question concerned the vulnerabilities of such worldwide systems; the answer was (1) the risk of the system being compromised and (2) the difficulty of getting people to agree on anything.

The last question referred directly to SDSI and asked how the system could be policed. Lampson said the legal system will work at its own pace (a couple of decades) and provide a framework for this. He also added that reliable auditing across the Internet is not possible.

CASE STUDIES

Deep Ports


Summary by Brian Dewey

A company faced with the challenge of porting an existing UNIX-based application to Windows NT has a choice of two strategies: a shallow port that preserves most of the application's UNIX flavor, or a deep port that involves rewriting key parts of the application to fit into the Windows NT model. This panel discussed the implications of the two strategies. Shallow porting is the easier route; deep porting offers opportunities to optimize application performance.

Stephe Walli of Softway ­ a company that markets a tool that assists in shallow-porting UNIX applications ­ presented the argument in favor of shallow porting: it's easy. The shallow port of the Apache Web server, for instance, took an afternoon. Deep ports require an investment of both money and time, so there must be some payoff in either functionality or performance to make that investment worthwhile. Walli pointed out that most companies don't port their products to NT to gain functionality, which leaves performance as the primary motivation to undertake a deep port. However, only the most resource-intensive applications have much to gain from code tweaking; developers of most products would be better off taking the easy route.

For those who need to squeeze performance out of their applications, Ramu Sunkara of Oracle and Steve Hagan of Top End described their experiences deep porting database applications. Steve gave the encouraging news that 80% of the code will be a "nice port" and require very little programmer involvement. For the 20% of performance-critical code, the two offered the following advice:

  • Avoid the C Runtime Library. The Win32 API is incredibly rich and will provide most of the tools you need to get your job done. Use the API directly; performing the equivalent CRT (C runtime) calls merely adds an additional layer of code. (See the sketch after this list.)
  • Use threads and fibers to maximum advantage. Windows NT is a multithreaded operating system, and the addition of fibers ­ lightweight, application-scheduled threads ­ gives the developer a high degree of control over CPU usage. Ideally, there will be exactly one active thread for every CPU on the system; this will provide maximum CPU usage without the overhead of context switches. Achieving this goal is a tough challenge for the developer. Also, use the thread and fiber APIs directly instead of a thread package; as with the C Runtime Library, there is no reason to add an additional layer of code.
  • Take advantage of the flexibility of the I/O subsystem. The NT I/O system gives immense power and flexibility to developers. For those fluent in the API, it is easy to bypass the file cache, perform asynchronous I/O, and hook your own code deep into the filesystem.
  • Take advantage of the NT security model. This is an area that straddles performance and functionality; the NT security model is incredibly rich. By using it directly, you not only eliminate a middle layer of software in your application, you also gain additional flexibility.
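
As a small illustration of the first point (a sketch under my own assumptions, not code from the panelists): fopen() and fread() ultimately sit on top of the Win32 file API, so a deep port can call CreateFile() and ReadFile() directly and pass flags (a sequential-scan hint here; unbuffered or overlapped I/O in more aggressive ports) that the C runtime does not expose.

    #include <windows.h>

    int main(void)
    {
        /* Deep-port style: go straight to Win32 instead of fopen()/fread().
         * The file name is just a placeholder for this sketch. */
        HANDLE h = CreateFile("data.bin", GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING,
                              FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        char buf[4096];
        DWORD got;

        if (h == INVALID_HANDLE_VALUE)
            return 1;

        while (ReadFile(h, buf, sizeof buf, &got, NULL) && got > 0) {
            /* ...process got bytes... */
        }

        CloseHandle(h);
        return 0;
    }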

BOFS

Blind Geese (Porting BoF)


Summary by Brian Dewey

Intel Corporation coordinated a birds-of-a-feather session about porting UNIX applications to Windows NT while maintaining good performance. Inspired by the name "birds-of-a-feather," they opened this BOF session by talking about geese. A flock of geese (they're a gaggle only when they're on the ground) flies 70% farther by cooperating in their V formation. The lead goose guides the flock and does a large amount of the work; a trailing goose, flying in the slipstream of another bird, experiences less wind resistance. When the lead goose tires, another takes over.

Likewise, Intel said, researchers would be more productive if they shared their experiences. This session provided a forum for that: an opportunity for members of the community to describe the problems they've had with NT performance and get help from fellow members. After a slow start, the audience warmed up and started raising many issues about NT performance. However, Intel's analogy, although cute, was too optimistic for such a young community. There was no "lead goose" for this BOF; nobody had the experience to give definitive answers for most of the questions raised.

Perhaps more problematically, NT was new territory for most of the audience, and this affected the quality of the questions. Only one person raised an issue and had numbers to quantify it: when doing a performance analysis of WinSock2, he noticed that using certain message sizes caused a dramatic reduction in throughput. He could find no pattern to predict what message sizes would be bad and wanted to know if anyone else had witnessed this behavior. (Nobody had.) Most of the other performance issues were based on vague impressions. Typical statements were "I noticed the file cache doesn't perform well when accessing large network files" and "My NT Server runs unexpectedly slowly when executing multiple interactive sessions." Without quantification, however, these issues are difficult to address.

In spite of all this, the BOF session had the right idea. People will encounter problems as they move to NT. The more opportunity to discuss these problems with other members of the research community, the smoother the process will be. I suspect that, in a year's time, members of the community will have gained the experience to ask good questions, give good answers, and make this BOF live up to its image.

Windows NT Futures


Summary by Brian Dewey

Frank Artale, Microsoft's director of NT program management, and Felipe Cabrera, architect in charge of NT storage systems at Microsoft, ended the workshop with a question-and-answer session about the future directions of NT. The session was informal. Artale's brief opening remarks seemed to be the only prepared part of the presentation. A single slide listing important topics in NT's future ­ storage, I/O, multiprocessors, clusters, memory, and management ­ dominated the projection screen for almost the entire talk. Once Artale put that slide up, he opened the floor to questions. And the questions poured in ­ so many, in fact, that an hour into the session, Artale and Cabrera were still fielding questions on the first topic.

So what information did the two panelists reveal about disk storage? First, they acknowledged that NT 4.0 has some deficiencies ­ for instance, if you ever try to run the chkdsk utility on a volume with thousands of files, you might find yourself waiting an uncomfortably long time (i.e., hours!). The NT File System (NTFS) fragments files more than most users would like. NT 5.0 will attempt to correct both of these problems: in the first case, by reducing the number of times a validity check is necessary and in the second by improving the disk allocation algorithm.

Second, Artale and Cabrera spilled a "technology piñata" upon the audience ­ a deluge of interesting but loosely related improvements to NTFS. Some of those improvements loosen constraints present in NT 4.0's NTFS. For instance, the new version will be able to support 2^48 files, each up to 2^48 bytes in size, on a volume.

Additionally, NTFS will be able to support sparse files ­ say, if you had a megabyte of data distributed over a terabyte file, NTFS would store just the megabyte on disk. Further, NTFS in NT 5.0 will have several new features. The most frequently cited is content indexing; this will allow a user or a Web server to search a hard drive more efficiently. NT 5.0 will also sport symbolic links and a change journal that tracks modifications made to files (an essential component of the Coda-like filesystem).
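
Sparse allocation is easy to picture in code. The fragment below uses the sparse-file interface that eventually shipped in Win32 (FSCTL_SET_SPARSE); it is my own illustration of the concept described at the session, not something demonstrated there. After the file is marked sparse, only the region actually written consumes disk space, even though the file's logical size is roughly a terabyte.

    #include <windows.h>
    #include <winioctl.h>

    int main(void)
    {
        HANDLE h = CreateFile("sparse.dat", GENERIC_WRITE, 0, NULL,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        DWORD bytes;
        LARGE_INTEGER offset;
        char data[4096] = "the real data would go here";

        if (h == INVALID_HANDLE_VALUE)
            return 1;

        /* Mark the file sparse: unwritten ranges consume no disk space. */
        DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &bytes, NULL);

        /* Seek one terabyte into the file and write a small block; NTFS
         * allocates space only for this block, not the hole before it. */
        offset.QuadPart = (LONGLONG)1024 * 1024 * 1024 * 1024;
        SetFilePointerEx(h, offset, NULL, FILE_BEGIN);
        WriteFile(h, data, sizeof data, &bytes, NULL);

        CloseHandle(h);
        return 0;
    }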

A few thoughts struck me as I watched this question-and-answer session unfold. First, the audience seemed more interested in Windows NT itself than in any research being done on Windows NT. What else could explain the flood of questions about the unglamorous topic of disk storage? I don't think this phenomenon resulted from Windows NT merely being the common denominator of a diverse crowd.

To give my own biased evidence: I wasn't bored for one instant during this presentation, even though it came at the end of three days of diligent notetaking. While much of the credit for this goes to Artale and Cabrera's engaging style, I think the primary reason is that Windows NT is interesting in its own right.

Second, this presentation emphasized how there aren't many NT gurus outside of Microsoft. Sure, there are plenty of people trained in Windows NT, but the training seems to amount to knowing which menu option to check to achieve a desired result. Not many people know exactly what happens once that menu is clicked.

The UNIX world lives by a vastly different standard; not only is source code easily available, but there is also a large array of books that explain, in detail, the inner workings of the UNIX operating system. Without a steady supply of information, members of the research community are forced to wait for opportunities like this session to pose their questions to the insiders.

