The following paper was originally published in the Proceedings of the USENIX Conference on Object-Oriented Technologies (COOTS), Monterey, California, June 1995.

Media-Independent Interfaces in a Media-Dependent World

Ken Arnold
Ken.Arnold@east.sun.com
Sun Microsystems Labs
2 Elizabeth Dr.
Chelmsford, MA 01824

Kee Hinckley
nazgul@utopia.com
Utopia, Inc.
25 Forest Circle
Winchester, MA 01890

Eric Shienbrood
ers@wildfire.com
Wildfire Communications
20 Maguire Rd.
Lexington, MA 02173

Abstract

Wildfire is a communications assistant that uses speech recognition to work over phone lines. At least, that's what it is today. But in the future it wants to run on desktops, PDAs (like the Newton Message Pad), and who knows what all. To provide a level of media independence, we designed a subsystem to isolate the communications knowledge of the assistant from the mechanisms of prompt/response. This layer is called the MMUI. It provides abstractions of input and output that let the assistant ask questions and get responses without knowledge of the specifics of the communication channels involved. The specifics of speech recognition, as well as the degree of abstraction desired, make this an interesting case of a presentation/semantic split using object polymorphism. This presentation will cover the design of the MMUI, its fundamental weaknesses, and furious handwaving over future directions to mend them.

1. Introduction

The Wildfire communications assistant is designed to use computer analysis and assistance to enhance communication, both with other Wildfire users and with the outside world. To do this, the interface is critical: it must be natural and easy to use, engaging without wasting your time, and so on. There is nothing in any reasonable requirement list that says it must work only over voice interaction, and in fact, future expansion pretty much demands that it be more flexible than that, for example operating with tty lines for the deaf, text-based two-way pagers, and eventually pen-based PDAs and GUI-based desktops.

However, much of the value added by the system has nothing in particular to do with the presentation of the interface. The primary value is provided using knowledge of how and when to get in touch with a person, who has called and when you need to call them back, who is currently on hold and who is important, and how to weave these facts into more effective assistance to the user.

The concept of a presentation/semantic split in application design is well established [1,2,3,4,5], so it was obvious that a presentation layer needed to be provided for developing the Wildfire assistant. This layer, however, has several additional requirements:

* It needs to be able to handle completely linear interactions. A voice interaction is like a conversation, in which the assistant asks questions, waits for a response, and then either gives feedback or asks another question.

* Voice interactions have a different quality than GUI interactions do. For example, because of the nature of speech recognition, a choice from a menu is not a single selected item, but a list of probability-ordered possible responses.

* In any single interaction, user input can come from a variety of sources. For example, you can either speak your responses, or you can use touchtone shortcuts.
* Speech recognition systems can require training, resulting in a significant collection of user-specific data associated with the general menu.

So beyond the normal presentation/semantic considerations, there was a need to interact with the user independently of the specific media through which the interaction was taking place. Presentation could be through recorded voice, text-to-speech, plain or internationalized text, or a GUI presentation mixing images and text. Input could be recorded voice, recognized voice, text, images (such as faxes), and so on.

The independence of the need for interaction from the media through which the interaction takes place, and the great value represented by the underlying communications enhancement independent of the interface, made it attractive to create an abstraction capable of isolating the assistant code from the particular media through which the user was interacting. We also added the following requirements:

* It must be easy to add new types of media. It is impossible to predict who will make the next advancement in price or functionality in speech recognition, or to predict the winner in the two-way pager marketplace. Adding new kinds of media should require relatively minimal work that is completely hidden from the assistant.

* It should be possible to select particular media based on other attributes. While providing a layer of abstraction for the variability of media, adding a generic variability mechanism to select for, say, the desired prompt verbosity seemed like a small addition for a large gain.

2. The MMUI

The abstraction we designed is called the MMUI (MultiMedia User Interface). It lets the assistant interact with the user through a very detached abstraction using units of meaning. (For those of you who want to follow along with pictures, Figure 1 shows the class hierarchy for most of the classes described in this paper.)

The basic meaning abstraction is the Meme. When the assistant wants to say "Hello", it doesn't care if the presentation is in text, voice, or video, or even what language is used. What matters is that the user is presented with a representation of the concept of "Hello" that is meaningful to them.

Each representation of the concept "Hello" does, in the end, have to be presentable in some way via a specific representation understood by a particular device, such as an audio board or an ASCII stream. Specific presentable data are represented by Media objects. The Media class is an abstract base class from which specific media representation classes are derived. The "Hello" meme would thus contain several Media objects of various types, such as a TextMedia object containing the string "Hello" and/or an AudioMedia object that describes a recording of someone saying "Hello".

Thus, Meme is a concrete class that has a set of media that all represent the same concept through specific presentation possibilities. When new kinds of media are added to the system, the Meme class does not change, and hence the assistant need not change either.

Note that, of the two requirements on the media within a meme (that it be presentable and equivalent in meaning), the MMUI is concerned only with presentability, not meaning. It assumes that some human took responsibility for properly correlating the meme's contents to the meaning it is intended to represent.

All this is sufficient for output, but output is the simplest part of the system.
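To make the containment concrete, here is a minimal C++ sketch of a Meme holding several Media-derived representations of "Hello". Only the class names Meme, Media, TextMedia, and AudioMedia come from the paper; every member, the pathname, and the smart-pointer plumbing are illustrative assumptions, not the Wildfire declarations.

    #include <memory>
    #include <string>
    #include <vector>

    // Abstract base for anything a specific device can present.
    // (Sketch only: the real Media class also derives from the
    // Attributed class described later in this paper.)
    class Media {
    public:
        virtual ~Media() {}
    };

    // A textual representation of a concept.
    class TextMedia : public Media {
    public:
        explicit TextMedia(std::string text) : text_(std::move(text)) {}
        const std::string& text() const { return text_; }
    private:
        std::string text_;
    };

    // A recorded-audio representation, referenced by file path.
    class AudioMedia : public Media {
    public:
        explicit AudioMedia(std::string path) : path_(std::move(path)) {}
        const std::string& path() const { return path_; }
    private:
        std::string path_;
    };

    // A unit of meaning: a set of equivalent, presentable representations.
    class Meme {
    public:
        void add(std::unique_ptr<Media> m) { media_.push_back(std::move(m)); }
        const std::vector<std::unique_ptr<Media>>& media() const { return media_; }
    private:
        std::vector<std::unique_ptr<Media>> media_;
    };

    // The "Hello" meme from the text: one textual and one recorded form.
    // The pathname is, of course, made up.
    Meme makeHelloMeme() {
        Meme hello;
        hello.add(std::make_unique<TextMedia>("Hello"));
        hello.add(std::make_unique<AudioMedia>("/prompts/hello.ulaw"));
        return hello;
    }

A port that understands text would pick out the TextMedia, an audio port the AudioMedia; neither the Meme nor the assistant needs to know which was used.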
Interpreting user choices gets more complicated, but builds on the same basic concept of grouping identical conceptual meanings that require different specific interactions into a single conceptual unit.

So we extend the basic Meme concept with a derived class that represents a set of choices. This is called a Menu. It contains an ordered list of rows, each of which has a meme to represent its presentation and program-supplied data to be returned if it is selected.

In order for a menu to be presented, and the user input interpreted, some expert code is required that understands how to use the data in the menu to recognize. This expert is called a Mogul. A TextMogul would understand how to present the text media from each meme to the user as, say, a numbered list and let the user type in the number of the choice they wanted. A voice-recognition Mogul will, as we shall see, be more complicated.

Presentation and recognition are done through an abstraction called an ActiveGadget. This represents an active one- or two-way channel between a process (the assistant) and a gadget (like a telephone) that is currently active, i.e., able to be interacted with (interacting with a telephone that does not have a person using it, and is thus not "active", is beyond the scope of this paper). Memes are presented on the channel, and recognition is done through it.

ActiveGadget and the base Media class are both derived from a base Attributed class that allows setting arbitrary name/value pairs on an object. The attributes on the ActiveGadget are used to store the output filtering preferences. The attributes on a Media object describe the attributes that particular media have. Matching between these values allows further refining of the list of Media that can be presented, once irrelevant Media (like voice media on a tty line) have been weeded out. This could be used to select verbose vs. non-verbose prompts, male vs. female speakers, etc. More on this later.

Before we talk about how this is translated into classes, objects, and other code-related realities, we need a quick overview of the Wildfire architecture.

3. Wildfire Architecture

For the purposes of this paper, we will describe a simplified view of the Wildfire architecture from an MMUI-centric point of view. Basically, the Wildfire system is broken into a set of applications (of which the most visible is the assistant) that run on a specialized kernel called the Wildfire OS (WFOS for short). This exports functionality for communicating via ActiveGadgets to devices that support particular capabilities (such as speech recognition or recording).

The assistant gets an ActiveGadget by handing a Gadget description to the WFOS and getting back an ActiveGadget object that represents a channel of communication to that gadget. The assistant then requests capabilities of that ActiveGadget, which (if the addition is successful) will ensure that a port is attached to support that capability.

There are three major kinds of ports, all derived from an abstract Port class: InputPort for recording user input, OutputPort for presenting media, and RecogPort for ports that turn input into recognized selections. Currently, a channel that represents normal communication with an assistant over a telephone line has the following ports attached to it:

* An InputPort that can record sound, such as a person leaving a message.

* An OutputPort to play pre-recorded sounds, such as system prompts.
* A speech recognition port that is derived from both OutputPort and RecogPort, so that it can coordinate playing the sounds with setting up the recognition.

* A RecogPort to recognize touchtone selections.

As another example, a channel communicating with a user across a file descriptor currently has the following ports:

* An InputPort that records ASCII text.

* An OutputPort that prints ASCII text.

* A RecogPort that lets the user select an entry from a text menu.

4. And Now, Back to the MMUI

Now we can discuss in greater detail what the MMUI actually does.

First of all, let's examine what's in a Meme. A Meme has a set of Media-derived objects, each of which describes a representation of a single concept. Different media objects might be of the same C++ type, but still be logically distinct due to attribute differences. For example, the "Call Whom?" Meme might have all of the following Media objects:

* A TextMedia object with the string "Call Whom?"

* A TextMedia object with the attribute "Length=Tutorial" and the string "Call Whom? Please say the name of one of your contacts."

* An NMSAudio object with a pathname to a recording of a person saying "Call Whom?"

* An NMSAudio object with the attribute "Length=Tutorial" and a pathname to a recording of a person saying "Call Whom? Please say the name of one of your contacts."

There might be other recognized values for the Length attribute, such as "Brief", for which the representation might be "Whom?" or (for the audio) nothing at all. There might be other attributes, such as "Speaker=Female/Male" to allow the user to select a female or male voice for the prompts. For prompts that include the gender of a person (like "She's not here"), that could be encoded in a "Subject=Male/Female" attribute.

There is a set of system memes, which are simply well-known memes within a particular name space. These are referenced through MemeID objects that use the name of a meme to get a pointer to the actual meme from a table of system memes. These memes are not special in any other way beyond being entered in this name space. Many memes are created on the fly, such as user names, contact names, and message bodies.

4.1. Presentation

Now let us examine how the actual presentation of a Meme works. We will work from the example

    ag << WfCallWhomP;
    res = ag.prompt_response(WfOutCallM);

Memes are buffered until an explicit flush or a request for input. Thus, the prompt_response() invocation causes the WfCallWhomP prompt meme to be presented. Here is how that happens:

1. The WfCallWhomP MemeID is translated into a Meme reference.

2. ActiveGadget's operator<< method sends the Meme list (of one Meme in this case) to the WFOS to present through the channel.

3. The channel asks each attached OutputPort to present the Memes that have Media which that OutputPort understands.

4. Each OutputPort searches the Meme for Media objects of types that it understands. If it finds more than one such object, it does an attribute match to find the one whose attributes best match those set in the ActiveGadget. If there is a tie for best match, one of the best is picked randomly.

5. When no more Memes are left to present, processing is complete.

When reducing the mass of possible Media down to the one that will be presented, the primary filtering is, of course, on presentable Media. For many Memes this may be enough. Attribute matching only matters if more than one presentable Media exists in that Meme.
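The attribute match in step 4 uses a deliberately simple scoring rule, spelled out in detail below: for each attribute in the qualifying list, a presentable Media scores 2 if it matches, 1 if it does not mention the attribute at all, and 0 if it conflicts, and the highest total wins. A minimal stand-alone sketch of that rule, treating attribute sets as plain name/value string maps (the function names and types here are ours, not the MMUI's):

    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    using AttrSet = std::map<std::string, std::string>;

    // Score one candidate's attributes against the qualifying list:
    // 2 per matching attribute, 1 if the candidate is silent about it,
    // 0 on a direct conflict.
    int attributeScore(const AttrSet& qualifying, const AttrSet& candidate) {
        int total = 0;
        for (const auto& q : qualifying) {
            auto it = candidate.find(q.first);
            if (it == candidate.end())
                total += 1;                 // neither right nor wrong
            else if (it->second == q.second)
                total += 2;                 // match
            // else: conflict, worth 0
        }
        return total;
    }

    // Index of the best-scoring candidate. (The real system breaks ties
    // randomly among the best; here the first best simply wins.)
    std::size_t bestMatch(const AttrSet& qualifying,
                          const std::vector<AttrSet>& candidates) {
        std::size_t best = 0;
        int bestScore = -1;
        for (std::size_t i = 0; i < candidates.size(); ++i) {
            int s = attributeScore(qualifying, candidates[i]);
            if (s > bestScore) { bestScore = s; best = i; }
        }
        return best;
    }

With a qualifying list of {Length=Tutorial}, the tutorial TextMedia from the "Call Whom?" example above would score 2 and the plain one 1, so the tutorial prompt would be chosen.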
Attribute matching does not take adjacent Memes into account to smooth matching across a list of Memes (mostly because we've never found a use for it).

Ports find Media-derived objects that they understand using a runtime typing system. The NMSOutputPort will try to find Media that are at-least-a SimpleAudio media, a derived type that describes basic mu-law audio. The polymorphic behavior of Media and the various Port classes gives the MMUI great adaptive power. We will describe this in detail below, when we discuss how one would go about adding a new type of interaction to the system.

Also note that there is no one-to-one requirement between Ports and specific Media types. A Port may handle more than one kind of Media, and any number of Ports may understand a specific Media type. Since Ports are attached to channels because of requests for capabilities, it can be quite useful if a Port handles more than one kind of Media, since it might reduce the number of Ports required to support a given capability.

Attribute matching is done in a very simplistic way. First, we take the list of attributes meant to qualify the presentation, i.e., the attribute set from the ActiveGadget overlaid with any set overriding the current presentation. (This can be done with an option object analogous to the iostream manipulators.) For each attribute specified in the qualifying list, each presentable Media object is assigned a score: 2 for a match, 0 for a conflict, and 1 for a non-conflict (in other words, the attribute in the qualifying list is absent from the Media, and hence is neither set correctly nor incorrectly). The presentable Media object with the highest total score wins. This can mean that an object that directly conflicts with the qualifying list is presented, on the theory that the wrong output is better than no output at all. Wrong output tends to lead to complaints, which can often lead to getting the problem fixed. Silence tends to merely baffle.

4.2. Recognition

Recognition is somewhat more complex than presentation. This is partly because more has to go on in a two-way interaction than in a one-way data dump, but voice menus also have an attribute rarely found in other menu systems: probability. When someone selects the third entry in a menu, that's what they've selected. When a speech recognition system tries to determine what you've selected, it can only return a list of probable selections. Its matching is only the best it can do. (The details of translating the actual output from the speech recognition algorithm into a probability is an interesting problem, but it is not in the scope of this paper; see [9,10,11].)

This means that the output of a menu selection in a media-independent interaction is a list of candidate selections with some probability assigned to each. If the highest probability falls below some threshold, then the recognition must be considered a failure and handled appropriately. We notify the user in increasingly verbose ways that we didn't understand them, and then we finally give up on the whole command. (The issues surrounding the human factors of deciding how to handle these cases are another interesting, yet uncovered, topic; see [6,7,8].)

The menus can usually handle this kind of reprompting given the appropriate options (such as what to say when reprompting, and what level of probability is acceptable). There are, though, instances where the selection must be returned to a higher level.
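The reprompting behavior just described is, at heart, a bounded retry loop driven by a confidence threshold. A self-contained sketch of that shape (the names and the callback signature are ours; in the real system the equivalent knobs are supplied as options to the menu and to prompt_response()):

    #include <functional>

    // One recognition attempt at a given reprompt level (0 = the normal
    // prompt, higher levels = increasingly verbose "I didn't understand"
    // prompts). It reports the confidence of its best candidate, in [0, 1].
    using RecognitionAttempt = std::function<double(int repromptLevel)>;

    // Retry until the confidence clears the acceptance threshold, or give
    // up on the whole command and let a higher level deal with it.
    bool recognizeWithReprompts(const RecognitionAttempt& attempt,
                                double minConfidence,
                                int maxAttempts) {
        for (int level = 0; level < maxAttempts; ++level) {
            if (attempt(level) >= minConfidence)
                return true;    // good enough: accept this recognition
        }
        return false;           // failure: hand the problem back up
    }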
Let's walk through the recognition request:

1. WfOutCallM is translated into a Menu reference. (There are system menus just as there are system memes, and the lookup is handled analogously.)

2. The menu, along with any specified options (none are shown here), is sent to the WFOS.

3. The WFOS waits for any asynchronous output request to finish.

4. It then asks each OutputPort to prompt/recognize on the pending output. This allows recognition ports to coordinate the ending of the prompt and the start of recognition.

5. The first RecogPort to say it is complete wins, i.e., its recognition result is the result of the overall recognition (no averaging or other inter-port data merge is done).

6. The responses are sent back to the MMUI, which asks the winning Mogul to translate the results into a canonical form: a list of MenuPick objects ranked by probability. (A MenuPick object holds the probability for that pick, an index into the Menu, and a pointer to the assistant-specified data associated with that row of the menu.)

7. This list of MenuPicks is returned as the result, along with an overall confidence in the recognition, and (if appropriate) a failure status to distinguish failure due to timeout from failure due to unrecognizable noise.

If the recognition fails, step 6 instead consists of "bonking" the user, with any other corrections necessary, all as specified by options provided to the menus via the ActiveGadget (which maintains the current default option settings) or by specific overrides, which can be given as an optional parameter to prompt_response().

Step 4 is where Moguls come into play. Speech recognition systems typically need a block of data that represents the entire menu of choices. This describes (in some device-dependent way) the differentiating aspects of the legal things a person can say. A common term for this is the vocabulary. When you want to recognize words from a particular vocabulary, you must download the data into the device, and only then can you start recognition. This means that a simple list of the possible choices is insufficient to represent a Menu on all input systems. While a list of Memes containing TextMedia would be sufficient to build a text or GUI menu on the fly, vocabulary building is a time-consuming process that offloads overhead from the recognition phase to a vocabulary-building phase that must precede it. The Mogul was introduced to represent any overall menu-related information required by a RecogPort, such as vocabularies. We will deal more with how vocabularies are built below.

The coordination in step 4 is needed because, on some speech recognizers, if you do not coordinate the input and output, you can get very bad effects. Starting to recognize a person's voice while sound is still being played over the phone can lead to recognizing your own prompts as commands. On the other side, the prompt may have finished, but the recognition port may not yet be ready, i.e., the vocabulary may not have finished being downloaded to the device. Wildfire plays a little blip sound when it is ready to start recognition. This lets you know that, should the prompt finish too soon, you still need to wait. Otherwise people would start talking before the recognition window began, and Wildfire would start trying to understand them midway through their utterance. So the blip is coordinated with the output so that it does not play until the recognizing hardware is prepared.
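The canonical result described in steps 6 and 7 can be pictured as two small structures. This is only a sketch of the shape the text describes, with field names of our own choosing rather than the actual MMUI declarations:

    #include <vector>

    // One candidate interpretation of what the user selected.
    struct MenuPick {
        double probability;   // how likely the recognizer considers this pick
        int    row;           // index of the row in the Menu recognized against
        void*  clientData;    // the assistant-supplied data stored in that row
    };

    // What a recognition request hands back to the assistant.
    struct RecogResult {
        enum Status { Ok, Timeout, Unrecognized };

        std::vector<MenuPick> picks;       // ranked, most probable first
        double                confidence;  // overall confidence in the recognition
        Status                status;      // why it failed, if it did
    };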
We do not attempt to average results from multiple ports because we cannot see any meaningful way to do this in general. It is hard even to imagine that this could be usefully attempted. Imagine that the user had, when asked whom to call, said "Georgina Whit" and, before the requisite pause to signal the end of speaking, pushed the touchtone for "Never Mind" (the cancel command). How would one average such input?

One could imagine cases where combining various inputs could increase the correctness of the recognition. On a video phone, for example, being able to match lip movements against what the speech recognition thought was said might help discriminate between possibilities. (I did say one could "imagine" such a thing.) Doing this requires sophisticated interactions between the data available from multiple sources and, logically, within the MMUI it belongs in a single RecogPort that examines and correlates the relevant data. Neither the WFOS nor the MMUI could possibly broker this interaction in a general, abstract way applicable to other types of ports balancing multiple inputs.

4.3. Training

There are speech recognition systems that work on "speaker-independent" recognition. This means that any person should be recognized without having to train the system on how they personally speak. Although this is ideal in theory, even in the best current systems there are people with heavy accents or speech disabilities who cannot successfully be recognized.

For this and other reasons, Wildfire (and hence, the MMUI) must support user-specific training for vocabularies. A large subsection of the MMUI is devoted to training, and the requirements of speech recognition training constrained parts of the design.

The MMUI supports a training call that sorts the user's menus based on which could use the most training, and presents them to the user one at a time for training. To train a single menu, the user is asked to say each word. Each mogul then uses that data to update its information. Currently, only the speech recognition mogul, VPCMogul, uses this step, but it is critical to its operation. The user's training for each word is added incrementally to the vocabulary so it can take effect immediately after the training is finished. At a later time, a batch process notices that there are new trainings for a menu, and it rebuilds its vocabularies in a more compact, and more effective, form.

If you had, in some way, a recording of the user saying a word, you could add it without interacting with the user, but the vocabulary rebuilding work would still be required, and that relies on access to the speech recognition device. This means that it is not possible to simply add a new meme to a menu, or to modify its contents by adding a new AudioMedia, to change the recognition for that menu entry. Such a change requires active intervention by some code that understands how the device works. This is the Mogul's job. When a new row is added to a menu, each associated mogul is "introduced" to the new row's meme. This also happens when a row's meme is replaced; there is no mechanism for changing the contents of a meme in a menu except by wholesale replacement. It would clearly be possible to design a method to do so, but we have not yet needed to. And, obviously, when a row is deleted, all the menu's moguls are notified of that, too.

Currently, whenever a menu is created, all Mogul types are created and attached.
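To summarize the obligations this places on a mogul, here is a sketch of the kind of interface the abstract Mogul class might impose. The paper states only that moguls are told about row additions, replacements, and deletions, fold user training into their data, and canonicalize raw results into ranked MenuPicks; the method names and signatures below are illustrative assumptions, not the real declarations.

    #include <vector>

    class Meme;   // a set of equivalent Media, as earlier
    class Media;  // one presentable representation, as earlier

    // As sketched earlier: one ranked candidate selection.
    struct MenuPick {
        double probability;
        int    row;
        void*  clientData;
    };

    // Illustrative shape of the abstract Mogul interface.
    class Mogul {
    public:
        virtual ~Mogul() {}

        // A row's meme was added or wholesale-replaced; rebuild or patch
        // any menu-wide data (for a speech mogul, the vocabulary).
        virtual void rowIntroduced(int row, const Meme& meme) = 0;

        // A row was deleted from the menu.
        virtual void rowDeleted(int row) = 0;

        // Fold one user-supplied training utterance for a row into the
        // mogul's data, so it takes effect immediately.
        virtual void train(int row, const Media& utterance) = 0;

        // Turn device-dependent recognition output into the canonical,
        // probability-ranked form returned to the assistant.
        virtual std::vector<MenuPick> canonicalize(const void* rawResults) = 0;
    };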
There is a design, not yet implemented, to attach moguls more dynamically based on menu content, but this wholesale approach has not yet proved to be a problem.

5. Extending the MMUI

One of the important design centers for the MMUI was the ability to extend the system by adding new kinds of interactions. There were several reasons for this requirement:

* Replacement hardware is constantly becoming available. If a new audio-playing board comes out that is much cheaper to use, it should be easy to change the system to use it, thus allowing a quick reduction in the price of shipping systems.

* New capabilities are coming quickly, too. The sophistication of speech recognition systems is increasing rapidly, and the MMUI should not be a bottleneck when deciding how quickly we can use better solutions.

* Completely new interactions should be quick to add. As two-way pagers become lighter and more widespread, we should be able to add them to the full system (for text-based ones) or for specific interactions (such as saying who is calling and asking if you'd like to take the call). The list of other possibilities is as long as your wired imagination can make it. Again, the MMUI side of this should not be the controlling factor in the schedule.

So we designed a system in which the code that must be added is isolated under five primary abstract base classes: Media, Mogul, Port, Gadget, and Channel.

* Media: If presentation will be done on the new device, a new Media type will probably be needed to describe presentable data.

* Mogul: If recognition will be done on the new device, a new Mogul type will probably be needed to manage that complexity.

* Port: New input, output, and/or recognition ports will be required that understand how to work within the WFOS to drive the device.

* Gadget: Adding a new device may require adding a new Gadget to describe an address for the device.

* Channel: A channel understands how to establish and terminate connections to Gadgets (e.g., how to dial the phone and hang up), and coordinates between multiple ports on a particular kind of gadget.

We have only briefly touched on the Gadget class. Gadget is an abstract class whose derived classes contain addresses that can be used to connect to specific targets, establishing a channel for an ActiveGadget. The details of that handshake are not terribly interesting here, but any device we talk to must be contacted at an address, and hence must have a Gadget-derived type that contains that address. For a PhoneGadget the address is a phone number; for a NetworkGadget it is an internet address and port number that will be used for the network connection.

We have not discussed Channels in any detail either, but they are rather straightforward. They are created to carry media to destinations specified by particular Gadget types. If the new device does not require a new Gadget type (e.g., it operates at a network address, for which Wildfire already has the NetworkChannel type), it will not require a new Channel. If it does, the new Channel will have to be able to resolve the address described in the Gadget and juggle the communication needs of that type of Gadget. Like a Port, a new Channel is not very complicated beyond whatever is required by the device that connects to the machine. The NetworkChannel is trivial, since sockets are easy to manage. The PhoneChannel is more complicated because telephony has more intermediate and failure states.
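The Gadget side of this is small enough to sketch directly. The paper specifies only that Gadget is an abstract address holder, that a PhoneGadget holds a phone number, and that a NetworkGadget holds an internet address and port; the members and accessors below are illustrative, not Wildfire's.

    #include <cstdint>
    #include <string>

    // Abstract address of something a Channel can connect to.
    class Gadget {
    public:
        virtual ~Gadget() {}
    };

    // A telephone target: the address is a phone number.
    class PhoneGadget : public Gadget {
    public:
        explicit PhoneGadget(std::string number) : number_(std::move(number)) {}
        const std::string& number() const { return number_; }
    private:
        std::string number_;
    };

    // A network target: the address is a host plus a port number.
    class NetworkGadget : public Gadget {
    public:
        NetworkGadget(std::string host, std::uint16_t port)
            : host_(std::move(host)), port_(port) {}
        const std::string& host() const { return host_; }
        std::uint16_t      port() const { return port_; }
    private:
        std::string   host_;
        std::uint16_t port_;
    };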
Not all new systems will require all of these. Besides the obvious fact that output-only systems will not require recognition Moguls, new devices that work with phone lines for addresses would not require a new Gadget or Channel type. This is likely to be common; a fax system would need a new Media and Port type, but no new Gadget would be required.

To give a flavor of how we would add a new device to the MMUI, we will describe how to do this for a putative video-based system.

It is quite likely that the new video system would be reachable either on the local network via an internet address, or over the phone like a video-conferencing system. Both of these Gadget and Channel types already exist in the Wildfire system, so we will just piggyback on them. If a new kind of address were required (some video systems, for example, require two phone lines to handle the volume of data), a new Gadget type would have to provide a way to hold such an address. A new Channel type that recognized that address and knew how to establish a connection to the addressed video system would also have to be added. Since this is not the primary topic of the paper, we will skip the details of this mechanism, but suffice it to say that it is not very complicated, except for whatever complication the video system itself may impose.

For this one needs a VideoMedia class, derived from the abstract Media class. The Media class doesn't require much of its derived classes; it mostly (being derived from the Attributed class) ensures that Media can be attributed. Almost all functionality is added in the specialized classes. The VideoMedia would presumably store a pathname of a file that contained the media, and probably a start and end frame within the file.

Providing input and output ports for the video system is again relatively straightforward. New classes would be derived from the InputPort and OutputPort classes, overriding the pure virtual record() and present() methods, respectively. The record() method's main job would be to dump the video into an appropriate place and create a VideoMedia object that described it. The present() method would paw through the Media in the list of pending output, looking for VideoMedia objects. If it found one, it would do the same for the next meme in the list, continuing on until it reached either the end of the meme list or a meme that didn't contain any VideoMedia. It would then peel off the memes it could reasonably present and, for each one, find the best attribute-matched VideoMedia and present it.

As it stands, we could use the system described so far only for input and output, but not for recognition. We could take a video message, or play a video clip, but we couldn't ask any questions. Let us presume that this video system has an attached keyboard for answering questions (since I suspect a sign-language gesture recognition system is currently a tad beyond even the limits of handwaveware).

It is possible even here that we can avoid any hard work, since there already is a text recognition port, and if the video system understands simple ASCII I/O, a TextRecogPort would fulfill our engineering needs. It would not, however, fill our pedantic needs, so we will assume that this is not so. So we need to add a VideoRecogPort and a VideoMogul.
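Before turning to those, here is a sketch of the output side just described: a VideoMedia naming a file and a frame range, and the scan-ahead logic a video OutputPort's present() might use. The structures, names, and the playClip() stub are our own illustration of the behavior described above, not Wildfire code; attribute matching is reduced to simply taking the first candidate.

    #include <iostream>
    #include <string>
    #include <vector>

    // The putative VideoMedia: a file plus the frame range to play.
    struct VideoMedia {
        std::string path;
        long        startFrame;
        long        endFrame;
    };

    // Stand-in for the device-specific playback call.
    void playClip(const VideoMedia& clip) {
        std::cout << "play " << clip.path << " frames "
                  << clip.startFrame << ".." << clip.endFrame << "\n";
    }

    // For this sketch, a pending meme is reduced to whatever VideoMedia
    // candidates it happens to contain (possibly none).
    using PendingMeme = std::vector<VideoMedia>;

    // The present() logic described in the text: walk the pending memes in
    // order, stop at the first one with nothing we can show, and present
    // the best candidate from each of the ones before it.
    void presentVideo(const std::vector<PendingMeme>& pending) {
        for (const PendingMeme& meme : pending) {
            if (meme.empty())
                break;                 // no VideoMedia here; stop presenting
            playClip(meme.front());    // stand-in for the attribute match
        }
    }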
The Mogul class requires its derived classes to handle notification of changes in the Memes of the Menu, and to be able to canonicalize the list of MenuPicks into a form presentable to the application. The VideoMogul would be interested in the TextMedia of the menu so it could create a menu on demand (any Mogul can look at any Media to do its work). It might also be interested in any VideoMedia that presented the choices in video form, should the user want some help understanding the available choices. The VideoRecogPort would present the list of choices as text and allow the user to select one in some way.

Notice that this does not preclude doing simultaneous speech recognition on the video system's microphones. Just as touchtones and voice can coexist on a single ActiveGadget, so can voice and the video system's text selection mechanism (and touchtones too, if desired). Whichever got an answer first would govern any particular recognition, but each recognition could be responded to in any available way.

6. Current State

The current state of the MMUI allows a linear presentation style with quite a lot of separation between the internals of the code and the particular media presentations required to present, record, or recognize in a particular case. However, as currently implemented, the MMUI still has major weaknesses if one is to consider it as a general interface abstraction.

First, it has no concept of "sentence". The presentation of Memes is linear in nature, with the order specified by the assistant. This is one problem (of several) that makes porting the assistant to a different language difficult. The natural or allowed order of presentation could be quite different.

There is also no concept of "dialog". Most actions require more than one piece of data. Again, different languages may impose a different expected or required order on gathering the data. For example, in English it is natural to say "Call Gordwina at work", but another language may prefer "Call the workplace of Gordwina". Further, a particular piece of data may affect the valid values of another; Gordwina may or may not have a work phone.

There is also the issue of context. In a conversation, much data can be left out, inferred from the context. Overall context (for example, the subject of discourse) is easily built into the assistant interactions, since it is obvious and shared: we are talking about calling people, for instance, by the nature of the dialog, and so the dialog designer can make the interface understand that context. But specific context is harder to provide. It would be nice to use words like "it" or "them". But without a notion of the type of legal referent, and the history of previous interactions that might state or imply a referent of that type, such words can only be used in a highly constrained way.

These problems limit the use of the MMUI to linearly presented interfaces. If one wanted, for example, to add a desktop GUI interface, one could only do so as a series of single-question prompt-response dialogs. Graphical icons could be included as part of the dialog description media, but it would not be possible to pop up a full "Place A Call" dialog that let one specify both whom to call and where to call them.

Designs exist to address these problems by adding classes and protocols to represent each, but as yet no prototyping has been done to prove them actually useful in the cauldron of the real world.

On the other hand, the Wildfire system is currently available, and is built using this infrastructure.
It is quite possible to interact with the assistant using intermixed voice and touchtone commands, to listen to data recorded in different formats (and hence having different Media objects representing them), and to use attributes to select particular media (tutorial vs. standard prompts). As an experiment, we added a new channel type able to talk across a text-based two-way pager. With no modification to the assistant, it was possible to see the prompts and respond via the pager's keyboard just as one did when communicating via a NetworkGadget to a local terminal emulator.

The MMUI has proven to be a useful tool in isolating many details of the interface presentation from the duties of the assistant. It has proven its adaptability to different flavors of linear presentation. It is easy to add new types of presentation and recognition, and it should be possible to extend the MMUI to provide greater isolation to the presenter. These benefits make it a useful abstraction in designing media-independent interfaces that must be presented in a media-dependent world.

Acknowledgments

Tony Lovell, Vinnie Shelton, Keith Gabryelski, Rich Miner, Greg Cockroft, and Dave Pelland contributed significantly to the design and implementation of the ideas presented here. Bill Warner created the concept of the Wildfire Assistant, which motivated this design.

References

[1] Eugene Ciccarelli, "Presentation Based User Interfaces", Thesis, MIT AI Lab, Technical Report AI-TR-794, 1984.

[2] Pedro Szekely, "Modular Implementations of Presentations", Proceedings SIGCHI+GI 1987, pp. 253-240.

[3] Pedro Szekely, "Separating the User Interface from the Functionality of Application Programs", Thesis, CMU, 1988.

[4] Scott McKay, William York, Michael McMahon, "A Presentation Manager Based on Application Semantics", Proceedings SIGGRAPH Symposium on User Interface Software and Technology 1989, pp. 141-148.

[5] H. Rex Hartson, Deborah Hix, "Human-Computer Interface Development: Concepts and Systems", ACM Computing Surveys, 21:1, pp. 5-92, 1989.

[6] Candace Kamm, "User Interfaces for Voice Applications", Voice Communication Between Humans and Machines, National Academy Press, Washington, D.C., 1994.

[7] Eric Ly, Chris Schmandt, "Chatter: A Conversational Learning Speech Interface", AAAI Spring Symposium on Intelligent Multi-Media Multi-Modal Systems, Stanford, CA, March 1994.

[8] Nicole Yankelovich, Gina-Anne Levow, Matt Marx, "Designing SpeechActs: Issues in Speech User Interfaces", Proceedings, SIGCHI '95 Conference on Human Factors in Computing Systems.

[9] Gordon E. Pelton, Voice Processing, McGraw-Hill, 1993.

[10] L. R. Rabiner, B. H. Juang, Digital Processing of Speech Signals, Prentice-Hall, 1978.

[11] Kai-Fu Lee, Automatic Speech Recognition, Kluwer Academic Publishers, 1989.

--------------------------------------------------
FIGURE 1. MMUI Class Hierarchy

Meme
    Menu
Attributed
    Media
        TextMedia
        NMSAudio
        *VideoMedia*
    ActiveGadget
Mogul
    TextMogul
    VPCMogul
    *VideoMogul*
Port
    InputPort
        NMSInputPort
        *VideoInputPort*
    OutputPort
        NMSOutputPort
        *VideoOutputPort*
    RecogPort
        VPCPort
        *VideoRecogPort*
Gadget
    PhoneGadget
    NetworkGadget
Channel
    PhoneChannel
    NetworkChannel

(Classes marked with asterisks are the hypothetical video classes of Section 5.)
--------------------------------------------------