The following paper was originally published in the
Proceedings of the USENIX Summer 1993 Technical Conference
Cincinnati, Ohio, June 21-25, 1993

For more information about USENIX Association contact:

1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

The Ferret Document Browser

Howard P. Katseff
Thomas B. London
AT&T Bell Laboratories
Holmdel, NJ 07733

Abstract

The Ferret Document Browser is a vehicle for exploring the design and use of document storage and retrieval systems. Its distributed, modular structure allows independent information providers to control their data, yet make use of a common access and billing control facility. Document images are distributed via a nationwide AT&T corporate internet which consists mainly of Ethernet networks interconnected by leased data circuits. The relatively low bandwidth of this network is dealt with by compressing the documents for transmission, and by decompressing pages as requested on the workstation. A page image can be decompressed and displayed in less than a half second.

A broadband version of the system makes use of the BBFS broadband file server, the HPC interconnect, the LuckyNet broadband network and the Liaison network multimedia workstation. This system allows document browsing at rates up to 15 page images per second.

1. Introduction

The Ferret Browser is a vehicle for exploring the design and use of network-based broadband, wideband, and narrowband image storage and retrieval systems.
It provides wide area distribution of documents and images via a nationwide AT&T corporate IP network which consists of Ethernet and Frame Relay networks interconnected by gateways and leased data circuits. The relatively low bandwidth of this network is dealt with by compressing the images for transmission, and decompressing them as requested on the workstation. On a typical workstation, an image can be decompressed and displayed in less than a half second.

The browser has been integrated with the AT&T Information Services Network's LINUS database system which provides document search and selection services and ensures that documents are viewed only as authorized.

Ferret's modular and distributed design permits the collaborative support of multiple information providers. In particular, the image databases are located at different AT&T locations and maintained by different organizations, and the Ferret servers may be accessed from database systems other than LINUS.

The Ferret viewing software has been widely distributed to the AT&T R&D Community and is currently being used by nearly 1000 people. It runs on a wide variety of UNIX workstations using X windows, OpenLook, or Motif. It supports a wide range of industry standard document and image formats and currently provides access to several image databases, including 21,000 AT&T Bell Laboratories technical memoranda and 9000 photographs from the AT&T archives.

We also describe an experimental version of Ferret that gives far better performance: it can display pages at a rate of 15 per second via a broadband network.

2. Widely Accessible Browser

Since 1989, the AT&T Information Services Network has been scanning internal technical memoranda at 400 dpi (dots per inch) and storing the images on write-once optical disks[1]. Approximately 21,000 documents are currently stored, and about 40 new documents are scanned each day.
Requests for copies are fulfilled by printing the document on a 400 dpi printer at a central site and sending it out via company mail. While this system is a large improvement over its predecessor, where a clerk located the original document in a filing cabinet and made a xerographic copy, there is still a several day delay before the requested copy is received. Requests for documents written before 1989 are still processed manually.

BBFS is a distributed broadband filesystem research effort[2][3] to support data-intensive applications, such as HDTV video. It is able to meet real-time constraints and stream data continuously at broadband rates. BBFS depends on distributed and parallel computing to provide the communications and processing needed to support resource-intensive applications. The current prototype system has 36 disks, providing more than 40 Gbytes of storage.

When the Ferret project was started, approximately 10,000 documents had already been scanned and stored on the Information Services Network's optical disks. This data was transferred to the BBFS file server via eight millimeter tape. Newly scanned documents are sent nightly via a local area network. The document images are stored as multi-page TIFF format files[4] using CCITT Group 4 Facsimile-compatible compression[5]. Without compression, our collection of 21,000 documents would require 1200 Gbytes, far exceeding the current capacity of the BBFS file server. With compression, only 30 Gbytes are needed.

In addition to storing images of documents at the original 400 dpi encoding, we also resample the images and store a 100 dpi version for display on a workstation. This is the lowest resolution readily visible on our workstations. Conveniently, the printed part of the document fits on the 1152x900 pixel screen of a Sun workstation at this resolution.

Photographs from the AT&T Archives are being scanned as time permits.
Each is scanned in grayscale with a depth of 8 bits and a resolution of 62 dpi. The current TIFF software makes use of a Lempel-Ziv compression scheme that compresses a typical 8x10 photograph to 200,000 bytes.

The images for the photos database are stored on a different Ferret server than the technical memoranda database. This Ferret server runs on a Sun workstation using ordinary magnetic SCSI disks. The image data currently requires 3 Gbytes of storage and is stored on 2 external disks.

AT&T employees have access to the LINUS (Library Network User Service) system. It provides access to many online databases, including the technical memorandum and photograph databases. For the technical memoranda, the citations in this database include authors, document numbers, keywords, and complete abstracts. Its Slimmer information retrieval system allows the database to be searched in a variety of ways. Once a document is chosen, the user issues a Slimmer command to view its image. In response, Slimmer determines which Ferret server provides documents for this database and sends it a message over the corporate network. A window for viewing the document is then created on the workstation.

The file containing the images of the entire document, compressed at 100 dpi, is transmitted from the BBFS file server to a temporary file on the user's workstation. The transmission time depends on the length of the document and the speed of the communications line. Over the local Ethernet, a 25 page document requires less than 2 seconds to be transferred. When traversing a 64 kbit/sec interlocation link, the same document takes a minute to be sent.

To provide the illusion of instantaneous access, the browsing window is created while the file is being transferred. The window is usually displayed before the transfer of the compressed image is complete. While the document is being sent to the workstation, the Ferret browser allows the pages that have already been transferred to be viewed.
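
The transfer times quoted above can be checked with a rough calculation. The following Python sketch is only a back-of-the-envelope estimate, not a measurement: it assumes about 15 kbytes per compressed 100 dpi page (the average figure this paper gives for a typical document), a 10 Mbit/sec Ethernet, and no protocol overhead or contention. The function name `transfer_seconds` and the constants are hypothetical and chosen for illustration.

```python
# Back-of-the-envelope transfer times for a compressed 100 dpi document.
# Assumptions (not measurements): ~15 kbytes per compressed page, a
# 10 Mbit/sec Ethernet, and no protocol overhead on either link.

def transfer_seconds(pages: int, bytes_per_page: int, link_bits_per_sec: float) -> float:
    """Idealized time to send `pages` compressed pages over a link."""
    total_bits = pages * bytes_per_page * 8
    return total_bits / link_bits_per_sec

PAGES = 25
BYTES_PER_PAGE = 15_000        # assumed average compressed page size
ETHERNET_BPS = 10_000_000      # assumed 10 Mbit/sec local Ethernet
LEASED_LINE_BPS = 64_000       # 64 kbit/sec interlocation link

ethernet_s = transfer_seconds(PAGES, BYTES_PER_PAGE, ETHERNET_BPS)
leased_s = transfer_seconds(PAGES, BYTES_PER_PAGE, LEASED_LINE_BPS)
first_page_s = transfer_seconds(1, BYTES_PER_PAGE, LEASED_LINE_BPS)

print(f"25 pages over Ethernet:    {ethernet_s:.2f} s")
print(f"25 pages over 64 kbit/sec: {leased_s:.1f} s")
print(f"first page over 64 kbit/sec: {first_page_s:.1f} s")
```

Under these assumptions the whole document needs well under 2 seconds on the Ethernet and a little under a minute on the 64 kbit/sec link, while a single page needs only about 2 seconds on the slow link. That last figure suggests why showing pages as they arrive is worthwhile.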
Because the document is stored by ascending page number, the first page is usually transferred by the time the window is created. This allows the browser to display the first page of the document without delay. As long as the user does not immediately try to read a page at the end of the document, it gives the appearance that the file has been transferred in a few seconds.

As shown in Figure 1, Ferret has a simple user interface. The main feature is a large window (850x1100 pixels) which contains a page of the document encoded at 100 dpi. To its left is a slider bar that indicates the current page in the document, both by the height of the slider and by a small number just above the slider. The mouse buttons are used to traverse the document. The right button goes to the next page and the left button goes to the previous page. The middle button is used to move to an arbitrary page, indicated by the vertical position of the mouse pointer relative to the slider. If the right mouse button is clicked while the control key is held down, the document is continuously scanned in the forward direction. Similarly, the left mouse button with the control key held down scans continuously backwards through the document. The window is destroyed by clicking the middle mouse button while the control key is held down.

The browsing software makes use of the compressed file that was copied to the workstation's local filesystem and decompresses each page as requested for display by the user. This technique is feasible because modern workstations can quickly decompress Group 4 Facsimile encoded files. For instance, a Sun IPX workstation takes less than a third of a second to decompress and display a single page. Further, the compressed file for an average document is small, about 15 kbytes to represent each page, or 300 kbytes for a 20 page document, so it is likely to reside in the file system cache, and no disk accesses are required.

3. Slow Ferret

We can also serve users who are not connected to the corporate network or do not have a terminal with windows or bitmap display capabilities, but do have access to a FAX machine. Documents are rescaled to 200 dpi for transmission via facsimile "fine" mode. The grayscale images of the photographs are rescaled to this size and dithered before transmission via a FAX modem, consuming several minutes of computer time. Most photographs look surprisingly good when printed on the FAX.

4. High Performance Browser

The high performance version of the Ferret document browser is designed to explore the feasibility of browsing through documents at high rates, up to 30 page images per second, providing the electronic analogue of flipping pages in a book. It makes use of the HPC local area multicomputer system[6]. The current HPC/VORX configuration provides communications and distributed processing with 80 Motorola 68020 single board computers and 10 Sun hosts with a bandwidth of 113 Mbit/sec to each network node.

Long distance broadband communication is provided by the LuckyNet[7] system. LuckyNet currently provides connections between three AT&T Bell Laboratories sites: Holmdel, Crawford Hill, and Murray Hill, with a total bandwidth of 452 Mbit/sec. The 5 km distance between Holmdel and Crawford Hill is spanned with a multi-fiber cable mounted on telephone poles and the 37 km link from Crawford Hill to Murray Hill is provided by line-of-sight super-high frequency (SHF) radio. The HPC switch is distributed among the three sites. Its VORX operating system provides seamless computing and communications among these locations.

Documents are displayed on the Liaison networked multimedia workstation[8]. The prototype Liaison workstation is able to simultaneously display several windows with 30 frame per second video, each arriving from a different processor via the HPC.
Its display is based on a Synergy Microsystems PEGC video board with a 1280x1024 pixel frame buffer connected to the local bus of its 33 MHz Motorola 68020-based single board computer.

The decompression software described previously is too slow to be used to decompress page images on the fly. Two feasible solutions to this problem are to use parallel processing to speed the decompression, and to decompress the document once and store its bitmap in a cache. Our first experiments have been with the latter approach.

Because the storage required by the images of a complete document may exceed the amount of local memory in our processors, we are forced to cache the document on disk. For speed, we make use of disk striping[9], a technique that allows the parallel operation of several disks.

When a document is chosen for viewing, the entire document is obtained from the BBFS file server and decompressed into the bitmap page images. The bitmap images are sent to a striped file on the BBFS file server. The file is striped across several disks round-robin, by scan line, to allow for parallel access when the file is read. When the user requests a page to be displayed, the disks send data simultaneously to the Liaison workstation via the HPC interconnect. The current configuration uses four disks for parallel access and allows the document to be displayed at speeds up to 15 page images per second.

The user interface is similar to that of the widely accessible browser. The major addition is a slider bar on the right side that allows the images to be paged forwards and backwards at various rates. Surprisingly, individual images can still be discerned at the rate of 15 pages per second.

5. Implementation Details

The widely accessible browser is implemented by several computers on the corporate internet, as shown in Figure 2. The LINUS system runs on a network of Sun workstations.
It performs authentication and allows a user to search numerous databases interactively via Slimmer from home or office. It can be accessed via the rlogin[10] command from a workstation on the corporate internet.

The Ferret server for the technical memoranda database runs on a Sun 4/370 workstation, ferret, that serves as a gateway between the internet and the HPC/LuckyNet interconnect. Ferret accepts TCP connections from workstations on the internet and services requests from the workstations to access the document images. Note that the broadband characteristics of BBFS are not needed for this server. We use BBFS because of its large storage capacity.

The Ferret browser must be run on a workstation running the X window system[11] or one of its variants. The browser is started by running the linus program from the shell in a window. Linus opens the server end of a TCP connection that will ultimately connect to the LINUS system and starts two processes. One process waits for this TCP connection to be established and the other executes rlogin to connect to the LINUS system. After logging in to LINUS, the user can access many databases. Currently, the internal document database allows access to the bitmap images. The view command has been added to LINUS for a user to initiate the viewing of a document.

The view command opens a connection from LINUS to the waiting process on the workstation and sends a message with information on how to access the document. That process opens a TCP connection to ferret and uses the information obtained from LINUS to request the document image. It then copies the image (which is compressed in G4 fax format) from ferret to a temporary file on the workstation. While the file is being copied, yet another process is created. This browser process creates a large window on the workstation and acts as the user interface for reading the document image.
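
The overlap between the file copy and the browser process can be illustrated with a small simulation. The Python sketch below is not the Ferret implementation: it uses a thread in place of the separate copying process, invents the helper names `receive` and `read_page`, and assumes the byte offset and length of each page are known in advance (in a real multi-page TIFF file they would have to be found from the file's directory structure). It shows only the general idea of a reader that waits until the bytes of a requested page have reached the temporary file.

```python
import os
import tempfile
import threading
import time

def receive(path, chunks, delay=0.01):
    """Stand-in for the network copy: append chunks to the temporary file."""
    with open(path, "ab") as f:
        for chunk in chunks:
            time.sleep(delay)      # simulated network latency
            f.write(chunk)
            f.flush()              # make the new bytes visible to the reader

def read_page(path, offset, length, poll=0.005, timeout=5.0):
    """Return one page's compressed bytes, waiting until they have arrived."""
    deadline = time.monotonic() + timeout
    while os.path.getsize(path) < offset + length:
        if time.monotonic() > deadline:
            raise TimeoutError("page has not arrived yet")
        time.sleep(poll)
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Three hypothetical "pages" of compressed data, stored in ascending order.
pages = [b"page-one" * 4, b"page-two" * 6, b"page-three" * 5]
offsets = [0, len(pages[0]), len(pages[0]) + len(pages[1])]

fd, tmp_path = tempfile.mkstemp()
os.close(fd)
sender = threading.Thread(target=receive, args=(tmp_path, pages))
sender.start()

first = read_page(tmp_path, offsets[0], len(pages[0]))  # arrives early
last = read_page(tmp_path, offsets[2], len(pages[2]))   # waits for the tail
sender.join()
os.remove(tmp_path)
```

Because pages are stored in ascending order, a request for the first page returns almost immediately, while a request for the last page blocks until the copy has nearly finished.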
The browser process reads the temporary file and decompresses pages for display as they are requested. To allow it to run concurrently with the file transfer from ferret, the browser is able to defer the display of a page until it appears in the temporary file.

The implementation of the high-performance browser is similar. The major difference is that, instead of copying the compressed image over the TCP connection, ferret decompresses the file with the output directed to a four-way striped file on BBFS. While it is decompressing, the Ferret browser is started on the Liaison workstation. Figure 3 shows how the striped file is sent from BBFS to the workstation. Each square box in the diagram indicates a processor obtained from a pool of free processors. The processors labeled diskfs run a program that communicates with the X program controlling the user interface. Each of the diskfs processors responds to a request to display a new page by copying its portion of the image to the vfilter program. The vfilter is a standard part of the Liaison workstation and is responsible for the positioning, clipping, and synchronization of images destined for the workstation[8][12]. In particular, it assures that the transmission of video to the workstation is synchronized with the monitor refresh.

6. Detail Mode

The Ferret browser normally displays monochrome images sampled at a resolution of 100 dpi. While text in these images is legible, the low resolution makes small fonts hard to read. Sampling the data at a finer resolution and directly displaying it on the screen results in an image much larger than the workstation screen, making it necessary to move the image left and right to read each line of text. This approach was rejected because it is cumbersome to use.

To provide a more detailed display, the browser makes use of the ability of most color and gray-scale workstations to display each pixel on the screen in varying intensities of gray.
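
One way to picture this gray-scale display is as a block-averaging step that reduces a 400 dpi one-bit image to a 100 dpi image with several gray levels. The Python sketch below is a minimal illustration under stated assumptions, not Ferret's code: the function name `to_gray`, the list-of-rows bitmap representation, the convention that 1 means white, and the 0-255 output range are all chosen here for clarity.

```python
def to_gray(bitmap, block=4):
    """Map each block x block square of 0/1 pixels to one gray level (0-255).

    `bitmap` is a list of rows; 1 means white and 0 means black (an assumed
    convention for this sketch). Both dimensions must be multiples of `block`.
    """
    height, width = len(bitmap), len(bitmap[0])
    full = block * block
    gray = []
    for top in range(0, height, block):
        row = []
        for left in range(0, width, block):
            white = sum(
                bitmap[y][x]
                for y in range(top, top + block)
                for x in range(left, left + block)
            )
            # Intensity is proportional to the number of white pixels.
            row.append(white * 255 // full)
        gray.append(row)
    return gray

# An 8x8 test bitmap: all-white left half, checkerboard right half.
bitmap = [[1] * 4 + [(x + y) % 2 for x in range(4)] for y in range(8)]
gray = to_gray(bitmap)
```

With a 4 by 4 block there are 17 possible counts of white pixels, so each output pixel takes one of 17 gray levels; edges of characters that fall partway across a block come out as intermediate grays rather than being forced to black or white.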
To present a detailed image of a document, the browser retrieves the 400 dpi image from the file server. It converts it to a 100 dpi image by mapping each four by four square of pixels in the 400 dpi image to a single pixel in the 100 dpi image. The intensity of the single pixel is made proportional to the number of white pixels in the corresponding square of the 400 dpi image.

This technique produces an image more legible and aesthetically more pleasing than the normal 100 dpi monochrome image. The drawback is that it takes several seconds to process and display a high-detail page, as opposed to a half second for a monochrome page. More processing is needed because the 400 dpi image takes longer to transfer and decompress than the 100 dpi image and because Ferret needs to describe the image to the X window system using one byte per pixel instead of one bit per pixel.

Because of these significantly longer latencies, the browser normally displays images in monochrome. However, the detailed image of a page may be requested at any time. There is also a "detail mode" in which, whenever a new page is requested, it is first displayed in monochrome and then its detailed image is processed and displayed.

7. Printing

Because it is sometimes useful to have a printed copy of a document, Ferret includes a command that sends a copy of the document to a local printer. Most of the available printers make use of the PostScript language. However, the PostScript printers that we tried were far too slow to print 400 dpi full-page images, taking between 5 and 15 minutes to read the image of a single page from the Ethernet, process it, and print it.

Our printing facility makes use of the Sun SPARCprinter, a low-cost laser printer that connects to the internal bus of a Sparc workstation. The printer does no image processing. Instead, it accepts bitmap page images from the workstation and prints them.
The manufacturer's intent is to run page description software such as PostScript on the host and to ship its output to the printer. We circumvent this processing by decompressing our 400 dpi page images and sending them directly to the printer. This permits us to print our documents at eleven pages per minute, the rated speed of the printer.

8. PostScript Documents

Many word processing systems produce output in the PostScript page description language[13]. Many laser printers understand this language and can print documents in PostScript. PostScript documents can be displayed on a workstation running the X window system with a PostScript viewing program. For documents in PostScript, the Ferret system stores the PostScript version of the document. PostScript has some advantages. The PostScript version is shorter, requiring less disk space and less time to transmit to the user's workstation. Also, a single PostScript representation suffices for printing on devices with different resolutions and capabilities such as color.

Unfortunately, PostScript is not ubiquitous. Not all workstations have vendor-supplied PostScript viewing programs. A public domain viewer is available, but it is slow and has a limited selection of fonts. Further, PostScript output is not always portable. In addition to the font problem mentioned above, some documents do not display correctly with some viewing programs, presumably due to bugs either in the document or the viewer.

Because of these problems, when a PostScript document is entered into Ferret, the document is rendered into a compressed 100 dpi bitmap image that is stored on the file server in addition to the PostScript version. Each user can configure Ferret to specify a PostScript viewing program. If a viewer is specified, the PostScript document is sent to the workstation for display by the viewing program. Otherwise, the compressed bitmap image is sent to the workstation for display.

9. Conclusions

User reaction to the Ferret System has been positive. In April 1992, a survey of 140 Ferret users was conducted[14]. We initially thought that Ferret would be mainly used as a screening mechanism to find documents of interest. Instead, we find that over 95% of the survey respondents read the documents from their computer screens, instead of requesting printed documents. People remarked that it was faster and more efficient to read documents from their screens, and that they no longer felt a need to keep their own paper copies. We were also pleased to note that while most people accessed Ferret infrequently, on average once a week, most found the system to be easy to use.

Our design decision to provide speed instead of beauty was the correct one. The 1/3 second delay between clicking the mouse and seeing the next page is barely perceptible, making the browsing process more comfortable and natural. As evidenced by the user survey, the 100 dpi resolution seems to be adequate for reading documents from the display. However, as technology improves, this is one of the first things that we would upgrade.

Document browsing is coming of age. We know of other efforts to provide document images to users via data networks, including projects at the U. S. Patent Office and Carnegie-Mellon University. At AT&T, the RightPages system[15] provides alerting and browsing for journal articles. The most significant difference between Ferret and these systems is that with Ferret, the information retrieval system is decoupled from the image viewing system. This makes it easy to provide a variety of front ends to access the same images. For instance, within AT&T, the technical community could make use of a sophisticated keyword based system, but managers, who may be less computer-literate and less familiar with the details of the subject matter, may have trouble using such a system.
Instead, they could make use of a point-and-click graphical system which helps guide them to the information they need.

Another advantage of the decoupling is that different Ferret image databases may reside in different locations. We have observed that many information providers prefer to actually own the equipment that stores their images, and have it physically close by.

The Ferret Document Browser has become both an interesting research tool and a useful service. The high performance version allows us to experiment with the implications of presenting document pages at high rates and the widely accessible version allows the AT&T R&D community electronic access to a useful set of documents.

Future plans include allowing for the use of color images and higher resolution displays. Applications like catalog shopping and newspaper viewing will benefit from these enhancements. Faster networks, interfaces, and decompression software will be necessary to maintain reasonable performance.

10. Acknowledgements

We would like to thank Bob Gaglianello who suggested that the Ferret browser could be made widely available. The "back of the envelope" calculations performed in these discussions showed that the AT&T R&D Internet is fast enough to provide reasonable response time. Other people also provided invaluable help. Bill Austin was our primary liaison with the Information Services Network. Beth Robinson and Bruce Hillyer provided an early version of the BBFS file server for our use. Robert Waldstein integrated our software with the production LINUS system. Jan Wolitzky, Carlos Cruz, and Bill Boehm helped transfer the initial set of 10,000 documents to BBFS. Henry Shen helped port the software to an Intel 386-based workstation.

REFERENCES

1. Austin, W. E., and Wolitzky, J. I., "Electronic Document Management System for AT&T Proprietary Technical Information," INFORM Magazine, June, 1991.

2. Hillyer, B. K., and B. Robinson, "Aspects of the BBFS Broadband Filesystem," Proc. First International Conference on Parallel and Distributed Information Systems, Miami Beach, FL, December 1991.

3. Hillyer, B. K., and B. Robinson, "Communications Issues in BBFS, a Broadband Distributed Filesystem," Proc. Globecom '91, Phoenix, AZ, December 1991.

4. "Tag Image File Format Specification-Revision 5.0," Aldus Corporation, August 8, 1988.

5. Hunter, Roy, and A. Harry Robinson, "International Digital Facsimile Coding Standards," Proc IEEE 68,7, July, 1980, 854-867.

6. Gaglianello, R. D., et al., "HPC/VORX: A Local Area Multicomputer System," Proc Ninth Internat Conf on Distr Comput Sys, June 1989, Newport Beach, 542-549.

7. Gitlin, R. D., et al., "LuckyNet: An Overview," Proc. Globecom '91, December 1991, Phoenix, 1055-1064.

8. Katseff, H. P., et al., "Experiences with the Liaison Network Multimedia Workstation," Proc USENIX Symp on Experiences with Distr and Multiproc Syst, Atlanta, March 1991, 341-350.

9. Kim, M. Y., "Parallel Operation of Magnetic Storage Devices: Synchronized Disk Interleaving," Fourth Internat Wkshp on Datab Mach, D. J. DeWitt and H. Boral, eds., Springer Verlag, New York, 1985, 300-330.

10. UNIX Programmer's Manual, 4.2 Berkeley System Distribution, Vol. 1, Computer Science Division, University of California, Berkeley, August 1983.

11. Scheifler, R. W., and Gettys, J., "The X Window System," ACM Trans on Graphics 5,2, April 1986, 79-109.

12. Katseff, H. P., and R. D. Gaglianello, "On the Synchronization and Display of Multiple Full-Motion Video Streams," Proc IEEE TriComm '91: Communic for Distrib Applicat and Syst, Chapel Hill, April 1991, 3-9.

13. "PostScript Reference Manual," Adobe Systems Inc., Addison-Wesley, 1985.

14. Austin, W. E., and Lunas, L., personal communication.

15. Story, G. A., et al., "The RightPages Image-Based Electronic Library for Alerting and Browsing," IEEE Computer, Sept 1992, 17-26.
Biographical Information

Howard Katseff received his B.S. Degree in Computer Science at Cornell University in 1974. He did graduate work in theoretical computer science at the University of California, Berkeley and received his Ph.D. in 1978. Since then, he has worked at AT&T Bell Laboratories in Holmdel, New Jersey. He wrote the UNIX debugger sdb and worked on the design and implementation of multiprocessor computer systems. He is now investigating applications for broadband networks. He may be reached at hpk@research.att.com.

Thomas London received his B.A. Degree in Mathematics from the University of Pennsylvania in 1972. He received his M.S. degree in 1974 and his Ph.D. degree in 1976 from Cornell University studying aspects of security and protection in computer systems. Working in Computer Science research at AT&T Bell Laboratories in Holmdel, NJ, since 1976, he has conducted research in operating systems, multiprocessor programming and systems, and communications intensive services and systems. He may be reached at tbl@research.att.com.