The following paper was originally published in the Proceedings of the Tenth USENIX System Administration Conference, Chicago, IL, USA, Sept. 29 - Oct. 4, 1996.

For more information about USENIX Association contact:

1. Phone: (510) 528-8649
2. FAX: (510) 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

Using Visualization in System and Network Administration

Doug Hughes - Auburn University

ABSTRACT

Unix systems have numerous tools that generate copious amounts of data about performance, security, process status, networking and sundry other things; this usually results in the output of raw numbers or text. Visualization of this data can lead to useful insights. Examples include: graphing performance data, customizing tools to make a complex operation easier to understand, and correlating log events according to user-defined rules or patterns. Using visualization in daily activity can help the brain recognize normal and aberrant behavior, make complex tasks easier, and improve workplace efficiency.

The goal of this paper is to detail how simple rapid-prototyping GUI tools can be used to quickly develop applications to visualize data. This paper also describes the application of visualization to areas of system and network administration. I will attempt to illustrate my own results and experiences developing visualization tools in Tcl/Tk [1] (with various extensions). The design and implementation of four such tools will be discussed and used as examples of visualization applied toward system and network administration.
Some have surpassed their original design criteria and provided unanticipated side-effects that enhance their utility.

Introduction

Though the precise definition of visualization is often debated, it is customarily understood as a way to express information in two or three dimensions. The advantage of visualization is that it takes raw data and manipulates it such that recognizable patterns begin to emerge, or presents it such that a certain task becomes easier to understand or more efficient. Visualization can also be used for prediction and intuitive troubleshooting, once sufficient data has been collected. Through the use of rapid-prototyping GUI tools like Tcl/Tk, Perl [2]/Tk, Python, etc., one can develop simple visualization tools that can help in various system and network administration tasks.

By nature, visualization works best with a GUI. According to Jakob Nielsen [3], there are five aspects to the usability of a GUI: easy to learn, efficient, easy to remember, relatively error free or error forgiving, and pleasant to use. Because of the user-definable nature of the tools that I will be describing, and the target audience (myself, my co-workers, and others in my technical field), I will be concentrating on the first three aspects. This may cause the tools at times to be plain in appearance and unforgiving when used in unexpected ways, but future iterations and extensions will work to correct any deficiencies. All these tools have been designed in a language that makes individual tailoring quite easy.

Currently Auburn University College of Engineering uses several tools written in Tcl/Tk to indicate health and performance of networks and machines, to make certain tasks more manageable, and to troubleshoot when problems occur. The tools I will be focusing on include a security analysis tool, a server CPU pie-chart tool, a SPARCstorage Array disk visualization, placement, and optimization tool, and briefly, a hub analysis tool.
Motivation

These tools have all been designed to make our jobs easier and more productive. Without these tools, perusing the raw data streams would be a time consuming and frustrating process. They help us to always be aware of events as they happen. Being a part of a support organization, it's important to at least give the illusion to your users that you know about a problem before they report it. If only the phone would answer itself when a server went down...

tklogger

We use TCP Wrappers [4], klaxon, tocsin [5], and other customized daemons and programs to log security events to a centralized, secure, limited-access machine via syslog. Originally, all of our syslog data was simply posted on the console of a machine as it arrived, without any organization or correlation possible. This resulted in one of two undesirable situations. First, if the console window was on top, a portion of the screen space was being used; if nothing was happening, that screen space was wasted. Second, if the console was in the back, the information was largely lost behind another window.

Tklogger was developed as a real-time security analysis tool. The goal was to have a way to correlate high priority and low priority traffic based on syslog data files and regular expression matches. It was determined that events should also be displayed in multiple colors to facilitate their interpretation at a glance and without having to scrutinize the text (color event classification). Subsequently, a quick glance at the window could determine the security status of the hosts and the network. It was also decided that when important events occurred, the window would automatically come to the front. This way the administrator using the machine (myself) would not always have to keep a third eye on the console, nor would useful screen space be wasted. Other features such as search capability, scrollback, and easy configuration were added as needed.
It was originally constructed a few hours a day over the course of a week. Small modifications were made over the first few months, but the general functionality remained relatively unchanged.

cpupie

Cpupie was inspired by an object-oriented Tcl/Tk pie-chart extension called tkpiechart [6]. Out of its simple demo came the idea for representing the CPU states of our servers as pies. The already widespread practice of breaking up CPU time into four components (idle, wait, system, and user) made its use in this capacity immediately apparent. Thus, displaying the CPU trends of our servers was imagined (and since proven) to be a useful indicator of server health. Mr. Fontaine (the author of the pie-chart widget) and I collaborated on the features of tkpiechart to get it to its present state. Excluding that work, the initial design phase for cpupie was approximately 16 hours.

ssa

Ssa was born in an afternoon of frustration. We have two SPARCstorage Arrays [7] attached to two servers. At the time, each had 12 disks that were almost completely filled with approximately 14 file systems on RAID-5 volumes. Also at the time, three new disks had come in for each array to add more file system capability to our home directory servers.

Since the array model that we own is already divided into six SCSI controllers of up to five devices per controller (Figure 1), it is convenient to purchase disks in groups of six to maximize the advantages of a six-disk RAID-5 stripe. After we exhausted the capacity of our existing disks we were left in a bit of a conundrum: how do we take three disks, add them to the array, and arrange them such that no two stripes share a disk, busy stripes do not share a controller, and all disks are evenly used without overburdening the busy stripes?
(The astute reader may note that with three new disks it is probably impossible to avoid having at least one volume with two stripes on the same controller when using a six-disk stripe, given that all disks are to be evenly utilized.)

Without this tool, we would have had to write all configurations on a white-board in text, with volume sizes and layout, and then manually arrange things by erasing, moving, and updating all totals, without being able to see the final results until implementation. This approach was fraught with peril, was essentially brute-force, and might have taken us hours for each array. The ssa tool was written to speed up this process for this and all future instances of new disk arrival, removal, or failure. The first drag-and-drop, user-hostile version was made available in four hours. Since its creation, the time saved in calculating movements has more than made up for this design time.

[Figure 1: Array physical layout. The diagram shows controllers c0 through c5 across Tray 1, Tray 2, and Tray 3, with a legend distinguishing empty disks from disks in use.]

hphubwatch

Finally, hphubwatch was written at a time when several of our 30 networks were experiencing slowness problems. We required tools to monitor and analyze our HP hubs, but our budget was extremely tight. This tool uses SNMP [8] to gather information from HP AdvanceStack hubs and display it graphically in real time. It was written in a few hours and is fairly small and simple.

Program Implementation and Usage

tklogger

As already mentioned, tklogger separates high priority and low priority events. Its output is contained in two windows, one for each priority. An example window layout is in Figure 2. Normally, each text line would be displayed in the color associated with that event.
Each event is shown in a user-defined color so that its type can be recognized as it occurs without actually reading the corresponding text. In this way, a user can tell at a glance when something bad has happened or when an unusual pattern occurs. Further investigation may be warranted, particularly if an alarm color is displayed (anything matching red is high priority).

The idea for using colors to represent events occurred to me one day while experimenting with a Tcl/Tk text widget and seeing how easily it could be configured to display lines of text in multiple colors and/or fonts. The idea for using a Tcl/Tk text widget to display time sensitive logging information in colors blossomed out of that.

[Figure 2: tklogger]

Augmenting the visualization is a search capability that allows one to highlight matches in place or display them in a separate window. Menus are available for users to adjust scrollback capability, save events to a file, reload the configuration file, adjust the search options, and perform other actions. One can also pause the polling to examine an event more closely before resuming. When an alarm event occurs, the window immediately de-iconifies itself if necessary and rises above all other windows on the screen. A recent addition has allowed it to also perform other actions such as sending mail, paging, ringing the keyboard bell, or executing a user-defined function written as a Tcl procedure. We have tklogger running at all times monitoring and displaying various events.

With proper configuration, tklogger can be used to monitor many log files concurrently. Each file may be given a base priority for all records. These base priorities can then be overridden with regular expressions. A sample configuration file is shown in Figure 3. File directives specify the log file to poll.
Color directives give a priority (high or low - based on color choice) that is the base priority for events in that file. All events appended to the end of a file will have this base priority color. Files that do not have a color directive will be examined for match expressions. Regular or fixed expression match directives are used to override any applicable base priority, or to execute a command. Ignore directives are used to elide base priority information that may not be necessary (e.g., debugging information from sendmail).

-------------------------------------------------------------------------------
    file auth /var/log/authlog
    file daemon /var/log/daemon
    file local0 /var/log/local0.info
    file local1 /var/log/local1.note
    file local2 /var/log/local2.warn
    file local3 /var/log/local3.note
    file maillog /var/log/maillog
    color local0 forestgreen
    color local1 lightseagreen
    color local2 magenta
    color local3 red1
    color auth red2
    ignore NOQUEUE
    match dlam red1
    match mooneje {email page-doug {.cshrc accessed}}
    match help {playsound /home/ens/doug/sounds/chord.au orange}
    match {LOGIN FAILURE} mediumvioletred
    match (pgcntd|refused) red4
    match portwatcher red3
    match (vrfy|expn) violetred

Figure 3: tklogger rules
-------------------------------------------------------------------------------

I would be remiss if I did not compare the usage of tklogger with two other popular log analysis tools. Contool [9] provides a way to perform actions when certain messages appear on the console, but lacks in areas such as event grouping and searching capability. Swatch [10] is another extremely useful and extensible log analysis tool written in Perl, which gives it all the power of that language. However, swatch is meant to be run in the background and I was looking for something more visual. Both tools are meant to process one input source at a time.
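The rule semantics described above (a base color per log file, ignore patterns, and match patterns that override the base color) can be sketched in a few lines. The sketch below is in Python rather than Tcl/Tk and is not tklogger's actual code. The `Rules` class and `classify` function are hypothetical names, the rules are assumed to be already parsed from the configuration file, and two behaviors are assumptions: that a line with neither a base color nor a matching rule is not displayed, and that "anything matching red" means any color name containing the string "red".

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rules:
    """Hypothetical in-memory form of a tklogger-style rule set."""
    base_color: dict = field(default_factory=dict)  # log tag -> base color
    ignore: list = field(default_factory=list)      # patterns whose lines are dropped
    match: list = field(default_factory=list)       # (pattern, color); first match wins


def classify(rules, tag, line):
    """Return (color, priority) for a log line, or None if it is not shown."""
    # Ignore directives drop the line outright.
    if any(re.search(pattern, line) for pattern in rules.ignore):
        return None
    # Start from the base color of the file the line came from, if any.
    color = rules.base_color.get(tag)
    # A match directive overrides the base color.
    for pattern, match_color in rules.match:
        if re.search(pattern, line):
            color = match_color
            break
    if color is None:
        return None  # assumption: no base color and no match means not shown
    # Assumption: any color name containing "red" counts as high priority.
    priority = "high" if "red" in color else "low"
    return color, priority


rules = Rules(
    base_color={"local0": "forestgreen", "auth": "red2"},
    ignore=["NOQUEUE"],
    match=[("LOGIN FAILURE", "mediumvioletred"), ("(pgcntd|refused)", "red4")],
)

print(classify(rules, "local0", "in.telnetd: connect from host1"))
print(classify(rules, "local0", "in.ftpd: refused connect from host2"))
print(classify(rules, "maillog", "sendmail: NOQUEUE: debug"))
```

With these sample rules the first line keeps the base color of its file, the second is promoted to an alarm color by the `(pgcntd|refused)` rule, and the third is dropped by the ignore rule. Rules that run commands (such as the `email` and `playsound` entries in Figure 3) are left out of the sketch.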
[Figure 4: cpupie]

cpupie

To analyze our server CPU states, cpupie (see Figure 4) uses the rstat(3) portion of the scotty [11] Tcl/Tk extension. Since all Unix platforms of which I am aware support rstat(3), the operating system independent nature of this tool was immediately attractive. The scotty extension also gave us the added advantage of client/server sockets so that only one machine was required to do the polling. Any other machines wanting to view CPU status of any subset of the servers could connect to the master machine for updates via a TCP/IP socket.

The cpupie program was designed with several simple features that we have found useful. The most commonly used features are available via buttons along the top of the window. It takes advantage of the native PostScript generation capabilities of the Tk canvas widget to output the current CPU states to color or black and white PostScript printers at the touch of a key. It has the capability to average the CPU states of all the servers over a user-defined time interval. Finally, the polling interval and listening socket (for client/server operation) are user configurable. Cpupie is constantly being run by several people on two-headed workstations to monitor our servers.

ssa

Like cpupie, the ssa tool (Figure 5) is also implemented with a canvas widget for easily printing the current array layout. The most basic unit of operation on the canvas is a color-filled rectangle. The disks are large rectangles, and each stripe (subdisk) of a disk is a smaller color-filled rectangle. The stripes of a particular volume are all the same color across all disks. Mirrors of a volume are also the same color, but are stippled with a bitmap to indicate that they are mirrors and not the primary disk(s).
Log disks have the same stipple as the mirrors do, but are usually invisible because of their tiny size, so a feature was added to artificially increase their size to identify disk location.

In order to understand the use of the ssa tool, one must also understand some of the nomenclature specific to the Veritas software that drives the SPARCstorage Array. Each physical disk is mapped to a VM disk which is divided into up to 16 subdisks via a virtual table of contents (VTOC). Subdisks are then amalgamated into a unit called a plex. This can be a RAID-3, RAID-5, or striped plex (in fact the basic functionality of a plex is analogous to a stripe in the conventional sense). These plexes can then be mirrored, concatenated, or combined with a log into a volume. In our case a file system goes onto the volume, though some places use them for raw database partitions. The relationship between disks, subdisks and plexes is illustrated in Figure 6.

-------------------------------------------------------------------------------
    Physical disk    Subdisks of the corresponding VM disk
    c1t0d0s2         disk01-01  disk01-02  disk01-03
    c1t1d0s2         disk02-01  disk02-02  disk02-03
    c1t2d0s2         disk03-01  disk03-02  disk03-03

    Striped plex:    disk01-01  disk02-01  disk03-01

Figure 6: Volume Manager terminology
-------------------------------------------------------------------------------

[Figure 5: ssa]

For people who don't use the Veritas Volume Manager (VxVM) GUI, a button also displays the sequence of commands necessary to configure the array when it has been arranged as desired. This tool will not be valuable to people who do not have the Veritas software installed, but it is capable of running without an array.

The use of this tool is a bit more complex than the previously mentioned ones. Subdisks can be dragged from one disk onto another.
New empty disks can be created with the click of a button to simulate the addition of a disk to the array. Also, there is an undo function, a way to determine the size of a subdisk by clicking on it, and a way to determine the used and free space on a physical disk by clicking on it with mouse button 2. This tool is very VxVM-specific, but it has served us quite well by allowing us to fine-tune placement of file systems on our SPARCstorage Array and experiment with different configurations before implementation. It serves as a useful example of using visualization to simplify complex tasks.

hphubwatch

The hub watching tool is also implemented using scotty and, like tklogger, makes use of the color capabilities of the text widget. A portion of the window is shown in Figure 7. It currently assumes one machine is connected per port. The three columns following the MAC address correspond to administrative status, operational status, and media status respectively. The rest of the columns are explained in the legend of the figure. Information that has changed is highlighted in yellow. Information that is deemed significant is highlighted in red. This information includes giants, jabbers, alignment errors, and other events of this nature which might indicate problems. In the figure, the number of frames has changed on ports 5, 7, 8, and 9 (yellow) and the percentage of collisions relative to frames on ports 8 and 9 is greater than 0 (red). The polling interval is user specified and the highlighted regions change after each polling interval. It is a very simple tool with narrowly defined operating parameters. We use it to troubleshoot networks through their hubs when problems occur.

[Figure 7: hphubwatch]

Visualization Results

All of the tools have lived up to my expectations of their design. There have been serendipitous surprises as well.
The paragraphs below outline the expected results and describe how some of our expectations have been surpassed.

tklogger

Tklogger has already proved its usefulness in detecting many different types of events in progress including mail spam, port scanning, user account cracking, and misconfigured daemons. When a high priority event occurs, it has been beneficial to search back to find low priority events that may provide correlative information. We even have people using it for different purposes. I use it to monitor security while another person uses it to monitor WWW logs.

The multiple input file functionality has proven to be the most useful (non-visualization related) feature by allowing syslog-generated events to be organized in multiple files by facility and priority. Tools that generate syslog information have the information forwarded to the secure log-host where it is stored in files according to simple syslog configuration rules. This facilitates searching and archival of logs based on administrator-defined event groupings. This has been considerably easier than dealing with one large file.

cpupie

Cpupie began as a fun project to do with Tcl/Tk that could tell us generally how busy our servers were. It has since proven to be a useful diagnostic tool for spotting potential server problems before they happen. The CPU states indirectly convey information about other parts of the system, such as memory and disk usage.

The meaning of green on any CPU is intuitively obvious. Our brains have been trained from childhood that green indicates that all is well. In this case green represents idle time. A green CPU tells us the machine has room to grow.

Yellow corresponds to system activity. Typically, yellow on our CPUs is 5-25% of the pie. When yellow gets higher than this, it is usually an indication to us that there is either a lot of process activity occurring (a daemon possibly gone amok) or a lot of memory in use for some reason.
Our license server typically has 15% of its single CPU in yellow.

When a CPU exceeds 15% in the red (wait) state, we investigate. This usually indicates that there is a large amount of disk activity occurring. On our mail server this can be normal if a large mail list is being processed. On any other server this usually means the machine is involved in some swapping or paging. Subsequent investigation leads either to a culprit or to the resigned conclusion that more memory is needed.

There is an interesting distinction between the user CPU state (displayed in blue) and the other three states. While resource usage typically balances across all CPUs, CPU time spent in the user state is easier to visualize on one processor of a machine. Therefore, on a machine with four CPUs, a pie with 50% blue should be interpreted as 100% of two CPUs in use. On our quad-CPU machines the effect of this demarcation is readily apparent. The blue slice of the pie is almost always at one of five levels: 0, 25, 50, 75, or 100%. We can tell at a glance which compute servers would be candidates for new jobs. As our users write code to take advantage of multi-threading in modern operating systems, these distinctions may be less visible. In Figure 4, the machine darwin is a four CPU compute server with one CPU servicing a CPU-bound process and dns is a mail server (processing a mail list, as it happens).

ssa

Since our first addition of disks, we have learned that it is much easier to buy disks in groups of six to add to the arrays. However, we still use the tool when new disks arrive. In order to not overburden any controller or disk, we try to distribute things as evenly as possible. This means that when new disks come in we need to move subdisks and adjust free space to provide balanced access across controllers and disks. The tool has gone through several small enhancements since its creation in November of 1995 and is now under general release to help other users of Veritas software around the world.
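The balancing goal described above can also be checked mechanically once a candidate layout is written down. The following Python sketch is not part of the ssa tool, which is written in Tcl/Tk. The layout data, volume names, and function names are hypothetical, and disks are assumed to follow the cNtNdN naming convention so that the controller can be read from the leading "cN". It totals allocated space per disk and per controller, and lists volumes that place more than one subdisk on the same disk or controller.

```python
from collections import Counter, defaultdict

# Hypothetical layout: (volume, disk, size in MB), one entry per subdisk.
layout = [
    ("home1", "c0t0d0", 500), ("home1", "c1t0d0", 500), ("home1", "c2t0d0", 500),
    ("home2", "c0t0d0", 300), ("home2", "c0t1d0", 300), ("home2", "c1t0d0", 300),
]


def controller(disk):
    """Return the controller part of a cNtNdN disk name, e.g. 'c0'."""
    return disk[:disk.index("t")]


def usage(layout):
    """Total allocated space per disk and per controller."""
    per_disk, per_ctrl = Counter(), Counter()
    for _volume, disk, size in layout:
        per_disk[disk] += size
        per_ctrl[controller(disk)] += size
    return per_disk, per_ctrl


def conflicts(layout):
    """Volumes with more than one subdisk on the same disk or controller."""
    disks, ctrls = defaultdict(Counter), defaultdict(Counter)
    for volume, disk, _size in layout:
        disks[volume][disk] += 1
        ctrls[volume][controller(disk)] += 1
    shared_disk = sorted(v for v, c in disks.items() if max(c.values()) > 1)
    shared_ctrl = sorted(v for v, c in ctrls.items() if max(c.values()) > 1)
    return shared_disk, shared_ctrl


per_disk, per_ctrl = usage(layout)
print(dict(per_disk))     # allocated MB on each disk
print(dict(per_ctrl))     # allocated MB behind each controller
print(conflicts(layout))  # (volumes sharing a disk, volumes sharing a controller)
```

In this made-up layout no volume uses a disk twice, but home2 has two subdisks behind controller c0, which is the kind of placement the drag-and-drop display makes visible at a glance.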
The SPARCstorage Array tool makes our jobs a lot easier by allowing us to see at a glance where space is available. VxVM is very flexible at letting you customize your views of the array, but is not designed to allow one to experiment with new disk/volume layouts. It is a production tool. Once you commit to an action, you must wait until it is completed. With the storage array tool we have been able to reduce our configuration time from hours to minutes, save a lot of white-board space, print out (graphically) a picture of the current array layout, and see the actual commands that will be executed by VxVM when we are done.

hphubwatch

We have already identified several high traffic machines in work-groups on particular subnets and endeavored to further segment these networks, or install switches when required. In addition, the tool has revealed, via high numbers of media errors, several old, flat, gray-satin cables that people had used to plug their computers into the network. It has also revealed the occasional improperly punched-down house wiring that has reversed polarity. In the hubs where security is enabled, it lets us track intruders on ports where undergrads may be trying to plug in their laptops or otherwise access the network in an unauthorized manner.

Conclusions

In my experience the genesis of new visualization tools has been 50% need and 50% inspiration. One of our most useful performance tools (cpupie) was inspired by a simple demo. Another (tklogger) was created as an experiment in log analysis. A third was created to meet a specific goal (ssa). The remaining tool presented in this paper (hphubwatch) was created as an experiment and out of the desire to manage our hubs. All four tools have been in use for 6-18 months.

I have found that using different colors is a convenient way to express desired characteristics. The brain can distinguish between thousands if not millions of colors. More is not necessarily better, though.
Experiment with your color choices and use those that work best on your application in your environment. I have experimented with fonts in different pitches and styles (e.g., italic vs. bold), but they do not have the versatility of color. Combining font changes with color changes has worked well. When you do use colors, try to use consistent strategies among tools (e.g., green is good, yellow may need investigating, red should be checked on immediately). Lastly, try to arrange contrasting colors between similar colors (stick a purple between your light seagreen and your medium seagreen).

Visualization tools help the system and network administrator to make sense out of complex data. Almost any program that generates statistical or performance data can be visualized. Using visualization can provide a revelation out of an otherwise rudimentary set of unprocessed data. Patterns may begin to emerge which make management of resources and trend analysis easier.

Visualization also can make complex tasks (e.g., file system rearrangement on a group of disks) easier by displaying the problem in an easier to understand or more intuitive format. Tcl/Tk has been particularly suited to these tasks by providing an easy to learn interpreted language with excellent publicly available contributions, and extensibility via C and C++.

Acknowledgments

I would like to thank my boss, Steve Henderson, for giving me the freedom to construct the tools about which this paper is written. I would also like to thank all the contributors of feedback and enhancements for their input. Finally, I would like to extend my deepest appreciation to all of the writers of the extensions to Tcl/Tk that made these tools possible, particularly Jean-Luc Fontaine, Karl Lehenbauer, and Jürgen Schönwälder for contributing tkpiechart, TclX, and scotty, respectively.

Availability

All tools will run on any Unix platform where Tcl/Tk and the relevant extensions can be compiled (which is most of them).
Tklogger is self-sufficient and should be able to run on Windows and Macintosh platforms as well (with minor modifications regarding file naming conventions). All tools are available for anonymous FTP at ftp.eng.auburn.edu in the pub/doug directory. A sample configuration file for tklogger (.tkloggerrc) is available on this server as well. Tklogger is also available at the Coast security archives (https://www.cs.purdue.edu/coast). Documentation for each tool is included near the top of its file.

Author Information

Doug Hughes got started administering Sun systems while attending Penn State University. He graduated in 1991 with a BE in Computer Engineering. From there he spent three years at GE Aerospace before and after the merger with Martin Marietta (pre-Lockheed, post-RCA) doing everything from software development to networking, systems administration, and internal consulting for client-server projects. From there he went to Auburn University College of Engineering where he now resides as the Senior Network Engineer. His interests include writing programs in scripting languages to ostensibly make his job easier. He can be reached via U.S. Mail at 103 L building, Auburn, AL 36849, or via electronic mail at Doug.Hughes@eng.auburn.edu.

References

[1] John K. Ousterhout, Tcl and the Tk Toolkit, Addison Wesley, Reading, Mass., April 1994.
[2] Larry Wall, Randal Schwartz, Programming Perl, O'Reilly and Associates, Sebastopol, CA, 1991.
[3] Jakob Nielsen, ``Iterative User-Interface Design'', IEEE Computer Society, Computer, Vol 27, n7, July 1994, pp. 32-41.
[4] Wietse Venema, ``TCP Wrapper: Network Monitoring, Access Control and Booby Traps'', Proc. 1992 USENIX UNIX Security Symposium, pp. 85-92.
[5] Doug Hughes, klaxon and tocsin - detecting port scanning, 1995-96, unpublished tools, available via FTP at ftp.eng.auburn.edu:/pub/doug/.
[6] Jean-Luc Fontaine, a Tcl/Tk pie utility, 1996, available via FTP at ftp.neosoft.com in /pub/tcl/alcatel/code/tkpiechart-*.tar.gz.
[7] Sun Microsystems Inc., Understanding Disk Arrays, white paper, 1994.
[8] Marshall T. Rose, The Simple Book, 2nd edition, Prentice Hall, Inc., Englewood Cliffs, New Jersey, 1994.
[9] Chuck Musciano, contool, 1994, available via FTP at ftp.x.org:/R5contrib/.
[10] Stephen E. Hansen, E. Todd Atkins, ``Automated System Monitoring and Notification With Swatch'', Proc. Nov. 1993 USENIX LISA, pp. 145-155.
[11] Jürgen Schönwälder, ``scotty - a Tcl interpreter with TCP/IP extensions'', Proc. 3rd USENIX Tcl/Tk Workshop, Jul. 1995.