The following paper was originally published in the
Proceedings of the Tenth USENIX System Administration Conference
Chicago, IL, USA, Sept. 29 - Oct. 4, 1996.

For more information about USENIX Association contact:

  1. Phone:   (510) 528-8649
  2. FAX:     (510) 548-5738
  3. Email:   office@usenix.org
  4. WWW URL: https://www.usenix.org

The Igor System Administration Tool

Clinton Pierce - Decision Consultants Inc.

ABSTRACT

This paper describes the system administration tool we call Igor. Igor
is a tool for administrating a large number of UNIX systems in a
diverse, networked environment. Igor consists of two parts: an
interactive GUI which is controlled by an operator, and a daemon which
runs on the UNIX target and actually executes the commands. Igor
provides very fast operation and quick post-operation analysis of the
results. In normal operations we have run commands on over 600 hosts
simultaneously in 60 seconds.

History

Igor was created because, in Ford's highly distributed and diverse
environment, the system administrators often found themselves needing
to run something on several hosts very quickly. Some examples include
holiday shutdown, system re-configuration, installing patches, fixing
bugs, surveying systems for patches and software usage, and emergency
damage control (see Script Examples below).
Normally, this would be accomplished with something like this shell
script:

    #!/bin/sh
    VICTIMS="hosta hostb hostc hostd"
    for cur_host in $VICTIMS
    do
        rsh $cur_host "/usr/bin/cmd -arg"
    done

This has many potential problems, not the least of which are:

  o `rsh' not being able to portably return the status of remotely
    executed commands. This means that in order to find out if a
    command worked, sometimes elaborate shell-scripting is involved.
  o stdout/stderr being mixed. This makes checking for errors even
    more difficult on complex scripts.
  o Anything beyond simple commands may involve a full-blown shell
    script, meaning complicated rsh commands, or rcp commands, or use
    of an NFS filesystem to transport the scripts.
  o Slow hosts can bog down operations.
  o Hosts whose inetd/rshd has gone to lunch cause rsh not to work
    properly.
  o This operation is serial. If rsh hangs, your whole list of hosts
    goes unprocessed.
  o The ``batch'' style of operation is quite unsatisfying, especially
    when, after a long run, you discover a subtle bug in the script on
    either end and have to repeat the lengthy run.
  o rsh uses whatever ``mystery shell'' is on the other side - and
    whatever environment comes with it.

A very cleverly written shell script can get around almost all of
these limitations. Certainly a C program could. However, Igor gets
around all of these problems and provides a neat, clean user interface
for getting these kinds of jobs done.

In addition, other schemes such as running administrative scripts
through cron do not allow for a really interactive approach to solving
these problems.

Detailed Description

Igor solves these problems by creating a fast, robust, portable, and
flexible method to distribute these kinds of jobs. Igor can be
described simply as a multiplexed rsh. With Igor there are two parts,
a target and a controller. Security is handled through traditional
``rexec''-type security (/etc/hosts.equiv and .rhosts files).
Igor accepts commands from a GUI on the controlling system, and passes
those commands to a set of Perl scripts which distribute the commands
to the remote (target) systems. These Perl scripts maintain several
remote connections simultaneously, and handle situations such as
timeouts, network connection problems and terminating connections. The
Perl scripts then collect the information and return it to the GUI.

The GUI itself has additional features to help the operator debug
remote systems, such as:

  o Double-click (left) on a hostname will open an xterm on that host.
  o Single-click (right) on a hostname will bring up the most recent
    results from that host (or group of hosts) in an editor for
    viewing.
  o A scrollable list of hosts is always available, showing the status
    of the last commands run, any output or errors that resulted from
    the last run, and the current known state of that host
    (unreachable, running a job, completed, etc.).
  o Information on the host-type of the various systems.

The Code

Igor is written in Tcl/Tk [1] (as a wish script) and Perl 4 [2], and
the target end is entirely written in Perl 4. The only prerequisites
for administrating a target with Igor are that the target have
available to it Perl (version 4 or 5), and a host which it implicitly
trusts (preferably a centrally located, tightly controlled host).

The daemon which runs on the target system uses no Perl library code,
and only requires that the interpreter be present. This means that all
of the networking code is rolled into the daemon itself. This design
was based on the decision that we wanted the daemon to rely on as
little as possible to run. For example, if the perl library modules
were not mounted, then we did not want the daemon disabled.
Figure 1: The graphic user interface

Igor works as an interactive tool, which makes it different from tools
such as DSmit for AIX [5] and easier to set up than tools such as
Systcl [3]. Igor allows a skilled operator to write shell scripts or
perl scripts and have them executed very quickly.

The Controlling system's software consists of a GUI which is entirely
written in wish(1) and a series of backend perl scripts. The GUI
simply manages the list of hosts, current set of commands, current set
of regular expression matches and some tunable preferences. The rest
of the controlling system's software is a set of perl scripts that
produce reports on the output data (and use the regular expression
data to determine if the run was successful), set up socket
connections to the remote hosts, and rcp, rsh and ping the remote
hosts to set up the Igor daemon.

The Target System

The target is any UNIX host which runs Perl and which trusts a central
host (with /.rhosts or /etc/hosts.equiv). Igor runs as a daemon
monitoring a pre-determined TCP/IP port. This daemon can either be
started by the controlling system using a ``spawn'' function available
to the operator or it can be started by conventional means, such as rc
scripts. The ``spawn'' function uses ping to contact the target host,
rcp to move the Igord daemon script to the target system and then rsh
to run the script. The script daemonizes itself and continues running
in the background.

The Igor daemon (Igord) then listens to the port and, when a
connection is made, forks; the child process receives commands from
the controlling system. These commands can be shell scripts, perl
scripts, or special built-in functions to send files to the target.
The Standard Output and Standard Error of all of the executed commands
are carefully collected, put into a boilerplate and sent back to the
controlling system for analysis. The child then dies.
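
As a rough illustration of the collection step described above, the
following shell sketch runs one command, keeps Standard Output and
Standard Error apart, and wraps both in a simple report together with
the exit status. It is not Igord's code (which is Perl): the function
name run_and_report and the report layout are assumptions made for this
example, and mktemp is assumed to be available on the host.

```shell
#!/bin/sh
# run_and_report COMMAND
# Hypothetical helper, not taken from Igord. Runs COMMAND through
# /bin/sh, captures stdout and stderr in separate temporary files, and
# prints a framed report (the frame layout is invented for this sketch).
run_and_report() {
    out_file=$(mktemp) || return 1
    err_file=$(mktemp) || { rm -f "$out_file"; return 1; }

    # Record the exit status without letting a failure abort the caller.
    status=0
    /bin/sh -c "$1" >"$out_file" 2>"$err_file" || status=$?

    echo "HOST: $(uname -n)"
    echo "STATUS: $status"
    echo "--- STDOUT ---"
    cat "$out_file"
    echo "--- STDERR ---"
    cat "$err_file"
    echo "--- END ---"

    rm -f "$out_file" "$err_file"
}
```

Calling run_and_report 'df /tmp /' would print the df output under the
STDOUT marker and any complaints under the STDERR marker, so a reader
(or a later matching step) never has to guess which stream a line came
from.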
The preferred method of starting the daemon is to have the target
system start it as part of its initialization. This way, the daemon is
always available to run commands, and does not have to be
``re-spawned''. Normally, at Ford, if we find a system that is not
starting the daemon at boot time, we spawn a daemon on the host, and
then run an Igor job that installs the daemon on the target host so
that it starts as part of the next boot.

Figure 2: Point and click GUI

The Controller

The operator runs Igor from a well-trusted host. This host should (if
possible) have a very thick connection to the targets and should have
as much CPU as you can spare; the more CPU and network, the more jobs
you can run in parallel. Also, the controlling host should be able to
open many sockets at once. Under certain OSes (Solaris) this requires
a kernel tunable parameter to be set. The amount of resources used by
the Controlling system is controlled with a ``throttle'' adjustable in
a preferences dialog. The throttle controls how many remote hosts will
be communicated with at any one time. Setting this number high uses
more resources. On a Sparc-center 1000 with 1 CPU, a throttle limit of
40 will keep the load-average of the system near 10.

On the well-trusted (Controlling) host, the operator first loads in a
list of hosts to operate on. This list is simply a flat-ASCII text
file, one host per line, and is loaded with a point-and-click file
browser [Figure 2].

Once the hosts are loaded, you can perform ``run'' or ``spawn''
operations on those hosts or a selected subset of those hosts. Spawn
is used to start Igord on the remote hosts. The host is pinged, the
script is rcp'd to the host and rsh is used to get it running.
Traditional BSD-style network commands are used so that the target
host is almost assured to have the necessary utilities already in
place to start the daemon.
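
The spawn sequence just described might be sketched in shell as
follows. This is an illustration only: the function name spawn_daemon,
the IGOR_PING variable, the remote path and the way the daemon is
launched (perl found on the remote PATH) are all assumptions, not
details taken from Igor itself.

```shell
#!/bin/sh
# spawn_daemon HOST SCRIPT
# Hypothetical sketch of the ping / rcp / rsh sequence; not Igor's code.
spawn_daemon() {
    host=$1
    script=$2
    remote=/tmp/igord.pl    # assumed remote location for the daemon script

    # A bare "ping host" returns by itself on some systems; others need
    # a count option, so the command can be overridden via IGOR_PING.
    if ! ${IGOR_PING:-ping} "$host" >/dev/null 2>&1; then
        echo "$host: unreachable"
        return 1
    fi
    if ! rcp "$script" "$host:$remote"; then
        echo "$host: copy failed"
        return 2
    fi
    # rsh reports only its own failures, not the remote exit status.
    # -n keeps rsh from reading the caller's standard input.
    if ! rsh -n "$host" "perl $remote"; then
        echo "$host: start failed"
        return 3
    fi
    echo "$host: spawned"
}
```

Because rsh cannot pass back the remote exit status, a "spawned"
message here only means that the three local commands succeeded;
confirming that the daemon is really listening would take a connection
to its port.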
Once the daemon is started, you generally do not have to restart it.
If a daemon is already running, and the system is ``respawned'', then
the new daemon will kill off the old one, and run in its place.

Once the ``RUN'' button is pushed, the script that the operator has
entered is transmitted to the remote systems, and their output is
collected and sent back to the controlling system.

For both Run and Spawn the GUI starts a back end perl script. That
process forks as many times as needed to reach the ``throttle'' limit.
Then each child takes a system name from the common hostname pool, and
tries to contact the Igord on that host and execute the job. When the
job is completed on a particular host, the child grabs another
hostname from the pool and starts again. Each of the communication
processes will time out if necessary, using a value set in the
preferences dialog of the GUI, and then take another host from the
pool.

Figure 3: Script

Figure 4: Regular expression list

Figure 5: Progress indicator

The backend scripts and the GUI operate independently of each other.
The GUI simply starts the backend scripts and then can retrieve their
output by looking in a hard-wired subdirectory (./Idata) for results
from each host. If the user requests another job be run while the
first job is still running, that isn't a problem. Another set of
backend scripts is started, and the results are left in the same
subdirectory.
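
The throttle behaviour can be approximated in a few lines of shell.
The sketch below is not Igor's Perl back end: it leans on the -P option
of xargs (a common extension in GNU and BSD xargs, assumed to be
present) to keep at most a fixed number of workers busy, each taking
the next hostname as soon as a slot frees up. The name run_pool, the
.out/.err/.status file suffixes and the worker argument are
hypothetical, and per-host timeouts are not modeled.

```shell
#!/bin/sh
# run_pool HOSTFILE THROTTLE RESULTDIR WORKER
# Hypothetical sketch. For every host listed in HOSTFILE (one per line)
# run "WORKER host", with at most THROTTLE workers running at once, and
# leave each host's output in RESULTDIR, one set of files per host.
# WORKER must be an executable command, not a shell function.
run_pool() {
    hostfile=$1
    throttle=$2
    resultdir=$3
    worker=$4

    mkdir -p "$resultdir" || return 1

    # xargs appends one hostname per invocation; inside the child shell
    # $1 is the result directory, $2 the worker, $3 the hostname.
    xargs -P "$throttle" -n 1 sh -c '
        host=$3
        "$2" "$host" >"$1/$host.out" 2>"$1/$host.err"
        echo $? >"$1/$host.status"
    ' sh "$resultdir" "$worker" <"$hostfile"
}
```

For instance, run_pool hosts.txt 40 ./Idata ./contact_host would mirror
a throttle limit of 40, where contact_host stands for whatever program
talks to a single target. Because results are plain files keyed by
hostname, a separate viewer can read them at any time without
coordinating with the workers, which is the same decoupling the GUI and
back end rely on.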
If for some reason the GUI needs to communicate with the backend
scripts (to abort a job, for example), either token files are left in
a common area which the scripts look for occasionally, or a
specialized perl script can communicate back and forth between the GUI
and the backend scripts using other IPC mechanisms.

Controlling GUI Tour

The buttons on the GUI [Figure 1] do the following:

  o ``Hosts...'': Opens the Host Selection dialog [Figure 10] and
    allows you to add/change/load the hosts to be worked on.
  o ``Preferences'': Opens the User Preferences screen [Figure 11].
    The fields are:
      O Throttle - Maximum number of hosts to work on at once.
      O Timeouts - How long to wait for any one host to respond to
        Igor's query. Once the host is contacted, the timeout is no
        longer in effect.
      O Editor Options - When ``view selected'' is picked for multiple
        hosts, this selects whether you want to see one host at a time
        (i.e., ``vi hosta hostb hostc etc..'') or all of the host data
        concatenated.
      O Voyeur Options - The Progress Indicator can be brought up
        automatically when the ``Run'' or ``Spawn'' buttons are
        pushed. Normally the Indicators are not shown.
    Preferences are stored between sessions in .igorrc.
  o ``Spawn'': Starts Igord on the entire set of hosts.
  o ``Run'': Runs the current set of commands on the entire set of
    hosts.
  o ``Re-Spawn Sel'': Starts Igord on the selected hosts.
  o ``Re-Run Sel'': Runs the current set of commands on the selected
    hosts.
  o ``Stop Spawn'': Stops a spawn in progress. Any ``spawn'' that is
    currently being tried on a host is finished first.
  o ``Stop Run'': Stops a run in progress. If a host is already
    active, the run is finished on that host.
  o ``Clear Data'': Clears the ./Idata directory and removes all of
    Igor's information on its contacted hosts.
  o ``Port #'': Multiple Igord's can be run on a system
    simultaneously. This allows you to control which one you're
    talking to.
  o ``Rescan'': Retrieves the current set of data for each host,
    refreshes the host status window, and re-applies the regular
    expressions to the output.
  o ``Scan Interval'': A rescan can be done at a regular interval. An
    interval of 0 stops the auto-rescan.
  o ``View Sel.'': Allows you to view the current data [Figure 9]
    retrieved for each host selected.
  o ``Save Sel.'': Will save the list of selected hosts to a file.
    This allows you to create lists of hosts split up based on the
    results obtained. For example, saving all of the hosts which fail
    a certain test so that they can be corrected later.
  o ``Forget'': Removes the selected hosts from the current host list.

Figure 6: Results of running against various hosts

Pressing the right mouse button in Igor will bring up another panel
with additional buttons [Figure 12]:

  o ``Archtype Sel'': Shows the architecture type of the selected
    hosts.
  o ``Watch Run'': Pops up a Progress Indicator for each Run currently
    in progress.
  o ``Watch Spawn'': Pops up a Progress Indicator for each Spawn
    currently in progress.
  o ``Kill Spawn w/Prejudice'': Stops a Spawn immediately. Does not
    finish the hosts currently being worked on. This can leave the
    remote hosts half-done (the daemon is there, but not running, for
    example).
  o ``Kill Run w/Prejudice'': Stops a Run immediately. Does not finish
    the hosts currently being worked on. This can leave the hosts
    having executed only some (or none) of the commands sent to them.
    Still, if you make a mistake, this button is your friend.

Igor's Security

Igor's security is based on BSD rexec(3N)-style security: targets use
a ~/.rhosts file for each operator they wish to trust, or an
/etc/hosts.equiv file listing all of the hosts and users that they
trust for Igor activity. The daemon, upon connect from a Controlling
system, will verify that the remote system is trusted. Having verified
that, it will accept commands from the Controlling system.
If the trust check fails, no commands are accepted and an error
message is printed on the socket. Please note that the distributed
version of Igor does not use a ``trusted'' port, and is for
experimentation only. Simply changing the port usage on the daemon and
the GUI will make Igor use a trusted port and be a little more secure.

Figure 7: New set of commands

Figure 8: New set of regular expressions

This security, although old and not state-of-the-art, is no less
secure than what is used for rsh. Internally to Ford, this is
generally adequate. Each workstation trusts a centrally located
server, and our network security is handled by third party sources.
Because we trust our network services and the centrally administered
host, rexec security is adequate for our purposes.

There are certainly other ways to make Igor more secure. For example,
a PGP encrypted copy of the script could be transmitted to the remote
daemons. The operator at the GUI could be queried for the encryption
key. The decryption key could be located by querying another (or the
same) host and, having obtained the key, the daemon could decrypt the
Igor commands, ensuring that they came from the correct host. Igor's
code is fairly straightforward, and could be easily changed to accept
these modifications.

Igor Scripts

The scripts that Igor runs are nothing more than a way of wrapping up
shell scripts, tar files, and simple shell commands so that Igor can
make sense of them at the remote side and run them. The various
commands are:

  o do args - Run args as a shell command (/bin/sh). Eventually,
    everything except the ``do'' is passed to a perl script (on the
    remote end) and run as a ``system'' command. Normal shell argument
    parsing will take place on the remote end.
    This is the most commonly used Igor command.
  o EVAL args - Run ``args'' as Perl commands. This can be used to run
    perl commands directly by the remote daemon. Another use is to add
    functionality, on the fly, to the daemon by having it ``eval'' new
    functions. It can be used to add timeout capability to various
    Igor commands (``do''). See examples below. This is generally safe
    to use, because the daemon that's being modified with EVAL is
    simply a child of the daemon listening to the port on the remote
    system. Any potential defects in the child daemon do not affect
    its parent.
  o openfile file mode - Open file as a ``here'' file with mode
    specified. This is one method of transmitting lengthy shell
    scripts to the remote system. Binary data being sent must be
    uuencoded because the Igor ``script'' exists for a while in a Tcl
    list, which can't contain binary data. Other ways of transmitting
    scripts, patches, programs, etc... include using an NFS mount
    publicly exported Read Only from a common system, retrieving the
    data through an ftp script, or accessing it from a webserver with
    a URL. (There is a short Igor script using EVAL which can enable
    Igor to do HTTP retrievals.)
  o closefile - End of openfile block.
  o id - Igor will report its version number, local hostname, remote
    hostname (controller), date, time, local system architecture type,
    and other useful information.
  o quit - Terminate the remote Igord, transmit results. This command
    is REQUIRED at the end of a script, and will be inserted if you do
    not use one.

The commands are given to the GUI by pressing the ``Edit'' button in
the Remote Commands window and using your favorite editor to enter
commands. Pre-assembled lists of commands can be loaded using the
``Load'' button. Scripts are saved with a common file extension
(.cmds) to distinguish them from other files.
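
To make the structure of a .cmds script concrete, here is a small
shell sketch that reads a script on standard input, tags each line as
a command or as file content between openfile and closefile, and
appends the final quit when it is missing. It is not part of Igor: the
name check_cmds and the CMD/DATA tags are invented for this example,
and it assumes one command per line with closefile alone on its line.

```shell
#!/bin/sh
# check_cmds: hypothetical linter for an Igor-style command script.
# Reads the script on stdin and prints "CMD <line>" for commands and
# "DATA <line>" for lines that belong to an openfile block.
check_cmds() {
    in_file=0
    saw_quit=0
    while IFS= read -r line || [ -n "$line" ]; do
        if [ "$in_file" -eq 1 ]; then
            # Inside openfile: everything is data until closefile.
            if [ "$line" = "closefile" ]; then
                in_file=0
                echo "CMD $line"
            else
                echo "DATA $line"
            fi
            continue
        fi
        case $line in
            "openfile "*)       in_file=1; echo "CMD $line" ;;
            "do "*|"EVAL "*|id) echo "CMD $line" ;;
            quit)               saw_quit=1; echo "CMD $line"; break ;;
            "")                 ;;   # ignore blank lines between commands
            *)                  echo "unknown command: $line" >&2; return 1 ;;
        esac
    done
    if [ "$in_file" -eq 1 ]; then
        echo "openfile without closefile" >&2
        return 1
    fi
    # The paper notes that quit is inserted when the script lacks one.
    [ "$saw_quit" -eq 1 ] || echo "CMD quit"
}
```

Running check_cmds <fixbugs.cmds before loading a script into the GUI
would catch an unterminated openfile block, which otherwise would
swallow the remaining commands as file content.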
Script Examples

To transmit a small shell script and run it:

    openfile /tmp/fixbugs.sh 0755
    #!/bin/sh
    echo "Then a miracle occurs here"
    install_miracle_patch
    closefile
    do /tmp/fixbugs.sh
    do rm /tmp/fixbugs.sh
    quit

To check disk space in / and /tmp:

    do df /tmp /
    quit

Figure 9: Sample of retrieved data

Using the EVAL function, some additional functionality can be added to
scripts:

    EVAL sub timeout { next MAINLOOP; } $SIG{'ALRM'}='timeout'; alarm(10);
    do function_that_may_hang
    EVAL alarm(0);
    quit

This adds a timeout to the ``function_that_may_hang''. If the program
doesn't return, Igor catches an alarm signal and continues executing
the script after the questionable function. This requires some
knowledge of the innards of Igor, but these tricks are well
documented.

Output Analysis

One of Igor's most important features is analyzing the output as it
comes back from the remote system. In the GUI, each system is shown,
with a count of the number of lines of STDOUT and STDERR reported.
Sometimes this is enough to tell if everything worked OK.

Also in that window is a field labeled ``Pass/Fail''. This field can
be used to tag each system with a Passed/Failed status. That is done
by using the Regular Expression matcher. This area takes input in the
form:

    STDOUT
    Exp1
    Exp2
    Expn
    STDERR
    Exp1
    Exp2
    Expn
    BOOL
    Boolean Expression

The Expressions are Perlish regular expressions (without the //'s).
Slashes and special characters must be quoted. These expressions are
matched against successive lines of Standard Output or Standard Error
and, so long as the expressions match, the associated tokens (STDOUT,
STDERR) will evaluate to true. If a regex does not match, the token
gets set to false. The regular expressions are associated with the
token they follow. The BOOL token indicates that the next line will
contain an expression that will evaluate to true or false.
The Boolean Expression is a perlish thing that is going to get EVAL'd.
``STDERR'' gets substituted with 1 for a match and 0 for a nonmatch;
``STDOUT'' gets substituted with 1 for a match and 0 for a nonmatch.
Depending on the outcome, the system will be marked as ``passed'' or
``failed'' in the status screen.

This sounds complicated, but in practice is a rather simple way of
checking output. For example, to consider all systems that report
something on STDOUT and nothing on STDERR as ``Passed'', you could use
this arrangement:

    STDOUT
    .
    STDERR
    .
    BOOL
    STDOUT && !STDERR

STDOUT gets set to 1 (true) if any single character is matched. STDERR
gets set to 1 (true) if there's any STDERR output. If the BOOL
expression evaluates to true, the system is tagged as ``Passed'',
otherwise ``Failed''. Something more complicated could be used like
this:

    STDOUT
    9[1-9]%
    STDERR
    .
    BOOL
    !(STDOUT || STDERR)

For the set of commands:

    do df

this would report ``Failed'' if the ``df'' command reported any
filesystem more than 90% full, or ``df'' reported anything on STDERR.

There are two other special tokens that can be used in the Boolean
expression in addition to STDERR and STDOUT. These are STDERRCNT and
STDOUTCNT. These represent the number of lines of output on each file
descriptor. For example:

    STDERR
    .
    BOOL
    ( STDOUTCNT > 3 ) && ( ! STDERR )

This would return true (passed) if there were more than three lines on
stdout, and nothing on stderr. This would be useful if the expected
output could have a variable number of lines, but no errors should be
expected.

The Pass/Fail indicators shown in the hostlists are generated every
time the host list is displayed. So if you decide that different
pass/fail criteria are necessary for your hosts, you can change the
regular expressions and rescan the host list. You do not need to
re-run the commands on the remote hosts.

Sample Walkthrough

To demonstrate Igor's true usefulness, what follows is a walkthrough
of a sample Igor session.
The hypothetical problem will be adding a resolver to each
workstation's /etc/resolv.conf.

The operator would first log in to a trusted host, and start the GUI
[Figure 1]. Next the operator can load the list of hosts to be
operated on, or can enter them in [Figure 2].

First, in order to make our scripting a little easier, to find out
which hosts already have the correct resolver we'll use the script
shown in Figure 3. The regular expressions list is shown in Figure 4.
These are loaded from a file browser.

Figure 10: Host selection dialog

Figure 11: User preferences screen

Now we're ready to contact the hosts. Clicking on the ``Run'' button
will cause Igor to contact the remote hosts, and send them the script
to be run. A progress indicator [Figure 5] lets the operator know how
many hosts are untouched, being-worked-on, or have completed the
commands. When the sliders indicate everything is done (or even before
that) the operator can click on ``Refresh'' and that will cause the
RE's to be run against the results obtained.

The results are shown in Figure 6. Some hosts passed OK (they already
have the resolver in the file), some did not. The hosts which pass can
be selected with the mouse, and then the ``forget'' button pressed.
This will drop these hosts from the host list. We do this so that only
the hosts that need work are left in the list.

Figure 12: More buttons

The remaining hosts are left onscreen. We can then load in a new set
of commands [Figure 7], and a new set of regular expressions [Figure
8]. These will actually add the new resolver into the resolv.conf
file.
Pressing ``Run'' will cause the progress indicators to reappear, and
when Igor is all done, we can see which hosts were modified
successfully, and which were not.

If problems appear during the run, there are quite a few things that
can be done to diagnose what happened. We can look at the data
retrieved from the hosts by selecting the hosts we're interested in,
and then clicking ``View Selected''. The raw data retrieved from the
remote host is shown in a vi session. A sample of the retrieved data
is shown in Figure 9. The initial information shows the connection
being established; the standard output and standard error are then
shown separately. From this information you might be able to tell
what's wrong on the host.

If more diagnostics (or repairs) are needed on individual hosts, the
operator can double-click on a host in the list. An Xterm will open
(running ``rsh host'') so that he can check things out manually. If a
large number of hosts failed, the operator can rewrite the script and
try it again.

Cautions

If Perl is the ``Swiss Army Chainsaw'' of UNIX, then Igor is a Gatling
Gun loaded with Swiss Army Chainsaws - a useful tool or a terrible
weapon of destruction. Igor is the fastest way we know of to fix
problems on our 700 hosts - it's also the fastest way to cause them.
It should probably not be used by anyone who doesn't understand how it
works. For example, if you meant an Igor script to do ``/bin/rm -rf
/tmp/*'' but had typed ``/bin/rm -rf /tmp /*'', then ALL of your
systems would be instantly erased. This is a normal pitfall for system
administrators; it's only magnified with Igor.

The thought of implementing Operator safety features in Igor has
entered our minds (``Are you sure?'' type questions, etc...) and then
swiftly left. One of the most powerful features of Igor is the fact
that it doesn't get in your way.
Once the structure of the commands is learned (there are only 4) and
you've collected enough post-processing templates, Igor just lets you
do whatever is necessary...quickly. No fuss, no muss.

Author Information

Clinton Pierce is a System Administrator for Decision Consultants,
Inc. and is currently assigned to Ford Motor Company. He is currently
involved with integrating Solaris, AIX and IRIX workstations into a
common look-and-feel environment. In addition, he teaches UNIX and
Perl to consultants for DCI. Clinton can be reached via e-mail at
cpierce1@ford.com, or U.S. Mail at Ford Systems Integration Center,
1000 Republic Drive, Suite 600, Allen Park MI 48101.

Bibliography

[1] Welch, Brent. 1995. Practical Programming in Tcl and Tk. Prentice
    Hall, Englewood Cliffs, NJ.
[2] Wall, Larry, and Schwartz, Randal L. 1991. Programming Perl.
    O'Reilly & Associates, Sebastopol, CA.
[3] Lombardi, Christine, and Desimone, Salvatore. 1993. ``Systcl,''
    Proceedings of the 1993 USENIX LISA Conference, pp. 133, Monterey,
    CA.
[4] Stevens, W. Richard. 1992. Advanced Programming in the UNIX
    Environment. Addison-Wesley, Reading, Mass.
[5] AIX DSMIT Guide and Reference Version 2.2. 1994. Pub. Number
    SC23-2667.