
Perl Practicum: The Swiss Army Chainsaw

by Hal Pomeranz

A Dab of Philosophy

I was expounding upon the glories of Perl to one of my colleagues the other day and he remarked that Perl seemed rather contrary to the UNIX design philosophy. Everybody has a different take on this issue, but combining simple tools to form more complex ones does appear to be a UNIX fundamental. Perl, on the other hand, intentionally provides a rich set of features which can, in turn, emulate a wide variety of different UNIX utilities. Of course Perl makes it easy to invoke various UNIX tools - it wouldn't be nearly as useful a language otherwise. Often, it will enhance the readability of your program to use different programs via open() or with backticks rather than trying to write the same function in Perl. There are real reasons, however, why this may be suboptimal.

First, there's the efficiency bogeyman. Certainly there is a great deal of overhead in setting up another process for execution. Of course, as the size of the data set you are processing grows, this overhead may become insignificant. As with all optimizations, you should experiment: try different solutions on real data sets and see which approach is most efficient.

A more telling argument in favor of avoiding vendor-provided tools is portability. If you have ever maintained software across multiple platforms, then you know how difficult it can sometimes be to find the utility you need. Where does the find command live on all of your systems - /bin, /usr/bin, /usr/ucb, some other evil hidden location? Does it accept the same set of options on all of your platforms? Multiply these issues by the number of different UNIX utilities that your Perl programs would like to use, and suddenly the cost of reimplementation appears much lower. If you maintain the same revision of Perl across all platforms, you have a consistent basis to work from. As further incentive, this installment will demonstrate some simple methods for emulating various UNIX utilities in Perl.


Certain UNIX utilities translate directly to built-in Perl functions. Perl has built-in chown(), chmod(), mkdir(), and rmdir() functions. There's also link() and symlink() for creating hard and symbolic links respectively, as well as unlink() for removing files, and even a rename() function to partially emulate mv.
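For instance, rename() covers the common case of mv within a single filesystem (unlike mv, it cannot move files across filesystem boundaries). A minimal sketch, with hypothetical filenames:

```perl
# Create a scratch file, then "mv" it with rename().
open(TMP, "> oldname.tmp") || die "Can't create oldname.tmp\n";
print TMP "some data\n";
close(TMP);

rename("oldname.tmp", "newname.tmp")
    || die "Can't rename oldname.tmp\n";

unlink("newname.tmp");          # and "rm" the result
```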

The chown(), chmod(), and unlink() functions will accept a list of files to operate on and return the number of successes. Sometimes, however, you want to know exactly which operations failed. In this case, use a loop over the individual elements of your list of files:

        for (@files) {
             chmod(0644, $_) || warn "Can't change permissions on $_\n";
        }

A similar strategy can be used for those file operations which do not operate on lists.
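For example, mkdir() creates a single directory per call, so the per-item loop is the natural idiom (the directory names here are hypothetical):

```perl
@dirs = ("tmpdir_a", "tmpdir_b", "tmpdir_c");

for (@dirs) {
    # 0755 is the mode before the umask is applied.
    mkdir($_, 0755) || warn "Can't create directory $_\n";
}
```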

Note that if your operating system does not support one of the above function calls, you will encounter various failure modes (some more graceful than others). For example, if symbolic links aren't supported on your system, then the symlink() call will cause your program to die at runtime with a fatal error. Like any function you are unsure of, you should wrap the call in an eval() statement to trap possible errors:

        eval 'symlink($old, $new);';
        warn "Symlink not supported\n" if ($@);

The $@ variable is guaranteed to be the null string if the eval() succeeds, so this is a reliable test.

Sometimes Perl will simply invoke the appropriate operating system tool if a function is not provided as a library call: the mkdir() function is the classic example of this. In this case, it is probably more efficient to call the program once yourself with a list of directories, rather than spawning a process for each individual directory you wish to create.
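If you do shell out, hand the program the whole list in one invocation; a sketch (the /bin/mkdir path and directory names are assumptions for illustration):

```perl
@dirs = ("projdir_a", "projdir_b");

# One spawned process creates every directory in the list.
system("/bin/mkdir", @dirs) == 0
    || warn "mkdir exited with status " . ($? >> 8) . "\n";
```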


Certain Perl functions are closely related to UNIX filters. For example, split() and substr() emulate cut very closely. Perl has a built-in sort() function that is much more powerful than the UNIX sort utility, but you have to define your own comparison routines to do really tricky sorts. (See the first Perl Practicum for more information on devious sorting.) Sometimes, though, you have to change your thinking a bit to get Perl to do what you want.
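As a sketch of the cut parallel, here is the equivalent of cut -d: -f1 and cut -c1-8 on a passwd-style line (the sample line is invented):

```perl
$line = "pomeranz:x:100:10:Hal Pomeranz:/home/pomeranz:/bin/sh";

# Like "cut -d: -f1": split on the delimiter, keep the first field.
($login) = split(/:/, $line);

# Like "cut -c1-8": take a fixed range of characters.
$first8 = substr($line, 0, 8);

print "$login\n";       # prints "pomeranz"
```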

For example, programmers often like to use basename and dirname to get the name the current program was called by (for error messages) and the directory it was called from. Perl stores the full invocation pathname of the program in the variable $0, and basename and dirname can be emulated with appropriate substitutions:

        ($basename = $0) =~ s%.*/%%;
        ($dirname = $0) =~ s%/[^/]*$%%;

The first substitution takes advantage of Perl's greedy pattern matching algorithm to eat up everything up to the last `/' in the pathname and throw it away. If you're interested in both the directory and the file name, you can use the following one-liner:
        ($dirname, $basename) = $0 =~ /(.*)\/(.*)/;

Again, we're making use of the greedy pattern match, as well as the fact that a pattern match in list context returns a list of its parenthesized subexpression matches. The statement looks a little strange, but the precedence is correct.

Another common UNIX filter is uniq. Of course, you always have to sort your file before passing it to uniq because the tool will only recognize consecutive matching lines. Not so with Perl:

        open(FILE, "< myfile") || die "Can't open myfile\n";
        while (<FILE>) {
             next if $seen{$_}++;
             # ...some processing here...
        }

Note that memory usage can get quite high if the file is large and doesn't have a great deal of repetition. On the positive side, the %seen associative array ends up holding a count of the number of repetitions of each line, in case you care to emulate uniq -c. You can always run sort() on the unique lines if you really want the output sorted.
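A uniq -c work-alike then falls out of those counts almost for free (the input lines are inlined here rather than read from a file):

```perl
@lines = ("red\n", "blue\n", "red\n", "red\n", "blue\n");

for (@lines) { $seen{$_}++; }

# Print in the style of "sort | uniq -c".
for (sort keys %seen) {
    printf "%7d %s", $seen{$_}, $_;
}
```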

The grep() function in Perl can be used to emulate UNIX grep:

        open(FILE, "< myfile") || die "Can't open myfile\n";
        @lines = <FILE>;
        @found = grep(/$pattern/, @lines);

This, however, can be rather memory intensive for large files. Instead, simply operate sequentially:

        open(FILE, "< myfile") || die "Can't open myfile\n";
        while (<FILE>) {
             next unless (/$pattern/);
             # ...process here...
        }

If you want a list of matching lines, rather than operating sequentially, just push() the matching lines into a list in the processing section. At least you save having to slurp the entire file into memory.
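That push() variation looks like this (the pattern and input lines are placeholders):

```perl
$pattern = "error";
@input = ("ok: started\n", "error: disk full\n", "ok: done\n");

for (@input) {
    next unless /$pattern/;
    push(@found, $_);   # collect matches without slurping everything
}
```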

The Perl Library

As distributed with Perl pl36, the Perl library contains several packages which emulate useful UNIX utilities. Additional packages are available in the Perl archives; be sure to check there before reinventing the wheel.

You use a package by first "requiring" it and then calling the functions it contains as you would any user-defined function. For example, the ctime.pl package provides a simple work-alike for the UNIX date command:

        require "ctime.pl";
        $date_str = &ctime(time);

Of course, you don't get the formatting string capabilities that some date commands provide, but you can always use localtime() and printf() to emulate this behavior.
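A sketch of that emulation: unpack the localtime() list and let sprintf() do the layout (the format chosen here is arbitrary). Remember that localtime() numbers months from 0 and years from 1900:

```perl
($sec, $min, $hour, $mday, $mon, $year) = localtime(time);

# Adjust the 0-based month and 1900-based year while formatting.
$stamp = sprintf("%04d-%02d-%02d %02d:%02d:%02d",
                 $year + 1900, $mon + 1, $mday, $hour, $min, $sec);
print "$stamp\n";
```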

Also in the easy-to-use category are the getcwd.pl and fastcwd.pl libraries to help you find where you are in the directory tree. The &fastcwd() function defined in fastcwd.pl is more efficient because it uses chdir() to traverse the path up to the root, but you might not be able to get back where you started from once you chdir() out. For those of you who like the $PWD variable under the C shell, there's the pwd.pl library. Simply call &initpwd() after requiring the library and then use the package-defined &chdir() function instead of Perl's built-in chdir(). The &chdir() function will continuously update the $PWD environment variable.

Far and away the most useful volume in the Perl library, though, is find.pl. The union of find options across all UNIX platforms throughout history is tremendous, but their intersection is often minimal. Writing a find command that works on every platform at your site can be a study in constraints. Actually, the find.pl library really exists to drive the find2perl program provided with the Perl distribution, but you can require it directly in a program of your own devising. (The find2perl program will emit a complete Perl program which will exactly match the behavior of the find options fed to find2perl on the command line.)

The &find() function defined in find.pl accepts a list of file and directory names as arguments and will traverse all files and subdirectories just as the UNIX find command does. For each item in the traversal, &find() will call a user-defined subroutine &wanted(). No arguments are passed to &wanted(), but $dir contains the pathname of the current directory, $_ the name of the item currently being considered, and $name the full pathname "$dir/$_". If $_ is a directory, the &wanted() function can set the variable $prune to true to stop &find() from descending into $_.

Beyond that, the processing done in &wanted() is entirely up to the user. Judicious use of the stat() or lstat() function and the Perl file test operators can emulate most of the options supported by the UNIX find command. (Don't forget that the special _ filehandle caches the result of the last stat() or lstat(), whether the function was called directly or via one of the file test operators.) Of course, Perl is a much richer and more powerful language than the command-line syntax of find, so some extremely powerful effects can be obtained. For more guidance, run find2perl on some of your favorite find invocations and study the output carefully. (There are probably dozens of good candidates in /usr/spool/cron/crontabs/* alone.)
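The same callback shape survives in modern Perl as the core File::Find module, where $dir, $name, and $prune became $File::Find::dir, $File::Find::name, and $File::Find::prune. A sketch emulating find . -name '*.pl' -print (the starting directory "." and pattern are arbitrary):

```perl
use File::Find;

sub wanted {
    # $_ holds the current item's name, $File::Find::name its full path.
    push(@found, $File::Find::name) if /\.pl$/;
}

find(\&wanted, ".");
print "$_\n" for (@found);
```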

Plenty of other tools are available in the Perl library. In look.pl, there's a dictionary lookup that emulates the UNIX look command. There's even a syslog interface in syslog.pl, in case you hate calling logger all the time. More libraries are being invented, posted to comp.lang.perl, and archived at coombs every day.

Enough Already

Hopefully by this point I've convinced you that Perl is more than capable of emulating most simple (and some complex, e.g., the s2p and a2p translators that come with the Perl distribution) UNIX tools. Sometimes, though, you really need to call some tool outside of Perl. For example, I have yet to find anything better than:
       chop($hostname = `/bin/hostname`);

So, next time we'll be talking about strategies for writing portable Perl scripts across a bewildering variety of UNIX implementations with subtly different pathnames and command behaviors. Tentative title to be, "The Thing I Love About Standards Is That There Are So Many."

Reproduced from ;login: Vol. 18 No. 6, December 1993.
