Small Tools for Automatic Text Generation
by Diomidis Spinellis

Diomidis Spinellis holds a PhD in computer science. Currently he is leading software development at SENA S.A. His research interests include information security, software engineering, and programming languages.

The Problem

The textual description and analysis of large data sets can be a repetitive and error-prone task. Examples of data sets that may need to be textually described include population statistics, market research results, scientific and engineering data such as chemical compounds or earthquake structural damage, and literature surveys. In all such cases the data need to be organized and presented in a meaningful way. In some cases this process is periodically repeated (e.g., market research results); in other cases the base data set is frequently updated (e.g., a literature survey). Automating the data presentation process can reduce the work required to generate the reports, eliminate a source of errors, enhance the outcome's consistency, and speed up the generation process.

In the following paragraphs I describe a small set of special-purpose tools developed for the automatic production of a multiparadigm language literature survey. A set of 104 languages combining different programming paradigms needed to be presented in a meaningful way, tabulated, indexed, and cross-referenced. I decided to divide them according to the paradigms they supported and present them by listing certain characteristics of each language. During the course of my investigation the set of languages surveyed grew by 100% and was frequently revised; this made the hand editing of the text error-prone and unproductive. Although the tools developed are special purpose, the methods used to construct the tools and generate the output can be applied in a number of different situations.

Functional Description

The results of the survey were entered into a simple database structured as a text file.
The file is structured similarly to a refer database: records are separated by empty lines, and each record field is identified by a letter following the percent character at the beginning of a line. Figure 1 shows a sample record from the database. The record's fields are the name of the language (N), significant references (R), its characteristics (C), its usual implementation (I), the paradigms it supports (P), and a short descriptive text (D).
%N Modula-Prolog
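The record layout described above is simple enough to parse in a few lines. The following is a minimal sketch in Python (the original tools were written in Perl and sh, so this is a stand-in, not the author's code). The sample field values, the name `parse_records`, and the handling of continuation lines and repeated fields are assumptions made for illustration.

```python
import re

# A field line starts with '%', a single letter, and then the value.
FIELD_RE = re.compile(r"^%([A-Z])\s*(.*)$")

# Hypothetical sample data; only the record layout follows the article.
SAMPLE = """%N ExampleLang
%R [Ref90]
%P imperative logic
%D Placeholder description that
continues on a second line.

%N OtherLang
%P functional logic
"""


def parse_records(text):
    """Return a list of records, each a dict mapping a field letter to a list of values."""
    records = []
    current = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current record.
            if current:
                records.append(current)
            current, last_key = {}, None
            continue
        match = FIELD_RE.match(line)
        if match:
            last_key = match.group(1)
            current.setdefault(last_key, []).append(match.group(2).strip())
        elif last_key is not None:
            # Assumption: a line without a '%X' prefix continues the previous field.
            current[last_key][-1] += " " + line.strip()
    if current:
        records.append(current)
    return records


records = parse_records(SAMPLE)
for rec in records:
    print(rec["N"][0], "->", rec["P"][0].split())
```

Keeping each field as a list allows a letter to repeat within a record (for example, several R lines); whether the original database did so is not stated.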
The text generator scans the database and divides the languages into categories according to the paradigms supported by each language. Each such category (e.g., languages that support the logic and object-oriented paradigms) is formatted as a separate text section. The generator formats the title as in the following example:

2.2.5 Combinations of Imperative and Logic Paradigms

It then inserts some manually prepared descriptive text and appends text that tells the reader the number of languages in that category, provides pointers to the summarizing tables (examples are Tables 1 and 2), and introduces the paragraphs to follow. The following is an example of the automatically generated text:

We found ten languages that combine the imperative and logic programming paradigms. Their implementations are summarized in Table 1 and their characteristics in Table 2. In the following paragraphs we list the most important features of each language.
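The title and opening paragraph shown above follow a fixed template, which a short function can fill in. This is a hedged Python sketch rather than the original Perl; the function names, the number-word list, and the singular/plural handling are assumptions, and the sketch emits plain text where the real generator produced LaTeX.

```python
# Small numbers are spelled out, as in the article's sample output.
NUMBER_WORDS = ["zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"]


def spell(n):
    """Spell out small counts; fall back to digits for larger ones."""
    return NUMBER_WORDS[n] if 0 <= n < len(NUMBER_WORDS) else str(n)


def join_names(names):
    """Join names as 'a', 'a and b', or 'a, b and c'."""
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]


def section_title(number, paradigms):
    """Build a section title for one paradigm combination."""
    capitalized = [p.capitalize() for p in paradigms]
    return f"{number} Combinations of {join_names(capitalized)} Paradigms"


def opening_paragraph(count, paradigms, impl_table, char_table):
    """Build the generated paragraph that introduces a section."""
    plural = count != 1
    return (
        f"We found {spell(count)} {'languages' if plural else 'language'} that "
        f"{'combine' if plural else 'combines'} the {join_names(paradigms)} "
        f"programming paradigms. "
        f"{'Their implementations are' if plural else 'Its implementation is'} "
        f"summarized in Table {impl_table} "
        f"and {'their' if plural else 'its'} characteristics in Table {char_table}. "
        f"In the following paragraphs we list the most important features of "
        f"{'each language' if plural else 'this language'}."
    )


title = section_title("2.2.5", ["imperative", "logic"])
paragraph = opening_paragraph(10, ["imperative", "logic"], 1, 2)
print(title)
print(paragraph)
```

With the arguments shown, the two calls reproduce the sample title and paragraph quoted in the text.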
Table 1: Implementations combining the imperative and logic paradigms
A description of the languages based on the R, D, and N record fields is then produced:

2.PAK [Mel75] Block structured language offering user-defined pattern matching and backtracking.

The generator inserts at the end of each section another tailor-made paragraph containing concluding remarks on that section. A special section contains all languages that could not be fitted into one of the preceding sections together with a special table listing the paradigms each language supports (Table 3). After all sections have been processed the generator produces a summary of the number of languages supporting each paradigm combination (Table 4).
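The summary step amounts to counting records per paradigm combination and routing everything else to the catch-all section. The following Python sketch shows one way to do that; the record shape (a dict with `N` and `P` strings), the sample data, and the function names are assumptions for illustration, not the author's implementation.

```python
from collections import Counter


def paradigm_key(record):
    """Order-independent key for the paradigms listed in the P field."""
    return tuple(sorted(record["P"].split()))


def summarize(records, featured):
    """Count languages per featured combination; collect the rest for the catch-all section."""
    featured_keys = {tuple(sorted(combo)) for combo in featured}
    counts = Counter()
    leftovers = []
    for rec in records:
        key = paradigm_key(rec)
        if key in featured_keys:
            counts[key] += 1
        else:
            leftovers.append(rec["N"])
    return counts, sorted(leftovers)


# Hypothetical records; names and paradigms are invented for this example.
records = [
    {"N": "LangA", "P": "logic imperative"},
    {"N": "LangB", "P": "imperative logic"},
    {"N": "LangC", "P": "functional logic"},
    {"N": "LangD", "P": "object-oriented"},
]
featured = [("imperative", "logic"), ("functional", "logic")]

counts, leftovers = summarize(records, featured)
for combo, n in sorted(counts.items()):
    print(" + ".join(combo), n)
print("catch-all section:", leftovers)
```

Sorting the paradigm names before comparing makes "logic imperative" and "imperative logic" count as the same combination, which mirrors the field sorting the article describes in its implementation.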
Table 4: Summary example

Implementation

I implemented the generator as a set of ten small tools, using a separate program to administer the database in cases where a text editor was not adequate. Each of the ten tools performs a small specialized task. The tools are implemented in the Perl and Bourne shell (sh) languages, making use of additional UNIX tools such as grep and sed.

The system's driver is the maketext program, which, given a list of interesting paradigm combinations, generates the section title and the opening paragraph. It then calls the external programs desclist to create the description list and chartable/imptable to create the characteristics and implementation tables. After the whole database has been processed, the same program generates the summary table. One other program, partable, generates the language/paradigm table for the last catch-all section. All programs take as an argument the paradigm combination described in the section processed, or (for the last section) the paradigm combinations already processed. Maketext is implemented in Perl, making extensive use of regular expressions to parse the database.

The second-level programs operate on the output of the pars program, which, given a paradigm combination, generates a sorted list of all records exactly matching that paradigm combination. A similar program, chars, generates only the characteristics of the paradigm combination. Pars depends for its operation on the dbgrep program, which scans the database for records matching the search criterion specified as an expression possibly containing string regular expressions. Dbgrep also takes a flag to display all records not matching the criterion. This feature is used in the generation of the last section.
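The relationship between dbgrep and pars can be illustrated with a short sketch. It is written in Python rather than the original Perl, and it simplifies the search criterion to a single regular expression, whereas the original accepted a more general expression. The function signatures, the `invert` parameter name, and the sample database are assumptions.

```python
import re


def split_records(text):
    """Split the database text into records separated by empty lines."""
    return [block for block in re.split(r"\n\s*\n", text.strip()) if block]


def dbgrep(text, pattern, invert=False):
    """Return records that match the pattern, or those that do not when invert is set."""
    rx = re.compile(pattern, re.MULTILINE)
    return [rec for rec in split_records(text)
            if bool(rx.search(rec)) != invert]


def pars(text, paradigms):
    """Return a sorted list of records whose P field lists exactly the given paradigms."""
    wanted = sorted(paradigms)
    hits = []
    # Use dbgrep to preselect records that have a P field at all.
    for rec in dbgrep(text, r"^%P[ \t]"):
        listed = re.search(r"^%P[ \t]+(.*)$", rec, re.MULTILINE).group(1).split()
        if sorted(listed) == wanted:
            hits.append(rec)
    # Records start with the N field here, so this orders them by name.
    return sorted(hits)


# Hypothetical database contents for the example.
DB = """%N LangA
%P imperative logic

%N LangB
%P logic

%N LangC
%P logic imperative
"""

for rec in pars(DB, ["imperative", "logic"]):
    print(rec.splitlines()[0])
print(dbgrep(DB, r"^%P.*imperative", invert=True))
```

The inverted search is what makes a catch-all section possible: it selects every record that none of the earlier criteria picked up.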
In order to match the record output order of these programs and create meaningful tables, the records and fields are sorted in various phases of the text generation, using either the sort statement available in Perl or two additional programs, linesort and llinesort, which sort the words on every input line. Llinesort performs this operation only on lines matching a specific pattern.

This division among the many specialized programs resulted in a modest implementation effort: 697 lines of code divided as indicated in Table 5. The largest program, maketext, is only 117 lines long including the text that is copied verbatim to the output file. The output of the text generator is in the LaTeX text markup language. The 1134-line language database is converted into 1464 lines of LaTeX, which also contain commands to include another 371 lines of human-generated text.

Conclusions

The database described in the previous sections and the associated tools were used for almost three years. During that time, the database was frequently edited and revised. In some cases the output format was also modified. Both types of changes would have been very difficult if the data had been statically embedded in a document. With the approach I described, changes were made to the structured database, and a single command reflected them in the camera-ready output. The ease of adding new records and modifying existing ones encouraged me to keep the database current. I have thus found that special-purpose throw-away tools can be effectively used to generate text automatically from structured data collections. For those who might be interested in using this approach in other domains, the following suggestions may be of help:
Table 5: Implementation effort and details
First posted: 17th September 1998 efc Last changed: 17th September 1998 efc