http://www.spinellis.gr/pubs/trade/1998-Login-TextTools/html/paper.html This is an HTML rendering of a working paper draft that led to a publication. The publication should always be cited in preference to this draft using the following reference: The document's metadata is available in BibTeX format. Find the publication on Google Scholar This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder. Diomidis Spinellis Publications |
Diomidis Spinellis
SENA S.A.
Byzantiou 2
GR-142 34 Nea Ionia
Greece
e-mail: dds@senanet.com
March, 1998
The textual description and analysis of large data sets can be a repetitive and error-prone task. Examples of data sets which may need to be textually described include population statistics, market research results, scientific and engineering data such as chemical compounds or earthquake structural damages, and literature surveys. In all such cases the data needs to be organised and presented in a meaningful way. In some cases this process is periodically repeated (e.g. market research results); in other cases the base data set is frequently updated (e.g. a literature survey). Automating the data presentation process can reduce the work required to generate the reports, eliminate a source of errors, enhance the outcome's consistency, and speed-up the generation process.
In the following paragraphs I describe a small set of special-purpose tools developed for the automatic production of a multiparadigm language literature survey. A set of 104 languages combining different programming paradigms needed to be presented in a meaningful way, tabulated, indexed, and cross-referenced. I decided to divide them according to the paradigms they supported and present them by listing certain characteristics of each language. During the course of my investigation the set of languages surveyed grew by 100% and was frequently revised; this made the hand editing of the text error-prone and unproductive.
Although the tools developed are special purpose, the methods used for the construction of the tools and the generation of the output can be applied in a number of different situations.
The results of the survey were entered into a simple database structured as a text file. The file is structured similar to a refer database: records are separated by empty lines, and record fields are identified by a letter following the percent character at the beginning of a line. A sample record from the database can be seen in figure 1. The record's fields are the name of the language N, significant references R, its characteristics C, its usual implementation I, the paradigms it supports, P and a short descriptive text D.
%N Modula-Prolog %R Mul86 %C UN RL RT nDT BR... ...a-2. The library includes term handling procedures. |
The text generator scans the database and divides the languages into categories according to the paradigms supported by each language. Each such category (e.g. languages that support the logic and object-oriented paradigms) is formatted as a separate text section. The generator formats the title as in the following example:
2.2.5 Combinations of Imperative and Logic ParadigmsIt then inserts some manually prepared descriptive text and appends text that tells the reader the number of languages in that category, provides pointers to the summarising tables (examples are tables 1 and 2), and introduces the paragraphs to follow. The following is an example of the automatically generated text:
We found ten languages that combine the imperative and logic programming paradigms. Their implementations are summarised in table 1 and their characteristics in table 2. In the following paragraphs we list the most important features of each language.
A description of the languages is then produced based on the R, D, and N record fields:
...
- 2.PAK
- [Mel75] Block structured language offering user-defined pattern matching and backtracking.
- C with Rule Extensions
- [MS90] Based on the C programming language with an extended syntax, a richer set of data types, a flexible input/output system and a forward chaining [Ric83, p. 56] execution strategy.
- Leda
- [Bud91] Language with syntax similar to that of Pascal, with an additional code abstraction facility, the relation. The data-space for all entities contains the undefined value. Relations are coded as Prolog rules, and allow backtracking.
Name | References | Implementation |
2.PAK | [Mel75] | Language |
C with Rule Extensions | [MS90] | Extension of C, preprocessor |
Leda | [Bud91] | Language |
Logicon | [LC86] | Prolog interpreter in Icon |
... |
Name | Characteristics | Control | ||||||
BR | R | RT | ||||||
2.PAK | * | |||||||
C with Rule Extensions | *, X | |||||||
Leda | SLD, X | |||||||
Logicon | X, SLD | |||||||
... |
The generator inserts at the end of each section another tailor-made paragraph containing concluding remarks on that section. A special section contains all languages that could not be fitted into one of the preceding sections together with a special table listing the paradigms each language supports (table 3). After all sections have been processed the generator produces a summary of the number of languages supporting each paradigm combination (table 4).
Name | Paradigms |
DSM | Imperative, Object-oriented and Relational |
Echidna | Constraint, Logic and Object-oriented |
Educe | Database and Logic |
Enhanced C | Imperative and Set |
Fooplog | Functional, Logic and Object-oriented |
Icon | Generators and Imperative |
KE88 | Functional, Logic and Object-oriented |
Kaleidoscope | Constraint, Imperative and Object-oriented |
Lex | Imperative and Regular-Expression |
ML-Lex | Functional and Regular-Expression |
... |
Functional | |||||||||
Imperative | |||||||||
Object-Oriented | |||||||||
Logic | |||||||||
Distributed | |||||||||
Constraint | |||||||||
Number of languages | 24 | 10 | 9 | 8 | 11 | 7 | 5 | 4 | 5 |
I implemented the generator as a set of ten small tools. A separate program was used to administer the database in cases where a text editor was not adequate. Each of the ten tools performs a small specialised task. The tools are implemented in the Perl and Bourne shell (sh) languages making use of additional Unix tools such as grep and sed.
The system's driver is the maketext program which given a list of interesting paradigm combinations generates the section title and the opening paragraph. It then calls the external programs desclist to create the description list and chartable/imptable to create the characteristics and implementation tables. After the whole database has been processed the same program generates the summary table. One other program, partable, generates the language/paradigm table for the last catch-all section. All programs take as argument the paradigm combination described in the section processed, or for the last section the paradigm combinations already processed. Maketext is implemented in Perl, making extensive use of regular expressions to parse the database.
The second level programs operate on the output of the pars program which given a paradigm combination generates a sorted list of all records exactly matching that paradigm combination. A similar program, chars, generates only the characteristics of the paradigm combination.
Pars depends on its operation on the dbgrep program which scans the database for records matching the search criterion specified as an expression possibly containing string regular expressions. Dbgrep also takes a flag to display all records not matching the criterion. This feature is used in the generation of the last section.
In order to match the record output order of these programs and create meaningful tables the records and fields are sorted in various phases of the text generation, either using the sort statement available in Perl, or two additional programs linesort and llinesort which sort the words on every input line. Llinesort performs this operation only on lines matching a specific pattern.
This division among the many specialised programs resulted in a modest implementation effort -- 697 lines of code divided as indicated in table 5. The largest program, maketext, is only 117 lines long including the text that is copied verbatim to the output file.
Program | Implementation | Lines |
chars | Sh | 13 |
chartabl | Perl | 59 |
dbgrep | Perl | 27 |
desclist | Perl | 17 |
imptable | Perl | 38 |
linesort | Perl | 19 |
llinesor | Perl | 26 |
maketext | Perl | 117 |
pars | Sh | 13 |
partable | Perl | 27 |
Total | Perl | 697 |
The output of the text generator is in the LaTeX text markup language. The 1134 line language database is converted into 1464 lines of LaTex which also include commands to include another 371 lines of human-generated text.
The database described in the previous sections and the associated tools were used for a period of almost three years. During that time the database was frequently edited and revised. In some cases the output format was also modified. Both types of changes would have been very difficult if the data was statically embedded in a document. Using approach I described, changes were made in the structured database and a single command reflected them in the camera-ready output. The ease of adding new records and modifying existing ones encouraged me to keep the database current.
I have thus found that special purpose throw-away tools can be effectively used to automatically generate text out of structured data collections. For those who might be interested to use this approach in other domains the following suggestions may be of help: