http://www.spinellis.gr/pubs/conf/1997-DSL-Lightweight/html/paper.html This is an HTML rendering of a working paper draft that led to a publication. The publication should always be cited in preference to this draft using the following reference:
|
Diomidis Spinellis
V. Guruprasad
University of the Aegean
IBM T. J. Watson Research Center
83200 Karlovassi, Samos
Yorktown Heights,
Greece
NY 10598, USA
dspin@aegean.gr
prasad@watson.ibm.com
Software subsystems can often be designed and implemented in a clear, succinct, and aesthetically pleasing way using specialized linguistic formalisms. In cases where such a formalism is incompatible with the principal language of implementation, we have devised specialized lightweight languages. Such cases include the use of repeated program code or data, the specification of complex constants, the support of a complicated development process, the implementation of systems not directly supported by the development environment, multiparadigm programming, the encapsulation of system level design, and other complex programming situations. We describe applications and subsystems that were implemented using this approach in the areas of user interface specification, software development process support, text processing, and language implementation. Finally, we analyze a number of implementation techniques for lightweight languages based on the merciless exploitation of existing language processors and tools, and the careful choice of their syntax and semantics.
Software subsystems can often be designed and implemented in a clear, succinct, and aesthetically pleasing way using specialized linguistic formalisms. In cases where such a formalism is incompatible with the principal language of implementation, we have devised specialized lightweight languages. The soundness of this approach is amply demonstrated by the use of declarative languages for specialized applications [Hug90,Mos91], the attention given to very high level languages [Use94], and multiparadigm programming research [Hai86,SDE94]. The optimum formalism for implementing a subsystem is often incompatible with the language used for the rest of the system. In such cases, we propose the use of specialized lightweight languages, designed to closely match the formalism of the problem and to be easy to implement. In this way, the linguistic distance between the specification of the problem and the implementation of its solution can be minimized, resulting in cost reductions and improved project quality. In the following sections, we analyze our method's application domain, provide a number of representative case studies, outline the techniques we use, and list its advantages and associated problem areas.
Lightweight languages are particularly helpful in cases where a classical implementation would require:
In addition to the above application domains, lightweight languages can be beneficially employed to support:
In other cases, a system's application environment may impose unnatural restrictions on the way a user interface is implemented e.g. by requiring that parts of a given command functionality be dispersed among different event procedures, resource declarations, and icon/help text definitions. A simple language can be used to define the functionality in an organized way and automatically create the code in the format required by the application environment.
The specialized language can be implemented as an interpreter or as a compiler. The compiler is usually implemented as code generator whose target language is the main implementation language (examples 3.1.1, 3.4.1), but it can also be implemented as a subsystem that performs the translation at runtime as in example 3.3.1. Furthermore, the language can be implemented as a built-in interpreter as in example 3.3.1 aiding the easy modification of a system's parameters at runtime. The Unix termcap database and the application described in section 3.3.2 are examples of implementations as data stream filters.
Building interpreters and compilers for one-time use makes it a routine skill that becomes easy to apply again and again in different ways. It also helps in developing the analytical ability for dividing and conquering problems, and for orchestrating available tools, as illustrated in [SK97] and sections 3.1.2 and 3.1.3.
Task | Source | Compiler | Output |
(lines) | (lines) | (lines) | |
GUI Strings | 468 | 67 | 771 |
User Messages | 2490 | 65 | 4345 |
Layer Control | 91 | 244 | 711 |
Table Contents | 266 | 618 | 1901 |
Toolbars | 373 | 187 | 1568 |
Menus | 42 | 81 | 401 |
Serialization | 1174 | 297 | 6191 |
Commands | 1214 | 248 | 7367 |
Properties | 1093 | 839 | 14632 |
Total | 7211 | 2646 | 37887 |
One of the specifications for the system called for a property dialog box similar to the one found in the Delphi, MS-Access, and Visual Basic programming environments. The system supports 457 properties divided into 30 groups. Every property was given a property type such as number, angle, color, menu selection, yes/no, dialog box, and file name. We then formulated the properties in the language as illustrated in the example in Figure 1. A compiler translates the property definitions into three C code files. One of the files contains the variable and procedure definitions, the other, the procedures for initializing the supporting data structures, and the third one contains code for displaying the properties.
// Standardized selections #define ANGLE 0;45;90;135;180;225;270;315 #define TSIZE 6;8;10;12;14;16;20;24 // Text object property selection #ifdef TEXT //Type Caption Variable Range Fmt Sel menu :Text :prop_text double:Size :cb->size :0 :100:%.3f:TSIZE angle :Angle :cb->theta :0 :90 : :ANGLE bool :Enhance:cb->enhance: : color :Color :cb->color : : separator selres:Font :cb->font :3 :font_dir sel :Allign :cb->allign :Left;Right;Center dialog:More...:CTextDialog:textdlg #endif |
The compiler consists of 839 lines of Perl code and produces 14632 lines of C code. Before the compilation, the source code is passed through the C language pre-processor allowing the use of macro instructions at a minimal cost.
The description file is compiled into C in a single pass by writing the declarations and the code definitions into two distinct files; the definition file includes the declaration file using the C #include mechanism. In implementing the above example we made extensive use of Perl's variable substitution facility for creating customized and efficient C code.
The software development process is a less mature field than programming in the small. The requirements, organization techniques, procedures, and development tools vary widely between organizations and projects. Lightweight languages can provide support for the above by automating repetitive work, organizing complicated tasks such as configuration control, and forcing ill-defined processes onto a formal specification vehicle. The following paragraphs outline two representative examples in the area of software development process support.
In a large application that we developed we had to specify the ordering of 530 files in the installation disks as well as the associated parameters. Some of the files were needed only during the setup stage and should be placed in the first disk, some of them should, additionally, not be compressed. Other files had to be installed into specific system directories while others had to be installed only as an installation option. This layout is typically specified by a GUI tool provided as part of the Microsoft Windows software development kit. The tool then distributes the files to disks and creates the installation driver program. In our case, the GUI tool could not handle the number of files needed by our application, and, in addition, had only limited grouping capabilities. We thus implemented a little language based on regular expressions that defines file sets and their installation specifications together with some global settings. Part of this specification can be seen in Figure 2.
# Anchor = /distrib/ # The following are copied verbatim =SRC = \distrib =WRITEABLE = 1 =DISKLABEL = Application Disk 1 # Installation files (do not compress) >DISK1,!COMPRESS,!VERSION,!READONLY, !DECOMPRESS,SECTION=Setup,. +setup.exe +setup.lst # Library files >ANYDISK,COMPRESS,!VERSION,!READONLY, DECOMPRESS,SECTION=LibFiles,. +lib/* # All other files (application files) >ANYDISK,COMPRESS,!VERSION,!READONLY, DECOMPRESS,SECTION=AppFiles,. +*.exe +helplib.dll +* |
The specification file is then translated into a file compatible with the one created by a GUI tool. In our case, the regular expression-based specification file is 66 lines long and creates a 560 line specification. The short length of our specification makes it comprehensible, readable, and ammendable to processing by other tools.
A 55 line Perl program handles the translation, identifies simple mistakes, and prints warnings for the files that fall into the default, the most general specification case, so that the user can verify that the files are given valid installation specifications.
%S description of step # list component files to work on %L a.c b.c c.c ... %D mkdepend.sh %s ... %M $(CC) -c $@ $*.c ...
As processor speeds increase, text is increasingly used as a common communication format between different subsystem parts in a quest for portability and maintainability. However, most compiled programming languages lack functionality for dealing efficiently and intuitively with text strings. The design of a small lightweight language to be embedded in a system can be used to ameliorate this deficiency by providing an efficient implementation of the facilities needed for using the specific textual interface. Two representative examples are described in the following paragraphs.
A statistical data relational database program we developed had to interface to a specialized video typesetting device. The device was controlled using Postscript [Inc85]. Structuring all the resulting code around Postscript was not feasible as Postscript is a relatively low-level language. Embedding large blocks of Postscript into the program was inelegant. We therefore defined a Postscript variant as a lightweight language that allowed for the embedding of meta-variable names. These names (all starting with the $ sign) were substituted at the runtime interpretation phase with the SQL query results. In order to allow for the definition of tabular data with variables of the same name the semantics of the language mimic linear logic in allowing each variable substitution to happen exactly once. An external loop within the code thus calls the interpreter multiple times for the same code body, each time setting the variable names to the values of the particular table row. After the implementation, all Postscript code that was embedded within the rest of the program was easily moved into external files. When a particular page needs to be displayed the file is loaded, interpreted, and downloaded into the Postscript typesetter. Most of the Postscript code of the deployed release was generated by experienced artists using specialized graphic design tools.
Either in response to a genuine need, or through the occupation of individuals with domain specific knowledge, the area of programming language implementation has traditionally been a hotbed for lightweight languages. The increasing use of multiparadigm and mixed-language programming, and the possible adoption of lightweight language tools that we advocate, increase the requirements for an efficient process of language implementation. We believe that the proliferation of languages used for implementing languages stems from the need to specify formal transformation rules for a wide variety of domains that occur in a language realization. The use of a special purpose (for mature tasks such as parsing) or a lightweight language can aid the specification -- and a subsequent implementation -- of tokens, grammars, type systems, tree transformations, optimizations, and machine code generators. Below we outline three representative examples from our own work.
We designed and implemented term as a lightweight language for the realization of the blueprint multiparadigm environment [SDE94]. For that environment we needed to implement a functional and a logic programming language. Such languages are easily implemented using declarative rules as meta-interpreters in their own language [SS86, p. 150] [FH88, pp. 193-195]. This practice is unfortunately associated with serious bootstrapping and performance issues. Based on the intuition that with only a moderate increase in code size the above languages can be implemented in a language that does not support important features of functional and logic programming such as deep unification, input/output parameters, backtracking, and higher order functions, we defined term as a simple tuple pattern matching, rule-rewrite system. Term is implemented in 2300 lines of term, yacc, and lex as a compiler from term to C. It was bootstrapped using a Prolog interpreter since the syntax of term is very similar to that of Prolog, lacking however many of Prolog's logical features. The use of term as an implementation vehicle proved to be a good choice, since we were then able to implement the functional and the logic languages in just 1300 lines of structured term code that was then compiled into efficient C code.
An experimental Prolog extension was also built using the dictionary objects as independent ``knowledge domains'', unlike most implementations of Prolog, which provide only a default namespace for the functors, and provided 4th escapes like:
% 4th.logic init file { % escape to perform inits in true 4th (/usr/lib/4th.init) ldsrc % push to logic domain stack /lathe_rules load beginlogic } % Note-- :n means n-th from top of stack. =(X,X) :- { :0 :1 unify }.The logic programming feature could be used, for instance, to integrate tool-specific reasoning in factory automation.
Task | Source | Compiler | Output |
(lines) | (lines) | (lines) | |
Character classification | 44 | 87 | 332 |
Tree inversion | 527 | 223 | 441 |
Tree printing | 527 | 244 | 1175 |
Tree copy | 527 | 206 | 507 |
Library functions | 178 | 131 | 410 |
Machine description | 604 | 274 | 745 |
Total | 1353 | 1165 | 3610 |
Lightweight languages are by their nature ephemeral and limited in scope. The effort expended in implementing them should be less than the effort required to build the system without their use. In the following sections we will outline some implementation techniques that we have successfully used to implement the languages we described in a robust yet economical way.
Many simple interpreters can be implemented using text processing tools such as awk [AKW88] and Perl [WS90]. This implementation avenue can be particularly effective if the language has been designed by following the guidelines outlined in the following four paragraphs. Interpreters and compilers for more complicated languages can always be modeled using the lex/yacc [JL87] model; often performing the translation or code generation during parsing without a need to create an intermediate syntax tree.
One other way to increase the capabilities of the lightweight language with little implementation cost is to utilize the power of the implementation tool. As an example, some tools have meta-linguistic capabilities allowing the interpretation of their own language. In this case the lightweight language source code can include expressions or statements written in the implementation language of the interpreter. In one of the languages we implemented we allowed the inclusion of full Perl expressions since these can be easily evaluated using Perl's eval function.
Simple but robust interfaces can be easily built even with ksh [Kor94]. We have used simple ksh loops for everything ranging from compilation scaffolds and simple Web servers to entire online services (sections 3.1.2 and 3.1.3). Including simple features like shell escapes and I/O redirection is not only convenient for Unix users, but also helps in scripting and logging production runs.
To successfully use the techniques described above it is important to design the lightweight language in such a way as to make it compatible in syntax, semantics, and form with the target language, the interpreter language, or the C preprocessor.
By keeping the lightweight language small and limited in scope it should be possible for a single person to design and implement it. By a judicious selection of syntax and semantics the language can be implemented without complicated processing. Many of the languages we have designed are line and not free-form based. They can thus be compiled one line at a time, sometimes by a simple application of pattern matching and regular expressions. Line comments are also easy to remove without complicated loops and the semantic problems associated with nested block comments. An even easier to compile form of language syntax is table-based (as the example in section 3.1.1) and can thus be easily processed by field-based tools such as awk. The processing of variables can often be simplified be prefixing them with a special character such as $.
Systems we have implemented made use of as many as 13 different languages. One cannot expect a person maintaining the code to be familiar with all of them. For this reason we try to make every source file self-documenting. We try to make the language similar to well-known languages and keep its syntax and semantics simple and intuitive. A comment at the beginning of each source file of no more than a dozen lines should be all that is needed to describe the language and its use. If this is not possible, then something may be wrong with the language design.
The software process covering the use of lightweight languages is not yet mature. Issues of lightweight language design in the areas of granularity, usability, interfacing, and architecture need to be examined and evaluated. The potential scalability limits of this approach should also be taken into consideration. We have used lightweight languages in projects composed of up to 200K lines of code without experiencing significant problems. Significantly larger projects may uncover problems in the areas of namespace pollution, tool portability, performance, group coordination, and resource management.
Furthermore, the developers and users of lightweight languages have to support their language on their own. Tools such as debuggers, metric analyzers, profilers, class browsers, and context-sensitive editors will be not be available. Users will have to do without them, handcraft their own, or resort to the lower level facilities provided by the underlying language. Other resources commonly available for mainstream languages such as trained developers, courses, libraries, books, and external consultants will also be absent. In addition, developers may need to reinvent techniques used for formal program validation and analysis, and re-establish metrics associated with the development process.
The problem of choosing an application's implementation language forms part of language design research [Wir74,Wex76,Hoa83]. In this section we examine some representative specialized languages that justify our approach. The implementation of a number of specialized languages can be found in [AKW88] and [Ben88, pp. 83-100]. A data driven approach to structure programs is recommended in [KP78, p. 67]: ``let the data structure the program.'' An example of a language with a narrow, specialized application domain is the one described in [CP85]. Specialized languages and tools for compiler construction are given in [JL87,Fra89,Spi93a]. Furthermore, specialized language dialects can be implemented using the system described by [CHP88].
The Unix operating system follows the tradition of using a specialized language for the definition of an application's startup status. Some of these languages are sufficiently advanced so that part of an application can often be implemented in them. As an example the sendmail mail transfer agent only implements a simple translation engine configured by a special startup file. All rules for routing and rewriting message headers are defined in that external configuration file (the infamous sendmail.cf file). Simpler yet useful examples of specialized languages in the Unix environment are the system state initialization file inittab, the terminal capability descriptions in termcap, the process schedule specification crontab, and the Internet daemon master switchboard inetd. The plethora of command initialization files for tools such as the X-Window system (.xinitrc), the ex editor (.exrc), the window manager (.twmrc), and the mail user agent (.mailrc) prompted the development of TCL [Ous94], an embedable, interpreted, string processing language aiming to provide a single consistent tool programming interface across all different tools. Finally, the troff [Oss82] family of Unix-based text processing tools, builds on small specialized languages for typesetting equations (eqn [KC74]), tables (tbl [Les82]), pictures (pic [Ker84]), and graphs (grap [BK86]).
Our approach builds on the ideas mentioned above by promoting the use of specialized lightweight languages as an integral part of the software development process.
The use of lightweight languages as software engineering tools bridges the semantic gap between the specification and the implementation, can offer economies of scale when implementing repetitive concepts, and can result in code that is readable, compact, easy to maintain, and concisely documents the overall structure of the application.
However, the proliferation of different languages within a project can contribute problems related to the poorly understood process, lack of tools, and scarcity of related resources. All these elements need further examination and research. We strongly believe that this research will extend the applicability of our approach and amplify the advantages brought by the use of lightweight languages in the software development process.