http://www.spinellis.gr/pubs/jrnl/1993-JCLT-CType/html/tsl.html This is an HTML rendering of a working paper draft that led to a publication. The publication should always be cited in preference to this draft using the following reference:
|
Diomidis Spinellis
Myrsinis 1
GR-145 2 Kifissia
Greece
dspin@leon.nrcps.ariadne-t.gr
In a separate compilation environment type checks across source files are difficult to implement, because the natural place to perform them, the linker, is rarely under the control of the compiler developer. This is typically handled either by programs that perform a global declaration and use consistency check, or by relying on the programmer to create and use a consistent set of headers. We present a solution to this problem based on encoding the identifiers with their types. Each function or variable identifier has an encoding of its type appended to its name. In this way type mismatches are caught at link time as undefined references. Multiple definitions of the same identifier with different types are handled by creating dummy variables.
As we mentioned in the introduction, the last part of a C program translation can (and often has to) be performed by the system linker. The problem with this approach is that conflicting declarations can not be checked across different modules. Consider the following two source files:
/* File 1 */ int f(int i) \{ return i * 2; \} /* File 2 */ #include <stdio.h> int f(void); void g(void) \{ printf("%d\esc{}n", f()); \}
In our system, both files compile[+] and link without any warning or error messages. The two files contain declarations[+] for the same function (f) that do not have compatible types. Specifically, in file 1 f is defined as int f(int) whereas in file 2 it is declared as int f(void). According to ANSI C (§3.1.2.6) the behaviour of a program containing declarations that refer to the same object or function that do not have compatible types is undefined (in our system g prints 2 when called from main). The compiler is not violating the ANSI standard, since ``ignoring the situation with unpredictable results'' is permissible undefined behaviour (§1.6). However, a higher quality implementation should issue of a suitable diagnostic statement (such as ``conflicting declarations for function f in file 1, file 2'').
f: variable # of args. f1.c(5) :: f2.c(8)The solution is not very efficient as typically lint needs to examine all modules in order to determine whether the functions exported are used in a consistent way. It also depends on cooperation from the programmer in order to run lint and keep the lint libraries up to date.
/* File 1 */ #include "defs.h" int f(int i) \{ return i * 2; \} /* File 2 */ #include <stdio.h> #include "defs.h" void g(void) \{ printf("%d\esc{}n", f()); \} /* defs.h */ int f(void);Compiling the two functions in our system generates the following error message:
f1.c:5: conflicting types for `f' defs.h:1: previous declaration of `f'
The problem with this approach is the need to keep the headers consistent with both the source they refer to, and the compiled objects that depend on them. Special tools such as mkdep [MSD90, mkdep(1)] can be used in conjunction with make [Fel79] to achieve the desired effect. Their effectiveness depends on the cooperation, concentration and organisational powers of the programmer.
Function name encoding is a technique where identifiers are encoded with auxiliary information signifying their type. This information is used at link time in order to detect declaration and definition inconsistencies. The scheme was first proposed by [Ham76], and in connection with the C++ language in [ES90, p. 121-127]. It was also reportedly used by the the DTSS PL/I Linker developed at Dartmouth College in about 1975 []. The method presented in [ES90] works by appending a double underscore followed by character encodings of the function argument types after the function name. Basic types are encoded as single characters (listed in table 1), type modifiers and declarators are encoded by prepending character codes (listed in table 2) to the basic type and user defined classes are encoded using their identifier name prefixed by its length.
Type | Character Code |
char | c |
double | d |
float | f |
int | i |
long | l |
long double | r |
short | s |
void | v |
... | e |
Modifier | Character Code |
array of size n | An |
volatile | V |
const | C |
function | F |
pointer | P |
reference | R |
signed | S |
unsigned | U |
For example the function:
double a(double b, int c) \{ ... \}
would be encoded as a__Fdi where F stands for function, d for double and i for int. If the function was declared in another translation unit with a different declaration, then the two encoded function identifiers would not resolve at link time generating an ``undefined symbol'' error message. The description n [ES90] also contains encodings for the C++ operator functions, ways to minimise the length of the encodings, and a hashing proposal to deal with linkers with short identifier length limits.
The advantages of this approach, as given in [ES90, p. 122], are:
The scheme does not encode types of variables and return types of functions. This is necessary in order to ensure that errors arising from declaring a variable or function in two different modules with the same name, but different type or return type correspondingly, are caught by the linker. (Defining the same function with different argument types in separate modules is allowed in order to provide for C++ function overloading).
This scheme handles, in general, checks 1 and 3, and check 2 for function arguments. If the scheme was naively extended to handle global variables and function return types it would perform checks 1 and 2, but not check 3 thus altering the semantics of the language. For example the following which is not a correct program (§3.5) would link without a problem:
/* File 1 */ int i; /* File 2 */ double i;
The naively extended scheme would encode the variable in file a as i__i and the variable in file b as i__d. The type clash would not be detected at link time as the two variables would end having different names.
The authors suggest [ES90, p. 123] that handling all inconsistencies would require either linker support or a mechanism allowing the compiler to access information from separate compilations. A solution to this problem without a need for linker modification is presented in [Spi91]. In the following section we discuss how this solution can be applied for declaration checking of C programs.
- 1.
- Every function with external linkage has its return type encoded on its name by appending to the function name the return type encoding before the parameter encodings. In addition for every function, a dummy object definition named after the encoded function name is inserted into the object file.
- 2.
- Every object with external linkage has its type encoded into its name by appending an uppercase V followed by the type encoding to the object name. In addition, for every object with external linkage, a dummy object definition with the same name as the original object is inserted into the object file.
This scheme is essentially the naive extension presented in the previous section with an additional check in the form of dummy variables in order to handle the check in case 3. Objects with external linkage and conflicting definition and declaration/use will appear on the linker error list as unresolved references of the encoded names; object with external linkage and conflicting definitions will appear on the linker error list as multiple definitions of the un-encoded names.
In the following two examples we present sample programs with the corresponding encodings for the two cases:
/* File 1 (un-encoded) */ int a; /* File 2 (un-encoded) */ extern double a; void f(void) \{ a = 3.14; \}These files can be transformed into an equivalent source form with type encoding as follows:
/* File 1 (encoded) */ int a__Vi; char a = 1; /* Dummy variable */ /* File 2 (encoded) */ extern double a__Vd; void f__Fvv(void) \{ a__Vd = 3.14; \}
The linkage of the two files produces the following error on our system:
f2.o: undefined reference to `_a__Vd' f2.o: undefined reference to `_a__Vd'
/* File 1 (un-encoded) */ int a; /* File 2 (un-encoded) */ double a; void f(void) \{\}
These files can be transformed into an equivalent source form with type encoding as follows:
/* File 1 (encoded) */ int a__Vi; char a = 1; /* Dummy variable */ /* File 2 (encoded) */ double a__Vd; char a = 1; /* Dummy variable */ void f__Fvv(void) \{\}
The linkage of the two files produces the following error on our system:
f2.o: multiple definition of `_a (.data)' f1.o: first seen here
Type | Character Code |
void | v |
char | c |
signed char | b |
unsigned char | a |
short, signed short, signed short int, short int | s |
unsigned short, unsigned short int, | t |
int, signed, signed int | i |
unsigned, unsigned int | u |
long, signed long, long int, signed long int | l |
unsigned long, unsigned long int | m |
float | f |
double | d |
long double | r |
... | e |
Other types and qualifiers are encoded using the following rules:
struct symbol \{ int len; char *name; \} s;
The system described in the previous sections has not yet been implemented. We have tried the various encodings to verify the soundness of the approach. One possible implementation approach we are currently considering would in the form of a source to source transformation filter that would run before the compiler proper. This could be used to retrofit recalcitrant compilers with advanced inter-module type checking. In the following paragraphs we will examine some other implementation details.
The easiest way to handle the dummy variables is to define them as character objects. This is straightforward to implement as a source to source transformation. However, these variables may take up some data space in the final executable image. A more elegant solution that uses no space, is to define the variables as global constants at the linker level. This can also be implemented in a source to source transformation if the compiler supports constants at linker level.
A ``feature'' often found in linkers is that of allowing more than a single definition for an object. This is often used to implement FORTRAN common blocks [Ros84, §5.8]. It is also used by C implementations to allow a number of tentative definitions in different modules [KR88, p. 227] (§3.7.2). The system proposed in this article depends on the linker detecting multiple definitions. This can be enforced on most linkers by supplying an initialiser to all dummy definitions used in the proposed scheme, thus removing their tentative quality. Some linkers however, silently ignore multiple definitions of the same object even when initialiser is supplied. In that case it may be impossible to implement the scheme.
An implementation of this scheme must take into account that supporting software needs to be made aware of the encoding scheme. For example, debuggers, profilers and other utilities that access the compiler generated object and executable code may need to be modified. In addition, the output of the linker could be filtered to convert the error messages to more meaningful ones.
Finally, the implementor of this approach must keep in mind the identifier length restrictions of the target linker. If these are severe (ANSI allows linkers with a 6 character identifier length restriction), then some form of hashing for longer identifiers has to be considered.
Diomidis D. Spinellis is currently completing his Ph.D. on multiparadigm programming at Imperial College of Science, Technology and Medicine (University of London). Mr. Spinellis has worked as a contract programmer in the areas of CAD, Unix administration, multimedia, and product localisation for the Greek market. In relation to the C language he is best not known for his three winning entries in the International Obfuscated C Code Contest.