http://www.spinellis.gr/pubs/trade/2006-EOSJ-BadCode/html/10tips.htm This is an HTML rendering of a working paper draft that led to a publication. The publication should always be cited in preference to this draft using the following reference:
|
10 Tips for Spotting Low-Quality Open Source Code
By Diomidis
Spinellis
From the
vast fields of Open Source Software (OSS) projects, we can gather both wheat
and chaff. Distinguishing between the two can yield a significant edge in using
OSS: We can avoid false starts and back champions with long-term prospects of
success. For large, highly visible projects, we can use as our guide public
reviews, comments in forums and blogs, and the opinion of colleagues who have
used them. However, when we need to choose a project from the remarkably long
trail of specialized OSS code that fits every conceivable need, these resources
simply may not be available. In such cases, OSS lets us browse the underlying
source code and reach our own opinion on what we’re dealing with.
Here’s a
list of 10 clues for spotting bad code:
Writing
good comments is more difficult than writing good code, so identifying code
that has poor commenting is child’s play. If you see non-trivial functions or
methods lacking a comment at their beginning, explaining what they do, you know
you’re in trouble. The same goes for global variables, class and structure
fields, and code blocks implementing a complex algorithm: All should appear
with a comment. Not everything requires a comment; straightforward code and
many local variables are better left to explain themselves.
Some
comments are useless because they simply repeat what’s obvious from the name of
the entity they explain. A particular thorn here are elaborate javadoc comments
that some sophisticated editors automatically create and some unsophisticated
programmers never fill-in. Such comments consume valuable screen real estate
and distract the code’s reader without contributing something to the program’s
understanding.
Excessive
commenting is also a problem. As programmers maintain the code, they often
forget to update the comments (especially when there are many detailed comments
and they’re unsure about what they’re doing), and this causes the comments to
diverge from reality. So, unless a program is written in assembly language, a
comment on every line of code can be just as troublesome as too few comments.
Problems
with the naming of the program’s entities are also easy to spot. One area is
inconsistency. When you see in the same program CamelCaseIdentifiers mixed with
underscore_separated_names, you know
you’re treading on the remains of a battle between different
personalities. Similarly, if you find a wanderGet and a setWander method, you
know the programmer who wrote the code probably had his mind somewhere else at the time.
When
choosing names, moderation should be the norm. A long, overly descriptive name
used for a local index variable that should be named “i” is just as bad as a
short, cryptic global variable name.
Another
sign of bad code is inappropriate use of spaces for indenting the code and
separating the program’s elements. There are many different ways for indenting
the code to indicate its structure and no excuses for not using them, for
inventing a new one, or for combining two in the same file. Any of those three
sins should alert you that the code is sick and the problems could go deeper
than the surface.
The same
applies to use of spaces for forming statements and separating operands from
operators in expressions. Novice programmers often fail to notice specific
spacing rules for programs, so watch out for those spaces.
You also
can spot bad code in calls to library functions that fail to an error return.
In C, code errors are often passed from the Application Program Interface (API)
to the program in-band (together with the function’s result). A special value, such as NULL or a
negative integer, is used to indicate the result isn’t an entity the caller
would expect to be returned, but an indication of an error. If the program
fails to check the return value of, say, open or malloc, for NULL, then you can
be almost certain there will be situations where the program will crash. In the
Java and Microsoft .NET platforms, errors are typically communicated out of
band through exceptions. In these cases, dodging an error situation requires
explicit effort: an empty exception handler.
Handling
errors is difficult because the actions you can take after the error occurs
rarely rectify the error. So programmers must check one possible error after
another merely to report it and exit gracefully. Programs that don’t check for
API errors are often untrustworthy.
Copying
code from one place to another is easy to do, but finding that code after many
years and many programmers have left their signs on the copies is rarely that
straightforward. We typically encounter the situation of repeated code
sequences when we fix an error or introduce a change, then look in the program
for further places we should adjust (this is always a good practice).
Sometimes, we locate another subtly different instance of the code we modified,
then another, and another. These code segments violate the “Don’t Repeat
Yourself” (DRY) and “single point control” principles.
Although
repetition is rarely glaringly apparent, the trouble it causes goes deeper than
the surface. Repetition indicates a failure to use appropriate abstraction
mechanisms such as methods, functions, subclasses, and generic types. So the
code is longer than what is desirable, and will contain more bugs. Moreover,
repeated instances of code hinder maintenance because they force you to
manually track and modify each separate repeating fragment.
Despite
what competitions such as the Obfuscated C Code Contest might let one believe,
writing complex, undecipherable code isn’t a sign of programmer prowess. We
can’t clearly express all algorithms. However, deliberately obtuse code is an
insult to developers who need to maintain it in the future. When you see code
that seems too complex, try to simplify it, starting with expression or
statement-level refactoring operations. If you succeed, you’ll know the code’s
quality isn’t all that great, and you’ll leave the code better than you found
it.
The
mother of unclear and complex code is code that expresses a complicated design.
Here, look for mechanisms such as inheritance, interfaces, and generic types
that are used with no apparent gains. Design patterns also are sometimes
misused in a similar manner.
There’s
a difference between a complex design that serves a purpose and one that simply
shows off designer expertise or guards against vague, unspecified requirements
that might appear in the future. The agile programming community has taught us
that paying upfront the cost of future-proof gold plating is a disastrous waste
of programmer productivity. The design becomes difficult to navigate today
while the rewards may or may not be reaped in the future. So, when you
encounter complex designs that fit this description, don’t feel intimidated;
simply lament the fact that somebody’s paranoiac design ideas will cause you
some everyday pain.
At the
opposite end of complex designs, you may find programs that use too few
abstractions, hiding several assumptions deep inside the code. Two chief and
easily recognizable culprits in this area are fixed buffer sizes and magic
numbers. Although nowadays, C++, Java, and .NET come with a feature-rich
container framework, one can still find code that declares a fixed-size array
instead of using the platform’s vector container, which implements an
expandable array of objects. The number of elements the array will hold is
hidden within the code, and the reasons for choosing that particular number are
often similarly hidden in the programmer’s head. This code will crash when it
faces more elements than those originally envisioned, and, sometimes, it may
even expose the program to a buffer overflow attack.
Numbers
appearing out of the blue don’t only occur in array declarations. They come in
many shapes and sizes, and they’re aptly termed magic numbers. We can’t avoid
them, because we do need, for example, to specify that Transmission Control
Protocol (TCP) packets are identified by the number 6, and User Datagram
Protocol (UDP) packets by the number 17. However, directly embedding these
numbers in the code makes the code difficult to understand and maintain.
Using an
array instead of a vector may be bad, but implementing a resizable collection
from scratch is worse. Be wary when you find code implementing functionality
that duplicates facilities available in the standard libraries. Although
standard libraries are typically mature, stress-tested, and well-thought-out,
homegrown code duplicating their functionality is anything but. When you find a
complete XML parser or a specially crafted version of a sort function in a
program, you’ll have to spend effort to understand its interface, verify its
functionality, and ensure its performance is adequate. You’ll endure all that
because somebody wanted to re-invent the wheel or didn’t bother to read the
platform’s documentation.
Unit
tests are becoming standard, but don’t assume that any code lacking unit tests
is deficient. Other methods of testing also are acceptable, especially for
legacy code. For example, the liberal use of assertions in the code, or the
provision of a framework for regression testing, can often provide a solid
testing infrastructure. Also, programs implemented as filters or programs
providing a scriptable interface also can be easy to test through an external
harness. On the other hand, when it’s obvious that no one ever considered how
the code will be tested, and the code defies reasonable attempts to test it,
then you have a problem.
Many of
these bad code traits may seem trivial, and many are. Some can even be superficially
corrected with a code formatting tool such as indent. However, these traits are
simply danger signs. If those who wrote the code weren’t motivated enough or
didn’t pay enough attention to keep the simple things in order, you can be sure
that many worse issues may be lurking.
Having
identified substandard code in a project we’d like to use, we need to take a
step back and evaluate the situation. Is there perhaps an alternative project,
written to a higher standard, that we can adopt? If not, we have to establish
some risk mitigation measures: plan for more development time, expect bugs to
emerge, and anticipate that small changes may require significant rework.
Ideally, we also should arrange for an opportunity to give the code a facelift
to fix the worst-designed elements and correct style problems. If possible, we
should coordinate our code improvement work with the project’s managers,
contributing our share to the community from which we benefited.
Diomidis
Spinellis is an associate professor in the Department of Management Science and
Technology at the Athens University of Economics and Business. He’s the author
of two open source perspective books: Code
Reading (Software Development Productivity Award 2004), and the recent Code Quality (Addison-Wesley, 2006).
He’s currently heading the research project “Source Quality Observatory for
Open Source Software (SQO-OSS)”, which is partially funded by the European
Community’s Sixth Framework Programme.
e-Mail:
dds@aueb.gr
Editor’s Note: Some of this material appeared on informit.com in “The Bad Code
Spotter's Guide.”