http://www.spinellis.gr/pubs/jrnl/2005-IEEESW-TotT/html/v25n2.html
This is an HTML rendering of a working paper draft that led to a publication. The publication should always be cited in preference to this draft using the following reference:


This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.



© 2008 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Tools of the Trade

Using and Abusing XML

Diomidis Spinellis

 

 

Words are like leaves; and where they most abound,

 Much fruit of sense beneath is rarely found.

— Alexander Pope

 

I recently experimented with gathering GPS coordinates and phone cell identification data, researching how one could obtain results similar to those of Google’s “My Location” facility.[1] While working on this task, I witnessed once again the great interoperability benefits we get from XML. With a simple 140-line script, I converted the data I gathered into a de facto standard, the XML-based GPS-exchange format called GPX. Then, using a GPS-format converter, I converted my data into Google Earth’s XML data format. A few mouse clicks later, I had my journeys and the associated cell tower switchovers beautifully superimposed on satellite pictures and maps.

Convenient versatility

 

XML is an extremely nifty format. Computers can easily parse XML data, yet humans can also understand it. For example, a week ago a UMLGraph user complained that pic2plot clipped elements from the scalable vector graphics (SVG, another XML-based format) file it generated. I was able to suggest a workaround that modified the picture’s bounding box, which was clearly visible as two XML tag attributes at the top of the file.

 

Furthermore, a simple tool can trivially determine whether an XML document is well formed (meaning that it follows XML’s rules). And, if we have at hand the document’s schema (a formal description of the allowed composition of a specific document, such as GPX), we can validate that a given file follows the schema. These properties are a boon to interoperability. With the XML schema at hand, when we stumble across a data transfer problem between two applications, we don’t need to quarrel about which program is at fault. A third party, an XML validator, can judge whether the data follows the schema, and thereby impartially assign the fault to the data’s producer or consumer.
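As a rough illustration of how little code a well-formedness check needs, here is a minimal sketch using the parser in Python’s standard library. The helper name is_well_formed and the two sample strings are made up for this example. Validation against a schema needs a separate schema-aware tool, which this sketch does not cover.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML, that is, follows XML's syntax rules."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Illustrative GPX-like fragments (namespace omitted for brevity).
good = '<gpx version="1.1"><wpt lat="37.98" lon="23.73"/></gpx>'
bad = '<gpx version="1.1"><wpt lat="37.98" lon="23.73"></gpx>'  # <wpt> never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

The check says nothing about whether the elements make sense for GPX; it only reports whether the text obeys XML’s syntax.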

 

XML also gives our code more robust input handling. Input processing is a notorious source of bugs, because there are literally infinite ways to provide wrong input to a program. Moreover, malicious adversaries deliberately craft input data aiming to crash our program or, worse, gain and exploit its privileges. By using XML, we can solve this problem by relying on the widely available libraries for parsing our input. These libraries are, by design and through their ubiquitous deployment, much more resistant to abuse than any special-purpose code we could concoct on our own.

 

Finally, by adopting XML, we can take advantage of the scores of tools that work on arbitrary XML documents. Common tasks, like editing, validation, transformations, and queries, are then just a matter of selecting and applying the right tool. Also, we can then apply the experience we gain with these tools to other documents we come across in our work. And if, like me, you’re a devoted user of the Unix toolchest, have a look at XMLgawk.[2] It manages to combine gracefully exactly what its awkward name suggests.

Best practices ...

When we use XML, we sacrifice (sometimes significant) processing time and space to gain interoperability. So, it makes sense to actually verify that we’ve achieved our goal. Once you come up with a schema, ensure that you have at least one independently written program to read and write data in that schema. Additionally, have a human edit the file, and verify that its structure is intuitive to someone unfamiliar with the schema, and that the programs can still read and process the edited file. Also, formally document your schema in a schema language, such as RELAX NG or XSD (XML Schema Definition), and then have a third-party tool validate your XML files.

 

Another way to promote interoperability is to adopt existing schemas. You can do that either wholesale, by having your application read and write its data in an existing schema, for instance SVG, or piecemeal, by having parts of your XML document follow widely adopted standards. For example, the schema for GPX uses the XML Schema xsd:dateTime data type for time stamping waypoints. This data type is in turn precisely defined by reference to ISO 8601, the international standard for date and time representations. This approach lets you reuse large swaths of existing work and avoids troublesome ambiguities. A criticism of the Office Open XML file format is exactly that it doesn’t use existing standards for many of the elements it represents, such as (you probably guessed it) dates, but also math and drawings.
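To make the piecemeal reuse concrete, the following sketch builds a GPX-style waypoint whose time stamp follows the ISO 8601 form that xsd:dateTime accepts. It is only an illustration: the function name waypoint, the coordinates, and the date are invented, and the GPX namespace declaration is left out to keep the example short.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def waypoint(lat: float, lon: float, when: datetime) -> ET.Element:
    """Build a GPX-style <wpt> element with an ISO 8601 <time> child."""
    wpt = ET.Element("wpt", lat=f"{lat:.6f}", lon=f"{lon:.6f}")
    # Normalize to UTC and use the "Z" suffix, a form xsd:dateTime accepts.
    utc = when.astimezone(timezone.utc)
    ET.SubElement(wpt, "time").text = utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    return wpt


# Hypothetical position fix, used only for illustration.
fix = datetime(2008, 1, 15, 9, 30, 0, tzinfo=timezone.utc)
xml_text = ET.tostring(waypoint(37.994, 23.732, fix), encoding="unicode")
print(xml_text)
```

Because the time stamp uses a standard representation, a consumer can parse it with an off-the-shelf date library instead of guessing at a home-grown format.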

 

Furthermore, try to make your program’s XML output accessible to non-XML tools and humans. Specifically, if your data consists of records up to, say, 80 characters long, fit each one on a single line. This lets many line-oriented tools, like Unix’s wc, awk, sed, and grep, process your data. In more complex files, use appropriate indentation to make the file’s structure apparent to its human viewers.
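One way to follow this advice is sketched below: each short record is written on its own line, so the file remains well-formed XML while a plain line filter can also process it. The element name cell, its attributes, and the sample records are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: (cell id, latitude, longitude).
records = [(101, 37.994, 23.732), (102, 37.995, 23.735), (101, 37.996, 23.739)]

lines = ["<cells>"]
for cell_id, lat, lon in records:
    rec = ET.Element("cell", id=str(cell_id), lat=str(lat), lon=str(lon))
    # One short record per line keeps the file friendly to grep, wc, and awk.
    lines.append("  " + ET.tostring(rec, encoding="unicode"))
lines.append("</cells>")
document = "\n".join(lines) + "\n"

# The result is still well-formed XML ...
root = ET.fromstring(document)
# ... yet a plain line filter, the equivalent of grep 'id="101"', also works.
matches = [line for line in document.splitlines() if 'id="101"' in line]
print(len(root), len(matches))
```

The line filter here stands in for the Unix tools; the same file could be counted with wc -l or searched with grep without any XML awareness.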

... and tar pits


 

By far, the worst offence I’ve seen in the take-up of XML is its adoption as a format for human-produced code. Three representative examples are the Apache Ant build files, the XML schema definitions (XSD), and the extensible stylesheet language transformations (XSLT). XML is an adequate, if verbose, format for data that programs produce and consume, but a nightmare for humans looking at anything more complex than what can fit on a screen. In most programming languages, tokens get a large part of their meaning from the context in which they appear. For instance, a word appearing on the left of an open bracket is a function or method name. Contrast this with XML, where each token is explicitly assigned its meaning through tags and attributes. For example, in a make file, we can associate a value with a variable by writing

 

TESTSRC=test/src

 

Placement on one side or the other of the equals sign distinguishes the variable from its value. In the corresponding XML-based Ant build file, we write the equivalent as

 

<property name="testsrc" location="test/src"/>

 

In this case, we use named attributes to specify what’s assigned to what.

 

XML’s approach simplifies the parsing of arbitrary files, but the corresponding verbosity hinders comprehension and comfortable programming.

In computer languages, there seems to be a sweet spot between conciseness and wordiness. Apparently, it’s the place where the means for expressing an idea matches our cognitive ability. Languages occupying this spot seem to be the ones in which we achieve long-term productivity (this includes maintenance). Some languages or programming styles, like APL and Perl one-liners, have strayed to extreme conciseness. Other languages, like Cobol and XML, err toward excessive wordiness. Both extremes hinder the software’s analyzability, changeability, and stability and, therefore, its maintainability. Even with the best editor, expressing yourself in XML is a lot less productive than coding the same ideas in a notation specifically designed for a given problem. To convince yourself, try rewriting a simple make file into its Ant XML equivalent. So, if humans will typically communicate with your software using a language, invest some effort in designing it properly, rather than relying on the bland (dis)comfort of XML.

 

Another popular misuse of XML involves the thin wrapping of arbitrary data with XML tags. Because XML is so flexible, it’s easy to take any data format, throw in a few tags in the most convenient places, and (following the letter of the XML definition) call that an XML document. Yet, such documents are difficult to process effectively with standard XML tools. Their validation is a charade, and transformations and queries become all but impossible. For instance, consider the XML file format used for storing iTunes libraries. Its generation apparently takes the shortcut of converting Apple’s Core Foundation types into a so-called property list, which looks like XML on the outside. Yet the contents of such files are key/value pairs, such as the following:

 

<key>Name</key><string>Audiobooks</string>

<key>Playlist ID</key><integer>94</integer>

 

In a better, tailor-designed XML file format, we would expect this pair to be something like

 

<name id="94">Audiobooks</name>
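The practical difference shows up as soon as we try to query the two forms. The sketch below is only an illustration: the enclosing dict and playlists elements and the helper plist_lookup are assumptions added so that the fragments parse as complete documents.

```python
import xml.etree.ElementTree as ET

# Key/value form; the enclosing <dict> is assumed for this example.
plist_style = ET.fromstring(
    "<dict>"
    "<key>Name</key><string>Audiobooks</string>"
    "<key>Playlist ID</key><integer>94</integer>"
    "</dict>"
)
# Tailored form; the enclosing <playlists> is likewise hypothetical.
tailored = ET.fromstring('<playlists><name id="94">Audiobooks</name></playlists>')


def plist_lookup(d: ET.Element, wanted: str):
    """Find the value for a key by walking siblings in document order."""
    children = list(d)
    # Keys and values are related only by position, so pair them up by hand.
    for key, value in zip(children[0::2], children[1::2]):
        if key.tag == "key" and key.text == wanted:
            return value.text
    return None


# Tailored format: a single path expression names what we want.
print(tailored.find("name[@id='94']").text)
# Key/value format: custom code is needed to re-create the pairing.
print(plist_lookup(plist_style, "Name"))
```

In the tailored form the relationship between the identifier and the name is part of the markup, so generic query tools can use it; in the key/value form that relationship lives only in the order of sibling elements.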

 

A similarly dysfunctional XML file will result if we dump a relational database in XML as columns, rows, and tables. Again, we miss the opportunity to express in XML the deeper relationships between our records, which is really XML’s strength.

So, when you’re designing an XML document, place yourself in the mindset of its consumer. Think: what’s the best possible structure you would expect? Then invest in mapping your data into the schema you’ve designed.

 

Diomidis Spinellis is an associate professor in the Department of Management Science and Technology at the Athens University of Economics and Business and the author of Code Quality: The Open Source Perspective (Addison-Wesley, 2006).   Contact him at dds@aueb.gr.



XML has many strengths: computers and humans can both process it, special tools can validate it, and it promotes robust input handling. To achieve interoperability, we should formally define schemas (adopting existing ones, when possible), and test XML data with different producers and consumers. Formatting the data in a way that is accessible to both human readers and popular software tools is also a good practice. XML is also easily misused. Its adoption as a format for human-produced code and the thin wrapping of arbitrary data with XML tags are two popular offences.

 




[1] http://www.google.com/gmm/mylocation.html

[2] http://home.vrweb.de/~juergen.kahrs/gawk/XML/