http://www.spinellis.gr/pubs/jrnl/2005-IEEESW-TotT/html/v25n2.html This is an HTML rendering of a working paper draft that led to a publication. The publication should always be cited in preference to this draft using the following reference:
Tools of the Trade
Using and Abusing XML
Diomidis Spinellis
Words are like leaves; and where they most abound,
Much fruit of sense beneath is rarely found.
— Alexander Pope
I was recently gathering GPS coordinates and phone cell identification data,[1] researching how to obtain results like those of the algorithms hiding behind Google’s “My Location” facility. While working on this task, I witnessed once again the great interoperability benefits we get from the use of XML. With a simple 140-line script, I converted the data I had gathered into a de facto standard, the XML-based GPS exchange format called GPX. Then, using a converter of various GPS formats, I converted my data into Google Earth’s XML data format. A few mouse clicks later, I had my journeys and the associated cell tower switchovers beautifully superimposed on satellite pictures and maps.
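My script isn’t worth reproducing, but a minimal sketch of the idea (with invented field names for the collected records; not the actual 140-line program) could look like this in Python:

import xml.etree.ElementTree as ET

def to_gpx(points):
    # Build a bare-bones GPX tree; the record fields lat, lon, time, and cell
    # are illustrative, not the format my logger actually used.
    gpx = ET.Element("gpx", version="1.1", creator="cell-logger",
                     xmlns="http://www.topografix.com/GPX/1/1")
    seg = ET.SubElement(ET.SubElement(gpx, "trk"), "trkseg")
    for p in points:
        pt = ET.SubElement(seg, "trkpt", lat=str(p["lat"]), lon=str(p["lon"]))
        ET.SubElement(pt, "time").text = p["time"]              # ISO 8601 timestamp
        ET.SubElement(pt, "name").text = "cell %s" % p["cell"]  # tower identifier
    return ET.ElementTree(gpx)

to_gpx([{"lat": 37.97, "lon": 23.72, "time": "2008-01-15T09:30:47Z",
         "cell": "202-01-1234"}]).write("journey.gpx")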
XML is an extremely nifty format. Computers can easily parse XML data, yet humans can also understand it. For example, a week ago a UMLGraph user complained that pic2plot clipped elements from the scalable vector graphics (SVG, another XML-based format) file it generated. I was able to suggest a workaround that modified the picture’s bounding box, which was clearly visible as two XML tag attributes at the top of the file.
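To give an idea of what I mean (the attributes of that particular file aren’t reproduced here), the top of an SVG document typically reads something like

<?xml version="1.0"?>
<svg xmlns="http://www.w3.org/2000/svg" width="540" height="360">
  <!-- drawing elements follow -->
</svg>

Changing those two numbers changes the picture’s extent; no special tooling is required.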
Furthermore, a simple tool can trivially determine whether an XML document is well formed (meaning that it follows XML’s rules). And, if we have at hand the document’s schema (a formal description of its allowed composition, such as GPX), we can validate that a given file follows the schema. These properties are a boon to interoperability. With the XML schema at hand, when we stumble across a data transfer problem between two applications, we don’t need to quarrel about whose program is at fault. A third party, an XML validator, can judge whether the data follows the schema, and thereby impartially assign the fault to the data’s producer or its consumer.
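As a sketch of this impartial arbitration (assuming Python with the third-party lxml library and hypothetical file names), the check takes only a few lines:

from lxml import etree

schema = etree.XMLSchema(etree.parse("gpx.xsd"))  # the agreed-upon schema
doc = etree.parse("journey.gpx")                  # fails loudly if the file isn't well formed
if schema.validate(doc):
    print("journey.gpx follows the schema")
else:
    for error in schema.error_log:                # the validator, not the programmers, assigns blame
        print(error)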
XML also gives our code more robust input handling. Input processing is a notorious source of bugs, because there are literally infinite ways to provide wrong input to a program. Moreover, the situation is even worse these days, because malicious adversaries deliberately craft input data aiming to crash our program or, worse, exploit its privileges. By using XML, we can address this problem by relying on the widely available libraries for parsing our input. These libraries are, by design and through their ubiquitous deployment, much more resistant to abuse than any special-purpose code we could concoct on our own.
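For instance, a reader built on Python’s standard-library parser (a sketch with illustrative element names) rejects malformed input before our own code ever acts on it:

import sys
import xml.etree.ElementTree as ET

def load_points(path):
    try:
        tree = ET.parse(path)            # the library does the heavy lifting
    except ET.ParseError as err:         # malformed input is reported, not acted upon
        sys.exit("%s: not well-formed XML: %s" % (path, err))
    # Namespace handling is omitted for brevity.
    return [(pt.get("lat"), pt.get("lon")) for pt in tree.iter("trkpt")]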
Finally, by adopting XML, we can take advantage of the scores of tools that work on arbitrary XML documents. Common tasks, like editing, validation, transformations, and queries, are then just a matter of selecting and applying the right tool. Also, we can then apply the experience we gain with these tools to other documents we come across in our work. And if, like me, you’re a devoted user of the Unix toolchest, have a look at XMLgawk.[2] It manages to combine gracefully exactly what its awkward name suggests.
When we use XML, we sacrifice (sometimes significant) processing time and space to gain interoperability. So it makes sense to verify that we’ve actually achieved our goal. Once you come up with a schema, ensure that you have at least one independently written program to read and write data in that schema. Additionally, have a human edit the file, and verify that its structure is intuitive to someone unfamiliar with the schema and that the programs can still read and process the edited file. Also, formally document your schema in a schema language, such as RELAX NG or XSD (XML Schema Definition), and then validate your XML files with a third-party tool.
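As a minimal sketch (nothing like the real GPX schema), an XSD describing a stand-alone waypoint element might read

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="waypoint">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="time" type="xs:dateTime"/>
      </xs:sequence>
      <xs:attribute name="lat" type="xs:decimal" use="required"/>
      <xs:attribute name="lon" type="xs:decimal" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

Any off-the-shelf validator can then check every file you exchange against it.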
Another way to promote interoperability is to adopt existing
schemas. You
can do that either in a wholesale fashion, by
having your application read and write its data in an already existing schema,
for instance SVG, or piecemeal, by having parts of your XML document follow
widely adopted standards. For example, the GPX schema uses the XML Schema xsd:dateTime data type for time-stamping waypoints. This data type is in turn precisely defined by reference to ISO 8601, the international standard for date and time representations. This approach lets you reuse large swaths of existing work and avoid troublesome ambiguities. A criticism of the Office Open XML file format is exactly that it doesn’t use existing standards for many of the elements it represents, such as (you probably guessed it) dates, but also math and drawings.
Furthermore, try to make your program’s XML output accessible to non-XML tools and humans. Specifically, if your data consists of records up to, say, 80 characters long, fit each one on a single line. This lets many line-oriented tools, like Unix’s wc, awk, sed, and grep, process your data. In more complex files, use appropriate indentation to make the file’s structure apparent to its human viewers.
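If, say, each track point occupies its own line, everyday questions need no XML machinery at all (the file name is invented):

grep -c '<trkpt ' journey.gpx            # count the recorded track points
grep 2008-01-15 journey.gpx | wc -l      # how many records fall on a given day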
By far the worst offence I’ve seen in the take-up of XML is its adoption as a format for human-produced code. Three representative examples are Apache Ant build files, XML schema definitions (XSD), and Extensible Stylesheet Language Transformations (XSLT). XML is an adequate, if verbose, format for data that programs produce and consume, but a nightmare for humans looking at anything more complex than what can fit on a screen. In most programming languages, tokens get a large part of their meaning from the context in which they appear. For instance, a word appearing on the left of an open bracket is a function or method name. Contrast this with XML, where each token is explicitly assigned its meaning through tags and attributes. For example, in a make file, we can associate a value with a variable by writing
TESTSRC=test/src
Placement on one side or the other of the equals sign distinguishes the variable from its value. In the corresponding XML-based Ant build file, we write the equivalent as
<property name="testsrc" location="test/src"/>
In this case, we use named attributes to specify what’s assigned to what. XML’s approach simplifies the parsing of arbitrary files, but the corresponding verbosity hinders comprehension and comfortable programming.
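The gap widens beyond single assignments. A one-line make rule such as

test-classes: ; javac -d build $(TESTSRC)/*.java

grows, in a sketch of its rough Ant counterpart (hypothetical, and not exactly equivalent), into

<target name="test-classes">
  <javac srcdir="${testsrc}" destdir="build"/>
</target>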
In computer languages, there’s seems to be a
sweet spot between conciseness and wordiness. This spot is
aApparently, the it’s the place where the
means for expressing an idea matches our cognitive ability. Languages
occupying this spot seem to be are the ones in which we achieve long-term
productivity (this includes maintenance). Some
languages or programming styles, like APL and Perl one-liners, have strayed to
extreme conciseness. Other languages, like Cobol and XML, err toward
excessive wordiness. Both extremes hinder the software’s analyzability,
changeability, and stability and, therefore, its maintainability. Even
with the best editor, expressing oneself yourself in XML is a lot less productive than
coding the same ideas in a notation specifically designed for a given problem. To convince
yourself tTry
rewriting a simple make file into its aAnt XML-equivalent. ThereforeSo, if humans will
typically communicate with your software using a language, invest some effort
in designing it properly, rather than relying on the bland (dis)comfort of XML.
Another popular misuse of XML involves the thin wrapping of arbitrary data with XML tags. Because XML is so flexible, it’s easy to take any data format, throw in a few tags in the most convenient places, and (following the letter of the XML definition) call that an XML document. Yet such documents are difficult to process effectively with standard XML tools; their validation is a charade, and transformations and queries become all but impossible. Consider, for instance, the XML file format used for storing iTunes libraries. Its generation apparently takes the shortcut of converting Apple’s Core Foundation types into a so-called property list, which has the outward appearance of XML. Yet the contents of such files are key/value pairs, such as the following:
<key>Name</key><string>Audiobooks</string>
<key>Playlist ID</key><integer>94</integer>
In a better, tailor-designed XML file format, we’d expect this pair to be something like
<name id="94">Audiobooks</name>
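The difference shows as soon as you try to query the file. With the tailored element, an XPath expression such as

//name[@id="94"]

selects the playlist directly, whereas in the property-list form the value is related to its key only by position, forcing contortions like

//key[.="Playlist ID"]/following-sibling::integer[1]

(both expressions are merely illustrative).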
A similarly dysfunctional XML file results if we dump a relational database into XML as columns, rows, and tables. Again, we miss the opportunity to express in XML the deeper relationships between our records, which is really XML’s strength.
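To illustrate (the table and element names are invented), compare a generic row dump

<table name="orders">
  <row><column name="id">7</column><column name="customer_id">42</column></row>
</table>

with a structure that carries the relationship itself:

<customer id="42" name="Acme">
  <order id="7" date="2008-01-15"/>
</customer>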
ThereforeSo, when you’re designing
an XML document, place yourself in the mindset of its consumer. Think,
what’s the best possible structure you would expect? Then
invest in mapping your data into the schema you’ve designed.
Diomidis
Spinellis is an associate
professor in the Department of Management Science and Technology at the Athens
University of Economics and Business and the author of Code Quality:
The Open Source Perspective
(Addison-Wesley, 2006). Contact him at
dds@aueb.gr.
Tools of the Trade
Using and Abusing XML
Diomidis Spinellis
XML has many strengths: computers and humans can
both process it, special tools can validate it, and it promotes robust input
handling. To achieve interoperability, we should formally define schemas (adopting existing ones when possible) and test XML data with different producers and consumers. Formatting the data in a way that is accessible to both human readers and popular software tools is also good practice. XML is also easily misused. Its adoption as a format for human-produced code and the thin wrapping of arbitrary data with XML tags are two popular offences.
keywords: