http://www.spinellis.gr/pubs/jrnl/2003-CACM-URLcite/html/urlcite.html
This is an HTML rendering of a working paper draft that led to a publication. The publication should always be cited in preference to this draft using the following reference:

Diomidis Spinellis. The decay and failures of web references. Communications of the ACM, 46(1):71–77, January 2003. (doi:10.1145/602421.602422)

Citation(s): 183 (Google Scholar), 95 (selected).

This document is also available in PDF format.

The document's metadata is available in BibTeX format.

Find the publication on Google Scholar

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

Diomidis Spinellis Publications

© ACM, 2003. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Communications of the ACM, 46(1):71-77, ISSN 001-0782, (January 2003). http://doi.acm.org/10.1145/602421.602422

The Decay and Failures of Web References

Diomidis Spinellis
Department of Management Science and Technology
Athens University of Economics and Business
dds@aueb.gr

Abstract

Increasingly, scholarly publications contain references to material on the Web. We examine the accessibility and decay rate of these references by extracting and inspecting 4224 ULR references from 2471 computer science articles that appeared over the last five years. Of those URLs 27% were not accessible, while close to 50% of them became inaccessible 4 years from the date they were published. In addition, we found that deep URL path hierarchies are linked to a larger number of failures; educational and research material on the Web is referenced three times more than its population representation; pages hosted by educational and commercial sites are equally probable to deteriorate. In the short term none of the proposed approaches for dealing with the problem is likely to be a panacea.

Keywords

Web, citations, URL, decay, 404

1 Introduction

The widespread adoption of the World-Wide-Web as a mechanism for sharing information has brought with it the corresponding ubiquity of Web (URL) references. Nowadays URLs regularly appear on billboards, packages, business cards, print adverts, film trailers, clothes, and, as references in scientific articles. Most readers have probably experienced a ``dead link'': a Web reference that for a variety of reasons will not lead to a valid or correct Web page. A dead link stemming from a URL that appears in the context of everyday life is usually a minor inconvenience that can be alleviated with the use of a Web index or a search engine; it will seriously affect only the future archeologists trying to untangle the web of our daily lives. On the other hand, a dead Web link that appears on a scientific article has wider implications. Citations in scholarly work are used to build upon existing work, substantiate claims, provide the context in which research is performed, and present, analyze, and compare different approaches or methodologies. Therefore, references that can not be located, seriously undermine the foundations of modern scientific discourse.

The objective of this article is to examine, quantify, and characterize the quantity and quality of Web links used in the computing literature. Our aim is to provide concrete figures related to the availability of URL references as a function of their age, their domain, the depth of the path used, as well as the technical reasons leading to failed links.

Our research has been greatly aided by the emergence of online versions of traditional paper-based publications [4]. By tapping into the online libraries of the ACM and the IEEE Computer Society we were able to download, extract, and verify 4375 Web links appearing in print articles over the period 1995-1999. In the following paragraphs, we describe the technologies related to Web references and retrieval, outline the methodology we followed, present the obtained results, and discuss their implications.

2 Technology Overview

Internet resources are typically specified using the string representation of ``Uniform Resource Locators'' (URLs). URLs are a subset of the Uniform Resource Identifiers (URIs) that provide an abstract identification of a resource location [3]. In general, URLs consist of a scheme (e.g. http, ftp, mailto) followed by a colon and a scheme-specific part. The syntax of the scheme-specific part can vary according to the scheme. However, URL schemes that involve direct use of an IP-based protocol to an Internet host use the following common syntax:

//<user>:<password>@<host>:<port>/<url-path>

The double slash indicates that the scheme data complies with the Internet scheme syntax. In our sample, over 98% of the URLs we encountered used the HTTP URL scheme which is used to designate Internet resources accessible using HTTP (HyperText Transfer Protocol). The host is specified using the fully qualified domain name of a network host or its IP address. The default port for the HTTP scheme is 80 and is usually omitted.

For a Web page of a given URL to appear on a browser's screen a number of different technologies and protocols must work in concert. In addition, the changing realities of the Web and the Internet have vastly complicated the simple end-to-end request-reply protocol that used to form the basis of the early HTTP transactions. Any failure along the complicated chain of actions needed to retrieve a Web page will lead to a failed URL reference.

When a Web page is accessed for the first time, the name of the host must be resolved into a valid IP address. Although before the emergence of the Web there was typically a one-to-one correspondence between the IP address of non-routing hosts and a domain name the situation now is more complicated. Many hosts respond to different IP addresses, associating a different Web site with each address (virtual IP-based hosting). As IP addresses are becoming a scarce resource it is also common to associate many domain names for the same host and IP address and serve different Web sites based on the host name used for a given page request (virtual name-based hosting). Finally, Web sites receiving a lot of traffic may associate different IP addresses and hosts for the same domain name in order to distribute traffic among hosts.

The path appearing in a URL will nowadays not necessarily match with a corresponding local file on the server. Web servers provide a number of mechanisms for managing namespaces. Some of them are: the creation of a separate namespace for every local user, the definition of protection domains and access mechanisms, the support of aliases to map namespaces to local directories, and the dynamic creation of content using technologies such the common gateway interface (CGI) and active server pages (ASP). In addition, a feature of the HTTP protocol called content negotiation allows a server to provide different pages based on technical or cultural characteristics of the incoming request (e.g. bandwidth, display technology, languages the user can understand).

One final complication results from the fact that Web transactions seldom now follow the idealized IP path from the user host to the server. Both ends are likely to be protected by firewalls actively blocking or even modifying content that passes through them. At the user-end routers utilizing network address translation (NAT) mechanisms as a way to manage a limited pool of IP addresses are likely to hide the IP address of the end-host from the server. Finally, proxy servers-working either in cooperation with the end-user or transparently intercepting requests-will cache documents and serve them without communicating with the original server.

Any failure along the complex path we described will often result in a failed request for a URL. The HTTP protocol defines 24 different errors that can occur within an HTTP exchange. In addition, some errors can occur before the client and server get a chance to communicate. In practice, while verifying thousands of URLs we encountered the following errors:

400 Bad request: The syntax used for the request could not be understood by the server. In our case this error typically signifies badly written URLs.
401 Unauthorized: The request requires user authentication. Such an error can result when citations are given to URLs that exist within a domain of services that require registration (e.g. the ACM Digital Library), or when such services move from a free access to a registration-based model. It is debatable whether this return code classifies an access as a failure. A number of digital library services are increasingly provided over the Web on a subscription basis; lacking authorization to access such a service is similar to material not being available in the local library.
403 Forbidden: The server is refusing to fulfill the given request, in this case however proper authorization can not be used to retrieve the page. It is conceivable that URLs that are not part of the public Internet end up as citations when the authors fail to realize that they have special privileges to access certain repositories that do not apply to the global Internet population. As an example, our organization has transparent access to a collection of on-line journals with authentication based on the client IP address. URLs to this collection provided by unsuspecting users will typically generate a 403 error.
404 Not Found: This infamous and quite common response signifies that the server has not found anything matching the Request-URI. This error is typically generated when Web site maintainers change file names that are part of the given URL path or entirely remove the referenced material. Note that this protocol error can be followed by customized content-typically HTML text that informs the user of the problem and provides alternative navigation options.
500 Internal Server Error: The server encountered an unexpected condition which prevented it from fulfilling the request. This error can occur when a server is wrongly configured, or, more commonly, if a program or database that is used to serve dynamic content fails.
503 Service Unavailable: The server is currently unable to handle the request due to temporary overloading or maintenance of the server. Errors of this type sometimes appear on a misconfigured server, or servers overwhelmed by traffic.
504 Gateway Time-out: The server, acting as a proxy or gateway did not receive a timely response from the upstream server specified by the URI (e.g. HTTP, FTP) or some other auxiliary server (e.g. domain name server-DNS) it needed to access in attempting to complete the request. When HTTP requests are transparently intercepted by a proxy caching server, network connectivity problems are likely to appear as 504 errors.
901 Host lookup failure: The host name could not be mapped to an IP address. This error (which is not part of the HTTP protocol) signifies a problem in retrieving the IP address of the server using the DNS services. Likely causes include changes of host names, and DNS server failures or connectivity problems.

3 Methodology

Our research involved the verification of URL references appearing in published material available on the Digital Library of the ACM and the IEEE Computer Society. The process comprised the following steps:

``crawl'' each site and download published articles,
convert articles to text,
extract URLs, and
verify URL accessibility.

Although a large number of document collections are available on the Web, we decided early-on to limit our research to two publications: IEEE Computer (Computer), and the Communications of the ACM (CACM). Our decision was based on the fact that both publications:

are available in electronic and paper form,
are distributed to a wide audience and are widely accessible,
are published in regular and frequent intervals, and
contain articles from a diverse set of CS disciplines and authors.

However, by concentrating our research on two publications we limited the generality of the obtained results. CACM and Computer do not represent the typical journal as, contrary to common practice, editors verify the URLs before publishing a paper thus filtering-out invalid URLs submitted by the authors or invalidated during the period leading to the publication. In addition, publications from other scientific domains, or with a different focus such as archival research journals, or general circulation magazines are likely to exhibit different characteristics regarding the appearance of URLs and their validity. To allow other researchers build upon our work, we have made available the complete set of URLs, their source, and the programs used to verify their accessibility at the article's companion web site http://www.spinellis.gr/sw/url-decay.

We first downloaded all articles appearing in the two publications using a set of programs that crawled through the digital libraries of the two organizations. This phase started on February 21st, 2000 and was completed on May 5th, 2000. Over 9GB of raw material were downloaded during the process.

CACM articles are available on the ACM digital library in PDF format. In order to extract URLs we first converted the articles into text form. CACM articles appearing in the library before 1995 are scanned images; we did not attempt to OCR those issues. As 1995 was also the earliest year in which Computer was available on-line we decided to use the articles in the period 1995-1999. In total we used 2471 items: 1411 articles from Computer (38.2MB of HTML) and 1060 articles from CACM (18.9MB of text).

We extracted URLs from the full text body of each article. The IEEE Computer Society digital library provides articles in both HTML and PDF format. Articles that appear in HTML format have embedded URLs tagged as hypertext references (HREF), which can be easily extracted. The extraction of URLs from the text of the CACM articles proved more challenging; it was performed using a custom filter program and manual inspection.

After extracting the URLs we removed duplicate appearances of URLs in the same article (21 cases for CACM, 362 for Computer). In total we ended up with 4224 URLs: 1391 (33%) obtained from CACM and 2833 (67%) obtained from Computer. We found a mean number of 1.71 URL references per article (median 0, mode 0) with a maximum of 127 URL references in a single article. A single complete URL was referenced by a mean number of 1.49 (median 1, mode 1) different articles in our sample with a maximum of 22 references for a single URL. The HTTP scheme was by far the most widely used: 4158 URLs (98%) used the HTTP scheme and only 66 URLs (2%) used the FTP scheme.

Finally, we verified the accessibility of each URL by attempting to download each identified resource. We repeated this procedure three times, with a weekly interval between different runs, starting at different times, from two different networks and hosts to investigate transient availability problems. No substantial differences were found between the runs. Here we report the results obtained on June 29th and 30th 2000. We did not merge positive results from different runs; our results reflect a model where a reader tries to access a URL one single time. We did not perform any semantic processing on the retrieved URLs; we assume that if a URL could be accessed its contents would match the intent of the original reference.

4 Results

Figure 1: URL retrieval results.

Despite our original reservations concerning the source material we used, the results we obtained have been corroborated by similar studies of web-published documents [9]. Of the URLs we checked 72% could be retrieved without a problem. The successful retrieval rates differ depending on the URL source: 63% for CACM and 77% for Computer URLs. This difference can probably be attributed to the fact that Computer URLs are tagged as such in the HTML article text. The reasons for failed URL retrievals are classified in Figure 1. By far the most common reason was that the resource referenced no longer existed on the given server (error 404, 60% of the failures). The second most common (22%) failure reason was an invalid host name (error 901), while network problems (error 504) only represented a 8% of the failures; a tribute to network availability. It is interesting to note that 83% of the failures can be attributed to invalid URL hostnames or paths (errors 901 and 404) i.e. that addressing in all its forms is the predominant factor in URL failures. The clustering of failure modes allows us from this point onwards to classify failed URLs into just two different groups: the network problems that occur while trying to reach the host (errors 504 and the DNS access subset of the 901 errors) and the server problems that occur while resolving the host name and once the host is reached.

Figure 2: URLs and retrieval results by year.

The temporal aspect of URL references and respective failures is extremely interesting. As can be seen in Figure 2, URL references exhibited an exponential increase from 1995 to 1998 and appear to be leveling-off afterwards. This plateau is to be expected since the number of references in articles are constant over time (editors in the examined journals often impose limits); URL references apparently increased by displacing citations to printed sources. The most striking result that can be deduced from Figure 2 is that in our sample

the half-life of a referenced URL is approximately 4 years from its publication date

i.e. that four years after their publication about 50% of the URLs are not accessible. It is also interesting to note that 20% of the URLs are not accessible one year after their publication and that after that the first year the URL decay is constant at about 10% per year for the next three years. Although URL decay appears to stabilize after that point, (a result that appeals to intuition-these will be URLs to authoritative sources on properly maintained servers) we have not sufficient historical data to substantiate this claim. The 20% decay during the first year can either be attributed to a high infant URL mortality or the large period an article takes from its inception to its publication.

Figure 3: URL decay from different time perspectives.

We were able to repeat the URL accessibility test two years after the 2000 exercise, on August 2002. The combined results of the two tests appear in Figure 3. What is apparent is the inexorable decline of the accessibility of the 1998 4-year old URL references towards the 60% mark (as opposed to the 50% we originally predicted), the further decline of the 1995 and 1996 URL accessibility towards 40%, and a small but significant difference between the aging behaviour of older and newer URLs. For the years for which we have comparable data (URL ages 3 to 5 years) more recent URLs (coming from the years 1997-1999 in the 2002 test) appear to be more accessible than their predecessors (years 1995-1997 in the 2000 test). This difference in URL aging over time can probably be attributed to increased author efforts to cite URLs that are less likely to disappear, and improved web site maintenance practices.

Figure 4: URLs and retrieval results by URL depth.

One result that appears to have both predictive and prescriptive value concerns the relationship between the path depth of a given URL and its failure rate. As can be seen in Figure 4, the number of URLs of a given path depth is linearly decreasing between path depths 0 and 2; it appears to decrease at a exponential rate after that point. What is significant is that while the network-related problems are, as expected, approximately constant relative to the path depth, server-related problems (mainly 404 errors) increase as the depth increases. While no direct causation can be deduced, we can point out that

deep path hierarchies are linked to increased URL failures.

This result is not immediately intuitive. A deep hierarchy is a sign that someone spent effort to organize content in a structure that should persist over time. We have two explanations for this result:

Each element of a URL path has a constant probability to fail due to changes of names, organizational structure, and maintenance personnel. These failures accumulate over all elements of the path.
URLs with a short path are more likely to be cited than those with a longer path, therefore site administrators try to keep them alive.

We also examined the relationship between two other URL properties-references to specific files and user directories-and the respective failure rates. We identified URLs that referenced specific files (e.g. http://www.acm.org/pubs.html) rather than directories (e.g. http://www.acm.org/cacm) assuming that if the last part of the URL contained a dot it referred to a file. The difference between the two URL classes is noteworthy: in total 40% of URLs referring to files could not be retrieved whereas only 23% of URLs referring to directories had the same problem.

Some HTTP servers allow the specification of separate content directories maintained by end-users using the ~username convention. We hypothesized that URLs to such user directories (which we found to be 13% of the total) were more likely to fail than others due to the higher mobility of individuals. In fact only 24% of these URLs had retrieval problems; the respective figure for the rest was 28%.

Figure 5: URLs and retrieval results by primary domain.

In Figure 5 we list the retrieval result of the referenced URLs according to their top-level domain. The domains .com, .edu and .org represent 74% of all referenced URLs. Other studies [7] have estimated that on a global scale only 6% of the Web servers have scientific or educational content. Since in our sample URLs in the .edu domain form 23% of the total we can deduce that these URLs are referenced three times more frequently than what would be expected by their population. It is interesting to note that URLs in the .com and .edu domains are equally likely to fail; a startling result given the radically different management models prevalent in educational establishments and companies. Also remarkable is the fact that URLs in the .org domain are less likely to fail than the other two categories; contrary to intuition, it appears that the management structures used by the ad-hoc volunteer efforts typically hosted in .org domains result in more stable Web sites.

5 Improving URI Longevity

Problems with links getting out of date are not new, they have been with us since the emergence of hypertext structures. A number of schemes have been suggested for maintaining error-free hypertext structures [1]. However, the unique problem posed by URLs appearing in print journals stems from the uneasy coexistence of two radically different in their affordances paradigms: hypertext-based electronic publishing and paper media. Once a URL is committed to paper it can not be modified and it might be difficult to locate and trace.

The emergence of publications that appear both electronically and on paper [6] can help alleviate the tensions between the two formats. It has certainly helped us trace URL references, and we can envisage a system that would keep hypertext references up-to-date in the non-paper part of hybrid publications. In the future, citation linking [5], provided as a service by publishers or public-service efforts [8], may lead to publication formats that actively support hypertext links across time.

Uniform Resource Names (URNs) have been proposed as a way to provide persistent, location-independent, resource identifiers [11]. However, URNs-typically consisting of an authority identifier followed by a string-are at a low level similar to URLs. Organizations that find it difficult to organize URLs will face the same problem with URNs and vice versa [2]. While URNs can solve the problem of maintaining connection with a moving target, they can not solve the problem of accessing deleted material.

A technology specifically targeting the persistent and interoperable identification and exchange of intellectual property in the digital environment is the Digital Object Identifier (DOI) system [12]. A DOI, consisting of a publisher prefix and a suffix identifying the work, is registered together with the corresponding URL and metadata in a central DOI directory working as a routing system. DOI-based requests are forwarded to the directory and are resolved to a current and valid URL supplied by the respective rights holder. Based on this technology, a large number of publishers (including the ACM and the IEEE) accounting for over 3,500 journals have teamed-up to provide CrossRef, a reference linking service.

One alternative way for reestablishing contact with invalid URL references is to use one of the Web's main search engines. However, research indicates that search engine coverage is low (around 16%) and decreasing, indexing varies considerably between different sites (with educational and non-US sites more likely to be disadvantaged), while the use of metadata which could be used to automatically locate citations is quite low [7]. The same research estimates the amount of data in the publicly indexable Web to 15 Tbytes; we therefore believe that the creation of historical ``snapshots'' of the Web is not within the realm of our current technical capabilities.

Based on our results, ways to alleviate the specific problem of invalid URL references appearing in print articles can be identified by concentrating on the distinct roles of the principal stakeholders involved in the process.

Professional societies, and publishers should draft and encourage the use of sound Web citation practices and verify referenced URLs as part of an article editing process. Both should also work towards establishing online repositories for Web material (such as the ACM Computing Research Repository) and endow those with policies to guarantee their longevity in the very long term. Publishers of archival research journals have an additional obligation towards the future generations that will access their material. Although some URLs are less important than others, a guideline limiting Web citations in archival journals to organized collections with concrete long-term retention policies may be the only responsible action consistent with the current state of the art.

Researchers should appreciate the limitations of Web citations regarding their probable lifespan and use them sparingly rather than gratuitously keeping in mind that the Web is not an organized library. Where possible, they should prefer citing the published version of a work to its online version, and citing material in organized collections over material in corporate or personal Web pages. In some cases they could even exert peer pressure to authors of material the would like to reference encouraging them to move it to an organised online repository. In addition, researchers should reference items using the shortest possible path, and avoid references to ephemeral data (e.g. non-archived news reports) and proprietary material.

Finally, maintainers of Web sites should try to preserve the validity of existing URLs and clearly indicate ephemeral pages that should not be cited (or linked). The standardization of appropriate HTML meta tags for indicating the projected longevity of a particular page will help all the above stakeholders unambiguously identify the extent to which a page can be cited.

A more ambitious scheme would have all stakeholders cooperate to establish a long term archive of referenced material, similar in nature with existing Internet archival efforst such as www.archive.org. Citations would reference the archived version of Web material:

http://archive.acm.org/http/2000.03.02.23.45.12/www.ibm.com/ai.htm

Under such a scheme libraries and professional societies would establish and promote the use of archival services perpetually keeping copies of referenced material-subject to intellectual property restrictions. Researchers would cooperate with Web-site producers to obtain copies of their material for archival while Web-site producers should be encouraged to draft and implement liberal policies for placing cited material under long-term archival custody.

6 Conclusions

Around 28% of the URLs referenced in Computer and CACM articles between 1995 and 1999 were no longer accessible in 2000; the figure increased to 41% in 2002. In addition, after four years 40-50% of the referenced URLs become inaccessible. A noteworthy parallel can be observed between the four years we calculated as the half life of referenced URLs and five years given as the median citation age for computer science [10]. One could claim that the self-organizing nature of the Web filters-out irrelevant URLs at approximately the same rate as these have traditionally been rendered obsolete in printed journal articles.

The Web has revolutionized on a global scale the way we distribute, disseminate and access information and, as a consequence, is creating a disruptive paradigm shift in the way human scientific knowledge builds upon and references existing work. In the past, libraries could provide reliable archival services for books and printed journals; the emergence of the Web is marginalizing their role. In the short term none of the approaches towards solving the general problem of dangling URL references is likely to be a panacea. It is therefore important to appreciate the importance of Web citations and invest in research, technical infrastructures, and social processes that will lead towards a new stable scientific publication paradigm.

Acknowledgements

Many thanks to Isidor Kouvelas, Vasilis Prevelakis, and the anonymous referees who contributed valuable comments on earlier drafts of this paper. Networking resources were provided by the Aegean-Net and GU-Net projects which are co-funded under the second EU Community Support Framework.

References

[1]: Helen Ashman. Electronic document addressing: Dealing with change. ACM Computing Surveys, 32(3):201-212, September 2000.
[2]: T. Berners-Lee. Cool URIs don't change. Online: http://www.w3.org/Provider/Style/URI, Current June 2002, 1998.
[3]: T. Berners-Lee, L. Masinter, and M. McCahill. RFC 1738: Uniform resource locators (URL), December 1994.
[4]: Peter J. Denning. The ACM digital library goes live. Communications of the ACM, 40(7):28-29, July 1997.
[5]: S. Hitchcock, L. Carr, S. Harris, J. M. N. Hey, and W. Hall. Citation linking: improving access to online journals. In Proceedings of the 2nd ACM international conference on Digital libraries, pages 115-122, Philadelphia, PA, USA, July 1999.
[6]: Rob Kling and Lisa Covi. Electronic journals and legitimate media in the systems of scholarly communication. The Information Society, 11(4):261-271, 1995.
[7]: Steve Lawrence and C. Lee Giles. Accessibility of information on the web. Nature, 400:107-109, 1999.
[8]: Steve Lawrence, C. Lee Giles, and Kurt Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, 32(6):67-71, June 1999.
[9]: Steve Lawrence, David M. Pennock, Gary William Flake, Frans M. Coetzee, Eric Glover, Finn Årup Nielsen, Andries Kruger, and C. Lee Giles. Persistence of web references in scientific research. Computer, 34(2):26-31, February 2001.
[10]: Arthur Jack Meadows. Communicating Research, pages 221-222. Academic Press, 1998.
[11]: R. Moats. RFC 2141: URN syntax, May 1997.
[12]: Norman Paskin. E-citations: actionable identifiers and scholarly referencing. Learned Publishing, 13(3):159-168, July 2000.