Diomidis Spinellis. The Decay and Failures of URL References. Communications of the ACM, 46(1):71-77, January 2003.
In the article we examine the accessibility and decay rate of web references by extracting and inspecting 4224 URL references from 2471 computer science articles that appeared over the last five years. Of those URLs, 27% were not accessible, while close to 50% of them became inaccessible within 4 years of the date they were published. In addition, we found that deep URL path hierarchies are linked to a larger number of failures; educational and research material on the Web is referenced three times more than its population representation; and pages hosted by educational and commercial sites are equally likely to deteriorate. Two important article findings can be concisely stated as follows:
We then extracted the URLs appearing in each article. In total we processed 2471 articles: 1411 articles from Computer (38.2MB of HTML) and 1060 articles from CACM (18.9MB of text). After extracting the URLs we removed duplicate appearances of URLs in the same article (21 cases for CACM, 362 for Computer). We ended up with 4224 URLs: 1391 (33%) obtained from CACM and 2833 (67%) obtained from Computer.
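The extraction and per-article duplicate removal described above can be sketched roughly as follows. This is a minimal illustration, not the tooling used for the article: the regular expression, the trailing-punctuation rule, and the function names `extract_unique_urls` and `count_urls` are all assumptions made for the example.

```python
import re

# Simplified URL pattern (an assumption for illustration only):
# a scheme followed by any run of characters up to whitespace or an angle bracket.
URL_PATTERN = re.compile(r"(?:https?|ftp)://[^\s<>]+", re.IGNORECASE)

# Punctuation that commonly trails a URL in running text (also an assumption).
TRAILING = ".,;:)\"'"


def find_urls(article_text):
    """Return every URL occurrence in the text, repeats included."""
    return [m.group(0).rstrip(TRAILING) for m in URL_PATTERN.finditer(article_text)]


def extract_unique_urls(article_text):
    """Return the URLs of one article in order of first appearance, without repeats."""
    seen = set()
    unique = []
    for url in find_urls(article_text):
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def count_urls(articles):
    """Return (URL references kept, duplicate appearances removed) over all articles.

    Duplicates are removed only within an article; the same URL cited by two
    different articles counts twice, as in the procedure described above.
    """
    kept = 0
    removed = 0
    for text in articles:
        found = find_urls(text)
        unique = extract_unique_urls(text)
        kept += len(unique)
        removed += len(found) - len(unique)
    return kept, removed


if __name__ == "__main__":
    sample = [
        "See http://example.com/a and http://example.com/a, also ftp://example.org/x.",
        "Visit http://example.com/a.",
    ]
    print(count_urls(sample))
```

Applied to the two sample strings, the sketch keeps three URL references and drops one duplicate appearance, because the repeated URL in the first string is counted once there and once more in the second.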
From this page you can download: