
Index-based Persistent Document Identifiers

Diomidis Spinellis
Department of Management Science and Technology
Athens University of Economics and Business1

Abstract

The infrastructure of a typical search engine can be used to calculate and resolve persistent document identifiers: a string that can uniquely identify and locate a document on the Internet without reference to its original location (URL). Bookmarking a document using such an identifier allows its retrieval even if the document's URL, and, in many cases, its contents change. Web client applications can offer facilities for users to bookmark a page by reference to a search engine and the persistent identifier instead of the original URL. The identifiers are calculated using a global Internet term index; a document's unique identifier consists of a word or word combination that occurs uniquely in the specific document. We use a genetic algorithm to locate a minimal unique document identifier: the shortest word or word combination that will locate the document. We tested our approach by implementing tools for indexing a document collection, calculating the persistent identifiers, performing queries, and distributing the computation and storage load among many computers.

``Users should beware that there is no general guarantee that a URL which at one time points to a given object continues to do so, and does not even at some later time point to a different object due to the movement of objects on servers.''

- T. Berners-Lee et al. Uniform Resource Locators (URL). RFC 1738.

1  Introduction

Internet resources are typically specified using the string representation of ``Uniform Resource Locators'' (URLs). URLs are a subset of the Uniform Resource Identifiers (URIs) that provide an abstract identification of a resource location [BLMM94]. URLs are often used to identify resources in hypertext links, printed media such as business cards, billboards, and publications, and in user-maintained collections such as bookmarks and visited site history files.

The dynamic nature of the web (Chankhunthod et al. [CDN+96] report the average lifetime of an HTML text object to be 75 days) results in URLs that quickly decay and become inaccessible [Pit99,Ash00]. According to our earlier work [Spi03], around 27% of the URLs referenced in IEEE Computer and Communications of the ACM articles from 1995-2000 were no longer accessible at the end of that period, and after four years almost 50% of the referenced URLs were inaccessible. Lawrence et al. [LPF+01] have identified similar trends for published URLs. A number of solutions have been proposed for handling this link integrity or referential integrity problem. The solution classes that have been identified [Ash00] include prohibiting change, maintaining document versions, regularly updating all links, aliasing a link's end points, posting notifications of changes to document locations, implementing forwarding mechanisms, automatically detecting and correcting broken links, and creating all links dynamically. For links to published papers, citation linking [HCH+99], provided as a service by publishers or by public-service efforts [LGB99], may lead to publication formats that actively support hypertext links across time.

Although URNs, PURLs, and the Handle mechanism may offer a long-term solution, they have not yet been universally adopted. Thus, individual user bookmarks and publicly distributed URLs quickly become obsolete as documents change names or directories, or are hosted in different places. Although a search engine [LG99,Tak00] can be used to relocate a document after a web server's ``404 Not Found'' response, this is a tedious and error-prone procedure. In this paper we describe a way to automate the searching task by providing a more persistent alternative to a URL. Such an alternative can be used both to provide page bookmarks that are relatively immune to URL changes and as a centralized, alternative method for creating URNs without the active cooperation of the content creators.

Our scheme involves having a search engine calculate for every URL an augmented, persistent version. The augmented URL, containing the original URL and words that uniquely identify the document, will have a high probability of locating the original document even if its contents have changed location. Users who save the persistent URL to bookmark a page can later transparently retrieve the document through a search engine's infrastructure. If the default document retrieval mechanism fails, the search engine will resolve the URL by searching for documents containing the words embedded in the URL. In addition, web sites can use persistent URLs to point to pages outside their administrative domain with a lower probability that these links will become unavailable when the respective page contents change location. As an example, given the URL http://moving.org/target whose contents could be uniquely identified by the words gloxinia and obelisk, the corresponding persistent URL would be of the form http://resolve.com/find?orig=http://moving.org/target&w=gloxinia+obelisk (to make the example clearer we have not URL-encoded the original URL). When the user tries to access the above URL, the search engine infrastructure at resolve.com will first try to retrieve the document at http://moving.org/target. If that fails, it will search its (up to date) index for a document containing the words gloxinia and obelisk. If one of the two above actions succeeds, the user will be redirected to the document's original or revised location; otherwise a ``404 Not Found'' error will be returned.
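
The resolution logic amounts to a retrieval attempt followed by a conjunctive index query. The following Python sketch illustrates it; the search_index callable stands in for the engine's index lookup, and all names and parameters are illustrative rather than part of the implementation described later in this paper.

    from urllib.error import URLError
    from urllib.parse import parse_qs, urlparse
    from urllib.request import urlopen


    def resolve(persistent_url, search_index):
        """Resolve a persistent URL of the form
        http://resolve.com/find?orig=<URL>&w=<word>+<word>.
        search_index is assumed to map a list of words to the URLs of
        the documents containing all of them."""
        query = parse_qs(urlparse(persistent_url).query)
        original_url = query["orig"][0]
        words = query["w"][0].split()           # '+' decodes to a space
        try:
            with urlopen(original_url, timeout=10):
                return original_url             # original location still works
        except (URLError, OSError):
            pass                                # fall back to the index search
        matches = search_index(words)           # documents containing all words
        return matches[0] if matches else None  # None maps to ``404 Not Found''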

The remainder of this paper outlines the current process of document retrieval and the associated errors (Section 2), describes our algorithm for calculating unique document discriminants (Section 3), and sketches a prototype implementation of the concept (Section 4). In Section 5 we discuss the method's performance in terms of retrieval accuracy, time and space requirements, and scalability. The paper concludes with a presentation of possible extensions and applications of our technique.

2  Web Document Retrieval

In general, URLs consist of a scheme (e.g. http, ftp, mailto) followed by a colon and a scheme-specific part. The syntax of the scheme-specific part can vary according to the scheme. However, URL schemes that involve direct use of an IP-based protocol to an Internet host use the following common syntax:

//<user>:<password>@<host>:<port>/<url-path>

The double slash indicates that the scheme data complies with the Internet scheme syntax. The host is specified using the fully qualified domain name of a network host or its IP address. The default port for the HTTP scheme is 80 and is usually omitted.
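
As an aside, Python's standard urllib.parse module decomposes a URL into exactly these components; the snippet below is a quick illustration only and is unrelated to the system described in this paper.

    from urllib.parse import urlparse

    parts = urlparse("http://user:secret@www.example.com:8080/docs/paper.html")
    print(parts.scheme)    # 'http'
    print(parts.username)  # 'user'
    print(parts.password)  # 'secret'
    print(parts.hostname)  # 'www.example.com'
    print(parts.port)      # 8080
    print(parts.path)      # '/docs/paper.html'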

For a web page of a given URL to appear on a browser's screen a number of different technologies and protocols must work in concert. In addition, the changing realities of the web and the Internet have vastly complicated the simple end-to-end request-reply protocol that used to form the basis of the early HTTP transactions. Any failure along the complicated chain of actions needed to retrieve a web page will lead to a failed URL reference.

The path appearing in a URL will nowadays not necessarily match a corresponding local file on the server. Web servers provide a number of mechanisms for managing namespaces. Some of them are: the creation of a separate namespace for every local user, the definition of protection domains and access mechanisms, the support of aliases to map namespaces to local directories, and the dynamic creation of content using technologies such as the common gateway interface (CGI), servlets, and server-modified pages (ASP, JSP, PSP). In addition, a feature of the HTTP protocol called content negotiation allows a server to provide different pages based on technical or cultural characteristics of the incoming request (e.g. bandwidth, display technology, languages the user can understand).

The HTTP protocol defines 24 different errors that can occur within an HTTP exchange. In addition, some errors can occur before the client and server get a chance to communicate. In practice, while verifying thousands of published URLs we encountered the following errors:

400 Bad request
The syntax used for the request could not be understood by the server. This may signify a badly formed URL often coupled with a browser bug.

401 Unauthorized
The request requires user authentication. Such an error can result when citations are given to URLs that exist within a domain of services that require registration, or when such services move from a free access to a registration-based model.

403 Forbidden
The server is refusing to fulfill the given request; in this case, however, proper authorization cannot be used to retrieve the page. It is conceivable that URLs that are not part of the public Internet end up as citations when their authors fail to realize that they have special privileges to access certain repositories, privileges that do not apply to the global Internet population. As an example, our organization has transparent access to a collection of on-line journals with authentication based on the client IP address. URLs to this collection provided by unsuspecting users will typically generate a 403 error.

404 Not Found
This infamous and quite common response signifies that the server has not found anything matching the Request-URI. This error is typically generated when web site maintainers change file names that are part of the given URL path or entirely remove the referenced material. Note that this protocol error can be followed by customized content, typically HTML text that informs the user of the problem and provides alternative navigation options.

500 Internal Server Error
The server encountered an unexpected condition which prevented it from fulfilling the request. This error can occur when a server is wrongly configured, or, more commonly, if a program or database that is used to serve dynamic content fails.

503 Service Unavailable
The server is currently unable to handle the request due to temporary overloading or maintenance of the server. Errors of this type sometimes appear on a misconfigured server, or servers overwhelmed by traffic.

504 Gateway Time-out
The server, acting as a proxy or gateway, did not receive a timely response from the upstream server specified by the URI (e.g. HTTP, FTP) or some other auxiliary server (e.g. a domain name server, DNS) it needed to access in attempting to complete the request. When HTTP requests are transparently intercepted by a proxy caching server, network connectivity problems are likely to appear as 504 errors.

901 Host lookup failure
The host name could not be mapped to an IP address. This error (which is not part of the HTTP protocol) signifies a problem in retrieving the IP address of the server using the DNS services. Likely causes include changes of host names, and DNS server failures or connectivity problems.

Not all of the above problems can be solved by the provision of persistent URLs. A persistent URL will help in cases where the original document has changed its name (including the path to its name), resulting in a ``404 Not Found'' error, and in cases where the domain hosting the URL is renamed, resulting in a ``901 Host lookup failure'' error. Changes in a document's access authorization are not a technical but a legal problem, while the 500-class server errors are more appropriately handled by a robust network infrastructure and mechanisms such as proxies, mirrors, archives, and content delivery networks. In addition, persistent URLs cannot deal with documents that are deleted or modified, unless the corresponding URLs are designed to work in concert with an appropriate archive repository.
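
A resolver can encode the distinction drawn above when deciding whether a failed retrieval warrants an index search. The fragment below is a minimal sketch; the 901 value is the pseudo-code used in this paper for host lookup failures, not an HTTP status.

    # Failure classes where an index search is likely to help: the document
    # may still exist under a different name, path, or host.
    RECOVERABLE = {404, 901}        # 404 Not Found, 901 Host lookup failure

    # Failures better addressed by other means (authorization, robust
    # infrastructure, proxies, mirrors, archives).
    NOT_RECOVERABLE = {400, 401, 403, 500, 503, 504}


    def should_search_index(status):
        """True when a failed retrieval should trigger an index search."""
        return status in RECOVERABLE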

3  Unique Document Discriminants

Our method for creating persistent URLs involves calculating a minimal set of identifying words that can be used to uniquely select a given document in a search engine query. Consider three documents and their respective word contents 1: A B C F, 2: A B D F, and 3: A K C D F. The minimal unique identifiers (discriminants) for these documents are, for document 1, $A \land B \land C$; for document 2, $D \land B$; and for document 3, $K$. In calculating the minimal discriminants we also take into account the length of each word, so as to minimize the length of the respective search engine query. Given such a discriminant, a search engine query for it will result in a single matching element: the identified document, irrespective of the document's location (URL). As an example, our application calculated that the words sedam protectable currently uniquely identify the web page http://research.unc.edu/otd/inventors/overview.html. Thus a search engine query for the above terms will return a single result: the corresponding document. We theorize that for large collections and a regularly updated search engine index, the discriminants will continue to uniquely and correctly identify the document, even as new documents are added to the collection. The form submission mechanism of many search engines allows one to form a URL that will automatically perform the above search and return the corresponding result; as an example, the Google search engine URL for the page we described would be http://www.google.com/search?&q=sedam+protectable. Furthermore, a simple addition of an appropriate redirection header to the query results would even allow the entire operation to be performed transparently.
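
The three-document example can be checked mechanically against a toy inverted index; the sketch below simply counts which documents a candidate discriminant matches and is meant only to illustrate the idea.

    from functools import reduce

    # Toy collection: document identifier -> terms it contains.
    documents = {1: {"A", "B", "C", "F"},
                 2: {"A", "B", "D", "F"},
                 3: {"A", "K", "C", "D", "F"}}

    # Inverted index: term -> set of documents containing it.
    index = {}
    for doc_id, terms in documents.items():
        for term in terms:
            index.setdefault(term, set()).add(doc_id)

    def matches(discriminant):
        """Documents matched by the conjunction of the discriminant's terms."""
        return reduce(set.intersection, (index[t] for t in discriminant))

    assert matches({"A", "B", "C"}) == {1}  # unique discriminant for document 1
    assert matches({"D", "B"}) == {2}       # unique discriminant for document 2
    assert matches({"K"}) == {3}            # unique discriminant for document 3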

The computing infrastructure of a large search engine is ideally placed to efficiently perform both the calculation of the persistent URLs and their resolution. These two can then be provided as an extra service to the engine's users. In addition, the search engine will draw additional traffic (and potentially advertising revenues) each time a user accesses a bookmarked persistent URL or follows such a URL on the web.

The persistent URLs will most likely be less easy to remember than the URLs they are derived from. However, these URLs will be primarily stored as bookmarks and hyperlinks and automatically processed, rather than memorized and communicated by humans. For this reason, it is not necessary to use natural terms for searching; any terms that uniquely identify the document are suitable for this purpose.

The idea of locating documents by words those documents contain is not novel. Phelps and Wilensky [PW00] proposed the construction of robust hyperlinks by means of the similarly working lexical signatures and suggested a heuristic technique for calculating the signatures. Specifically, their method involves using terms that are rare in the web (have a high inverse document frequency, IDF), while also favoring terms with a high term frequency (TF) within the document, capping TF at 5 to avoid diluting a term's rarity. The IDF of each term is derived from a search engine, while the rest of the signature calculation can be performed locally. Park et al. [PPGK02] expanded on this idea by evaluating four basic and four hybrid lexical signature selection methods based on the TF, document frequency (DF), and IDF of those terms. For example, one of their proposed methods, TFIDF3DF2, involves selecting two words in increasing DF order, filtering out words having a DF value of one, and selecting three additional words maximizing TF×IDF. An important contribution of their work is an evaluation of the documents that the search engine returns in response to a lexical signature query in terms of uniqueness, appearance of the desired document at the top of the result list, and relevance of any other document links returned.

Our approach differs from the two methods we outlined in that it uses the search engine as an oracle for evaluating the selected word set. A stochastic algorithm can rapidly explore the search space (word combinations) to locate the combinations that best suit a selection criterion. This allows us, instead of having a fixed algorithm (such as TFIDF3DF2) identify a document's discriminants, to flexibly select from each document the discriminant that maximizes an objective function. Although the objective function we used is based on uniqueness and URL length, different functions, such as the relevance of the returned documents, could also be used.

3.1  Discriminant Calculation

Trying all of a document's term combinations to find a unique discriminant is a futilely expensive exercise. The number of different term combinations that can be selected from a document of $n$ terms is given by

$$ \sum_{i=1}^{n} \binom{n}{i} = \sum_{i=1}^{n} \frac{n!}{i!\,(n-i)!} = 2^n - 1 $$

At the 31 average unique terms per document of our data set this gives us $4 \cdot 10^9$ different term combinations for each document. Although selecting fewer terms (in practice we found that unique discriminants consisted on average of 1.47 terms) lowers the above figure, the complexity's exponential nature makes the exhaustive search prohibitively expensive on larger documents; selecting 2 out of 1000 terms results in $\binom{1000}{2} = 499{,}500$ combinations (out of the $2^{1000}-1$ possible ones).

An efficient deterministic solution to the problem would be preferable to the exhaustive search we outlined above. However, as we will show in the following paragraphs, the problem is intractable: it is NP-complete.

3.2  Intractability Proof

We will prove that the problem of locating a discriminant that identifies $k$ documents is NP-complete by demonstrating that a tractable P-time solution to the problem could be used to solve the subset sum problem, known to be NP-complete [GJ79]. We can formally express our problem as follows. Let $D = \{T_1, \ldots, T_n\}$ (our document) be a set of terms $T$. Each term $T_i$ is expressed as a set of the documents it occurs in. Let $D' \subseteq D$ be a document's discriminant. The number of documents $k$ that the discriminant $D'$ with $m = |D'|$ identifies is expressed as the cardinality of the intersection of the corresponding sets:

$$ k = \left| \bigcap_{i=1}^{m} D'_i \right| $$

A unique discriminant is one for which $k = 1$.

Similarly, the subset sum problem can be expressed as follows. Let $A = \{a_1, \ldots, a_n\}$ be a set of positive integers. Given an integer $s$, find a set $A' \subseteq A$ with $m = |A'|$ so that

$$ s = \sum_{i=1}^{m} a_i $$

If the $n$ elements of the set $A$ have values $0 \ldots m$ we can solve the subset sum problem in terms of the discriminant cardinality problem by using set intersection in the place of addition. For each element $a_i \in A$ we construct a set

$$ B_i = \left\{ b_{0,1}, \ldots, b_{m,n} \right\} - \left\{ b_{0,i}, \ldots, b_{m-a_i,\,i} \right\} $$

and let $B = \{B_1, \ldots, B_n\}$. The sets we constructed have the property that, given a set of positive integers $Q = \{q \mid q \in \{1 \ldots n\}\}$,

$$ \left| \bigcap_{i=1}^{|Q|} B_{q_i} \right| = \sum_{i=1}^{|Q|} a_{q_i} $$
Using our hypothetical discriminant cardinality P-time algorithm we find a $B' \subseteq B$ such that

$$ s = \left| \bigcap_{i=1}^{n} B'_i \right| $$

The $a_i$ elements that correspond to the $B'$ subset will satisfy the subset sum problem:

$$ s = \sum_{i=1}^{n} a_i \mid a_i \in A' $$

Having shown that the subset sum problem, which is known to be NP-complete, can be reduced to the discriminant cardinality problem, we have proved that the discriminant cardinality problem is also NP-complete.

3.3  Genetic Algorithm

We therefore use a non-deterministic, stochastic algorithm to search the term space. Genetic algorithms (GAs) [Hol75,Gol89,For96] are global optimization techniques that avoid many of the shortcomings exhibited by local search techniques on difficult search spaces, such as our unique discriminant selection problem. Goldberg [Gol94] describes a number of diverse GA applications, while Karr [Kar93] presents their use for modeling, design, and process control. GAs rely on modeling the problem as a population of organisms. Every organism represents a possible valid solution to the problem. Organisms are composed of alleles representing parts of a given solution. Standard genetic recombination operators are used to create new organisms out of existing ones by combining alleles of the existing organisms. In addition, mutations can randomly change the composition of existing organisms. Typically, the algorithm evaluates all the organisms of the population and creates new organisms by combining existing ones based on their fitness. This procedure is repeated until the variance of the population reaches a predefined minimum value or another heuristic criterion is satisfied.

GAs base their operation on a fitness function that evaluates an organism's suitability. The fitness function $\bar{O}(x)$ we want to maximize depends inversely on the number of documents $N_D$ a particular discriminant $x$ identifies and, to a lesser extent, on its length, as determined by the number of its words $N_W$ and the length $L_i$ of each word:

$$ \frac{1}{\bar{O}(x)} = 100\,(N_D(x) - 1) + \sum_{i=1}^{N_W(x)} L_i + 1 $$

As an exception to the above definition, organisms that fail to identify a single document ($N_D(x) = 0$) are given a fitness rank of 0.

An important characteristic of a genetic algorithm's implementation concerns the representation of each candidate solution. A good representation should ensure that the application of standard crossover recombination operators (where a new organism is composed from parts of two existing ones) will result in a valid new representation. The first organism implementation we used was an ordered set of terms. Thus, for a document containing the words [A B C D E F] two organisms could be [A C F] and [B E]. Following experimentation, we found that a boolean vector sized to represent all possible terms of a document, with discriminant terms represented by true values, was a more efficient implementation, allowing our code to run in a third of the original time. Using the boolean vector scheme, the two above organisms would be represented as [T F T F F T] and [F T F F T F].

Using the integers 0 and 1 for representing the true and false boolean values, the genetic algorithm for selecting the minimal unique discriminant out of N different terms can be described in the following steps:

  1. [Initialize a population of size $S$.] Set $P_{0 \ldots S,\, 0 \ldots N} \leftarrow \lfloor \mathrm{rand}[0 \ldots 1] \rfloor$.
  2. [Evaluate the population members, creating the organism fitness vector $T$.] For $i \leftarrow 0 \ldots S$: set $T_i \leftarrow \bar{O}(P_i)$.
  3. [Create the roulette selection probability vector $R$.] Set $R_i \leftarrow \sum_{j=0}^{i} \left( T_j / \sum_{k=0}^{S} T_k \right)$.
  4. [Create a new population using crossovers from the previous population.] For $i \leftarrow 0 \ldots S$: select $q$ and $r$ using the roulette selection probability vector so that $R_q \le \mathrm{rand}[0 \ldots 1) < R_{q+1}$ and $R_r \le \mathrm{rand}[0 \ldots 1) < R_{r+1}$; if $\mathrm{rand}[0 \ldots 1) <$ crossover rate, set $c \leftarrow \lfloor \mathrm{rand}[0 \ldots N) \rfloor$, set $P'_{i,\, 0 \ldots c} \leftarrow P_{q,\, 0 \ldots c}$, and set $P'_{i,\, c+1 \ldots N} \leftarrow P_{r,\, c+1 \ldots N}$; otherwise set $P'_i \leftarrow P_q$.
  5. [Introduce mutations.] For $i \leftarrow 0 \ldots S$: for $j \leftarrow 0 \ldots N$: if $\mathrm{rand}[0 \ldots 1) <$ mutation rate, set $P'_{i,j} \leftarrow \lfloor \mathrm{rand}[0 \ldots 1] \rfloor$.
  6. [Keep the fittest organism, following an elitist selection strategy.] Select $f$ so that $T_f \ge T_{0 \ldots S}$; set $P'_{\lfloor \mathrm{rand}[0 \ldots S) \rfloor} \leftarrow P_f$.
  7. [Make the new population the current population.] Set $P \leftarrow P'$.
  8. [Loop based on the population's variance.] If $\sum_{i=0}^{S} |T_f - T_i| >$ minimum variance, go to step 2; otherwise the algorithm terminates with the optimal document discriminant in $P_f$.
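
The steps above can be condensed into the following sketch. It is an illustration in Python rather than the C++ ripper handler of Section 4; the population size, crossover rate, mutation rate, and generation cap are placeholder values, not the Grefenstette parameter set used by the prototype, and the fitness argument is any callable scoring a set of terms (for example, the fitness sketch shown earlier bound to an index oracle).

    import random

    def genetic_discriminant(terms, fitness, pop_size=50, crossover_rate=0.6,
                             mutation_rate=0.001, min_variance=0.01,
                             max_generations=200):
        """Search for a term subset that maximizes fitness(subset).
        Organisms are boolean vectors over the document's candidate terms."""
        n = len(terms)
        decode = lambda org: {t for t, bit in zip(terms, org) if bit}
        pop = [[random.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
        best_org, best_score = pop[0][:], fitness(decode(pop[0]))

        for generation in range(max_generations):
            scores = [fitness(decode(org)) for org in pop]
            top = max(range(pop_size), key=scores.__getitem__)
            if scores[top] > best_score:
                best_org, best_score = pop[top][:], scores[top]
            # Terminate when the population has converged (low fitness spread).
            if sum(abs(scores[top] - s) for s in scores) <= min_variance:
                break

            total = sum(scores)
            weights = ([s / total for s in scores] if total > 0
                       else [1.0] * pop_size)                # roulette wheel
            new_pop = []
            for _ in range(pop_size):
                a, b = random.choices(pop, weights=weights, k=2)
                if random.random() < crossover_rate:         # one-point crossover
                    cut = random.randrange(n)
                    child = a[:cut] + b[cut:]
                else:
                    child = a[:]
                child = [not bit if random.random() < mutation_rate else bit
                         for bit in child]                    # random mutations
                new_pop.append(child)
            new_pop[random.randrange(pop_size)] = pop[top][:]  # elitist strategy
            pop = new_pop

        return decode(best_org)

In the prototype, the fitness oracle would be backed by the term index described in Section 4.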

The implementation of genetic algorithms can be tuned using a number of different parameters. In our implementation we used the parameters that Grefenstette [Gre86] derived using meta-search techniques, namely:

The random floating point numbers 0 < R < 1 used for selecting the crossover points, the mutation rates, and the selection of organisms were produced using the subtractive method algorithm [Knu81].

In addition we used some domain-specific heuristics:

All the above heuristics are based on the premise that less frequently used terms are more likely to be part of a unique document discriminant.

4  Prototype Implementation

We implemented a set of programs that process a (presumably all-encompassing) set of web pages and calculate for each page a minimal set of search terms (words) that can be used to uniquely identify that page within the set. To test our implementation we took advantage of the data set provided during the 2002 Google search engine [BP98] programming contest. The package provided to the contest participants included a programming framework for processing a pre-parsed document collection (the so-called ripper program) and a large collection of web documents. The breadth and architecture of the ripper programming framework strongly indicate that applications based on it could be easily ported to run on the actual Google infrastructure. The 5.9 GB data set we used consists of 916,429 pre-parsed HTML documents containing about $28 \cdot 10^6$ terms.

4.1  Implementation Overview

We calculate the discriminants in two steps:

  1. We create an index of all documents where each term occurs; in actual practice a search engine will always have this data structure at hand.

  2. For each document we use the genetic algorithm to try combinations of the terms it contains until we find one that does not occur in another document. The processing relies on the index calculated by the previous phase.

As usual, the devil is in the details, especially when dealing with the 1 million documents we processed and the $3 \cdot 10^9$ documents currently indexed by typical search engines.

Our application consists of tools for calculating unique discriminants on a single node (useful for proving the concept, trying out the data subset, and experimenting with the algorithms), and in an environment of multiple nodes (for processing the data set of a commercial search engine). In addition, we implemented simple tools for querying the results and obtaining the corresponding URLs.

4.2  Indexing

The index of a large text collection will not typically fit into a computer's fast main memory, while a disk-based structure will probably prove too slow for a realistic application scenario. Some applications have dealt with the problem by compressing the data in-memory [Mof92,Spi94], but this approach would still not accommodate our problem's scope and resource constraints.

We handled the problem by accumulating a term-to-document index in memory and monitoring the memory subsystem's performance. Once the system begins to page persistently (indicating that the memory's capacity has been reached and that performance will rapidly degrade due to thrashing), the index is flushed to disk as a sorted file. When all documents have been processed, the partial results are merged into a single file (idxdata) containing terms and the documents where each term occurs. An index file (idxdata.idx) allows rapid serial access to individual terms without having to traverse a term's document list. In addition, a separate file (idxdata.hash) allows rapid hash-code based access to individual terms. As the string hash function we use the one recently proposed for very large collections by Zobel et al. [ZHW01]. The hash file is created after the merge phase and can thus be optimally sized, using the Rabin-Miller probabilistic prime number algorithm [Sch96], to minimize collisions.
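
The flush-and-merge cycle can be summarized by the following sketch. It is a simplification for exposition: the actual indexer is a C++ ripper handler that detects memory pressure by monitoring paging activity, whereas the sketch uses a fixed in-memory posting budget, and the file layout shown is illustrative.

    import heapq
    import os
    import tempfile

    class IndexBuilder:
        """Accumulate a term -> document-id index, spilling sorted runs to disk."""

        def __init__(self, max_postings=1_000_000):
            self.postings = {}           # term -> set of document identifiers
            self.count = 0               # postings currently held in memory
            self.max_postings = max_postings
            self.runs = []               # file names of sorted partial indices

        def add(self, doc_id, terms):
            for term in set(terms):
                self.postings.setdefault(term, set()).add(doc_id)
                self.count += 1
            if self.count >= self.max_postings:  # stand-in for thrash detection
                self.flush()

        def flush(self):
            fd, name = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                for term in sorted(self.postings):
                    docs = ",".join(map(str, sorted(self.postings[term])))
                    run.write(f"{term}\t{docs}\n")
            self.runs.append(name)
            self.postings, self.count = {}, 0

        def merge(self, out_name="idxdata"):
            """N-way merge of the sorted runs into a single term index."""
            self.flush()
            files = [open(name) for name in self.runs]
            with open(out_name, "w") as out:
                previous, docs = None, []
                for line in heapq.merge(*files):   # runs stay sorted by term
                    term, ids = line.rstrip("\n").split("\t")
                    if previous is not None and term != previous:
                        out.write(previous + "\t" + ",".join(docs) + "\n")
                        docs = []
                    previous = term
                    docs.extend(ids.split(","))
                if previous is not None:
                    out.write(previous + "\t" + ",".join(docs) + "\n")
            for f in files:
                f.close()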

4.3  Stand-alone Operation

Figure 1: Data-flow diagram of the stand-alone operation.

The data-flow diagram of our system's stand-alone operation appears in Figure 1. A new handler of the Google ripper, named index, reads in pre-parsed documents and creates (in stages, by merging intermediate results) an index of the documents where each term resides (idxdata) and two files for accessing that index (idxdata.idx and idxdata.hash). A fourth file, topnodes, contains the frequencies of the 100,000 most frequently used terms; it is used as a cache during the discriminant calculation phase.

The second phase is also implemented as a separate ripper handler, named bookmark. This reads the terms of each document, selects the least frequently used ones, and creates the bookmarks file containing, for each document, its URL and the set of terms forming its unique discriminant. The accompanying bookmark.idx file allows locating a given document's URL and discriminant based on the corresponding document identifier.

Two query programs we wrote, cgi_query and cmd_query, search the term index for the conjunction of a set of specified words and display as search results the matching document URLs and the corresponding unique discriminants. The CGI application presents the discriminants as a URL; the end-user can bookmark this URL and thus visit our search engine to locate the document in the future. The HTML results also contain a link to a Google search with the same discriminant as a search expression. This search will not in general provide unique results, since Google's indexed collection is three orders of magnitude larger than the one we processed. One of the discriminants we tried did, however, identify and correctly locate a page that had been moved (renamed): a Google search for the discriminant ``ogca fy02'', which uniquely identified the page http://www.umass.edu/research/ogca/new.htm, now yields http://www.umass.edu/research/ogca/news/oldnews.htm.

4.4  Distributed Operation

Figure 2: Data-flow diagram of the distributed operation.

As is apparent from Figure 2, the system's distributed operation is considerably more complicated than the stand-alone case. It does, however, provide a framework for creating discriminants for collections that are orders of magnitude larger, using a large number of commodity processing nodes. The work distribution strategy is based on two premises:

  1. Documents are uniformly distributed across all processing nodes. Each node calculates and serves the discriminants for its documents.
  2. Each node is assigned a consecutive subset of terms (e.g. barometer-beholding). It is responsible for serving queries (documents that contain a given term) for the terms it is assigned.

To divide the term load across the nodes, we run a stand-alone instance of ripper --index on a small representative subset of documents. A separate text file contains a list of all processing nodes. The program divide:

On each node we then run an instance of ripper --index to process the node-specific preprocessed pages. The program scatter is then run on each node to split and copy the resulting term index according to the terms assigned to each node. Each node will thus receive its share of terms as indexed by all other nodes. The make_index program merges the node-specific terms generated by all nodes into a single unified index file for the given node. This file is accessed by the node's index_server program to provide term document occurrences to other nodes. The ripper_distr program, run on each node, communicates with the index_server responsible for a given term to obtain the global list of the term's documents. To reduce network communication overhead, initial term frequencies (our algorithm uses a subset of a document's least frequently occurring terms for selecting the unique discriminant) are obtained from the local term file; we assume that the local term frequency distribution corresponds to the global one. The generated unique document discriminants can then be accessed by the distributed versions of the query programs by using a document's identification number to locate the node where the respective discriminant is stored.
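
The routing of work to nodes can be sketched as follows; the node names and term-range boundaries are hypothetical, and in the real system the boundaries are presumably derived by the divide program from the representative document sample.

    import bisect

    # Nodes and the first term of the consecutive range each node serves;
    # node00 serves all terms that sort before "barometer".
    NODES = ["node00", "node01", "node02", "node03"]
    RANGE_STARTS = ["", "barometer", "lucid", "sieve"]

    def node_for_term(term):
        """Return the node whose term range contains the given term."""
        return NODES[bisect.bisect_right(RANGE_STARTS, term) - 1]

    def node_for_document(doc_id):
        """Documents are spread uniformly: a simple modulo assignment."""
        return NODES[doc_id % len(NODES)]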

5  Evaluation

Important aspects of algorithms expected to process the global web include, apart from the quality of the results, their requirements on processing time and space, and their scalability.

5.1  Discriminant Performance

Figure 3: Discriminant term length distribution

Number of terms # documents Document %
0 (no discriminant found) 5,107 0.56
1 531,713 58.02
2 331,638 36.19
3 41,132 4.49
4 5,956 0.65
5 783 0.08
6 80 0.01
7 16 0.00
8 3 0.00

Table 1: Discriminant term number distribution.

After processing the Google sample document collection we found that we needed an average of 1.47 terms to uniquely identify a document. The average length of each discriminant (Figure 3) was 10.77 characters, which, as a document identifier, compares extremely favorably with the 43.9 characters of the collection's average document URL length (excluding the initial http://). The number of terms for each document's discriminant was distributed as shown in Table 1. The algorithm failed to identify a discriminant for less than 1% of the documents processed.

Figure 4: Distribution of document matches across the calculated discriminants.

The ability of the calculated discriminants to uniquely identify a document, while not absolute, was, we believe, acceptable for the intended uses. About 50% of the discriminants our system calculated will locate a single document, while another 10% identify two documents. In total, 76% of the discriminants will locate fewer than 10 documents in the sample document collection (Figure 4). These figures can be further improved by tuning various GA parameters such as the number of terms of the candidate set, the number of common terms to eliminate, and the size of the organism pool. Many of the pages for which a unique identifier was not calculated contain very little textual material. As an example, our system's spectacular failure to create a unique identifier for the page http://humanities.uchicago.edu/depts/maph/ (it was identified by the term `humanities', which also matches another 29,497 pages) can be easily explained by the fact that this home page consisted entirely of pictures and text presented in graphical hypertext form using image maps.

5.2  Algorithm Performance

On average the GA ran for 10.6 generations to calculate each discriminant. However, the distribution of the number of GA generations required was highly skewed: the corresponding mode was 2, the median 5, and the standard deviation 16.5. To evaluate the performance of the GA against the heuristic selection of words, we calculated discriminants for a subset of 13,000 documents using a procedure that followed the selection traits of the GA, but not its evolutionary strategy. Specifically, for every document we created S = 50 organisms and let those mutate G times, where G was the number of generations the GA had run for that document. The initial organism was not random but, as was the case for the GA, was created using the allele probability selection vector. The results from this quasi-random selection were 34% worse than those obtained from the GA operation. In 58% of the cases the two methods yielded the same result. Although the advantage offered by the GA may not seem impressive, we believe that (a) only the GA will scale to handle the three orders of magnitude larger collection of the global web, and (b) the GA can be easily adapted to work with more demanding objective functions, whereas the heuristic techniques cannot.

5.3  Time Requirements

We indexed the sample document collection (916,429 documents, 3,152,415 unique lowercased terms) on a 733 MHz Celeron machine with 128 MB RAM and a 40 GB IDE disk in 18,605 real, 7,064 user, and 1,533 system seconds, at a throughput of 29.26 documents/s. The process used 18 intermediate indexed files, each of about 75 MB in size (the first one was 190 MB), with an intermediate file being dumped every six minutes. The hashed term database performed adequately, but not spectacularly, with 2,111,778 total term name collisions (resulting in an average of 1.49 disk index accesses per term) and a maximum of 173 term name collisions.

The time to perform the unique discriminant calculations varied, because we split the workload among 20 different machines. The time required on a 733 MHz Celeron machine with an IDE disk for processing the file pprepos.00 (16,564 documents) was 58,721 real, 184,054 user, and 9,604 system seconds, giving a processing time of 3.54 s/document. Systems with a SCSI disk subsystem performed better.

We unfortunately lacked appropriate resources (a large farm of networked processing nodes of similar technical characteristics) to perform rigorous experiments on the distributed implementation of our system. We were however able to obtain a lower bound of the expected performance by running the programs on a small number of workstations. By extrapolating from our results, we calculated the expected distributed operation time $T'$ given a stand-alone time of $T$ as $T' \ge 2.2\,T$.

5.4  Space Requirements

The indexing operation utilizes the maximum amount of main memory available, but is constrained by design to stop its memory usage growth once thrashing occurs. In our case, it processed the complete sample data set without a problem on machines with 128 MB RAM. The off-line space requirements comprise the space needed to store the term index and its hash file; for the sample document collection these amounted to 826,504,382 bytes for the index and 50,438,784 bytes for the hash file, an overhead of approximately 957 bytes per document.

The space requirements of the discriminant calculation phase are more difficult to judge. When performing calculations over the sample collection we observed a maximum resident set size of 62,140 KB. This number is likely to grow with a larger document set, but not by much, since it reflects the space needed to store the document instances of a document's least frequent terms. Given that, prior to the candidate set selection, only term frequencies are stored in memory, the term selection process can easily be adjusted to dynamically select terms that will load into main memory a fixed number of document occurrences, thereby providing a concrete bound on the memory usage.
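
One way to implement such a bound is to admit candidate terms, rarest first, until their combined document occurrences reach a fixed budget; the sketch below assumes the per-term collection frequencies are already available in memory, as described above, and its parameter values are illustrative.

    def select_candidate_terms(term_frequencies, max_occurrences=100_000):
        """Pick the rarest terms whose postings fit within a fixed budget.
        term_frequencies maps each of a document's terms to the number of
        documents in the collection that contain it."""
        selected, budget = [], max_occurrences
        for term, frequency in sorted(term_frequencies.items(),
                                      key=lambda item: item[1]):
            if frequency > budget:
                break            # loading these postings would exceed the bound
            selected.append(term)
            budget -= frequency
        return selected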

5.5  Scalability

Will our approach gracefully scale to cover a search engine's complete page collection? We consider as a test case Google's $2 \cdot 10^9$ page collection, and for our estimates we use the figure of 8,000 processing nodes reported in the technical press as comprising Google's infrastructure [Wag01].

Computing the term index will therefore require

$$ \frac{2 \cdot 10^9 \times 0.02}{8{,}000} \;\mathrm{s} = 83 \;\mathrm{minutes.} $$

The communications overhead will be roughly equivalent to that of copying the index files across the network; given an index file size of

$$ \frac{11{,}000 \times 2 \cdot 10^9}{8{,}000} = 2.75 \cdot 10^9 \;\mathrm{bytes} $$

this will add an overhead of 8 minutes using a switched 100 Mbit network with a 50% utilization rate. Merging the intermediate files is unlikely to present a problem; at one point an error in our thrash-monitoring code resulted in 1,700 intermediate files, which were merged without a problem.

The time to compute the discriminants will be larger. At 3.54 s/document the calculation of all discriminants will require

$$ \frac{3.54 \times 2 \cdot 10^9}{8{,}000} = 885{,}000 \;\mathrm{s} \approx 10 \;\mathrm{days.} $$
One should keep in mind that this process will be required to run very infrequently for the entire document collection and can then be run incrementally as new documents are added (the entire point of the unique discriminants is that they remain valid with a very high probability even as the structure of the web changes).

Note that all our time figures are based on the results we obtained using low-end 733MHz Celeron PCs with IDE hard disks. In addition, the indexing phase can be omitted and the discriminant calculation phase can be easily adjusted to use a search engine's existing term index structure. The discriminant calculation algorithm only needs access to an oracle that answers the question of how many documents are matched by a given term combination. We assume that a search engine's infrastructure is engineered and tuned to efficiently answer the above query and should therefore preferably be used by our algorithm.

6  Conclusions and Further Work

In the previous sections we outlined how our application utilizes search engine technologies to address an important shortcoming of today's web in a scalable and efficient manner. The alternative document identifiers we calculate are not only resilient to URL changes, but are also almost a fourth of the size of conventional URLs.

However, the document identifiers we provide are not suitable as universal replacements for the URLs currently employed. In document retrieval terms, their use involves a trade-off of noise over silence. Our calculation method is not perfect and there are cases where our algorithms will either fail to calculate a discriminant for a web page (e.g. when the page does not contain any text), or will calculate a discriminant that matches multiple pages. In addition, changes to the web's contents and to a search engine's coverage can make identifiers that were calculated to be unique match additional documents. Although we have not studied the temporal behavior of the discriminants we calculate, an obvious pathological case involves documents cloned from a given source document through small additions or changes. Such cloning is a common operation and is related to the scale-free topology of the web [BAJ00]. Cloned documents will very probably match the discriminant calculated for their original ancestor. Based on the above, we believe that unique discriminants are most suited for situations where a human can make an informed choice about using them and remains in the loop during their use. As an example, personal bookmarks are particularly well suited to being stored using unique discriminants (or to including unique discriminants as a failover mechanism). Bookmarks are highly prone to aging since they are typically not formally maintained; in addition, once a bookmark matching several documents is followed, the user can intelligently choose between the different pages.

During our work, we noted a number of improvements that could be employed for optimizing the algorithm's performance and for further increasing the usefulness of the obtained results. These include:

We end our description by noting that the realization of the application we outlined was made possible only through the combination of multiple computer science disciplines: information retrieval was the domain where our problem was formulated, algorithms and data structures provided the framework to obtain the solution, operating system concepts allowed us to bound the indexing memory requirements, complexity theory gave us the theoretical background for searching for algorithmic solutions, stochastic approaches were used for sidestepping the problem's NP-complete nature, and networking and distributed systems technology provided the framework for developing the distributed implementation. Increasingly, the immense scale of the web is necessitating the use of multidisciplinary approaches to tackle information retrieval problems.

Acknowledgements

During the system's implementation phase I benefited from fruitful discussions with Elisa Fragkaki, Stavros Grigorakakis, Vasilis Kapouleas, and Michalis Vazirgiannis. The string hashing expression used in the hashing functor was written by Justin Zobel. The prime number generator used for sizing the hashed database contains code developed by Bradley Smith and Greg Heileman at the University of New Mexico. The distributed version of the system relies on the socket++ library developed by Gnanasekaran Swaminathan and patched for Linux and FreeBSD by Lauri Nurmi. Finally, I would like to thank the paper's anonymous referees for their insightful and astute comments on an earlier version of this paper.

References

[Ash00]
Helen Ashman. Electronic document addressing: Dealing with change. ACM Computing Surveys, 32(3):201-212, September 2000.

[BAJ00]
Albert-László Barabási, Réka Albert, and Hawoong Jeong. Scale-free characteristics of random networks: the topology of the world-wide web. Physica A, 281:69-77, 2000.

[BLMM94]
T. Berners-Lee, L. Masinter, and M. McCahill. RFC 1738: Uniform resource locators (URL), December 1994. Updated by RFC1808, RFC2368 [Fie95,HMZ98].

[BP98]
Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1-7):107-117, April 1998. Seventh International World Wide Web Conference Proceedings (WWW7).

[CDN+96]
Anawat Chankhunthod, Peter B. Danzig, Chuck Neerdaels, Michael F. Schwartz, and Kurt J. Worrell. A hierarchical internet object cache. In USENIX Technical Conference Proceedings, Berkeley, CA, January 1996. Usenix Association.

[Cer85]
V. Cerny. Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm. Journal of Optimization Theory and Applications, 45:41-51, 1985.

[Fie95]
R. Fielding. RFC 1808: Relative uniform resource locators, June 1995. Updates RFC1738 [BLMM94]. Updated by RFC2368 [HMZ98].

[For96]
Stephanie Forrest. Genetic algorithms. ACM Computing Surveys, 28(1):77-83, March 1996.

[GJ79]
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.

[Glo90]
F. Glover. Tabu search - part I. ORSA Journal on Computing, 1:190-206, 1990.

[Gol89]
David E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.

[Gol94]
David E. Goldberg. Genetic and evolutionary algorithms come of age. Communications of the ACM, 37(3):113-119, March 1994.

[Gre86]
John J. Grefenstette. Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16(1):122-128, 1986.

[HCH+99]
S. Hitchcock, L. Carr, S. Harris, J. M. N. Hey, and W. Hall. Citation linking: improving access to online journals. In Proceedings of the 2nd ACM international conference on Digital libraries, pages 115-122, July 1999.

[HMZ98]
P. Hoffman, L. Masinter, and J. Zawinski. RFC 2368: The mailto URL scheme, July 1998. Updates RFC1738, RFC1808 [BLMM94,Fie95].

[Hol75]
J. H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, Michigan, 1975.

[KAJ94]
C. Koulamas, S. R. Antony, and R. Jaen. A survey of simulated annealing applications to operations research problems. Omega International Journal of Management Science, 22(1):41-56, 1994.

[Kar93]
Charles L. Karr. Genetic algorithms for modelling, design, and process control. In CIKM '93. Proceedings of the Second International Conference on Information and Knowledge Management, pages 233-238. ACM, 1993.

[Knu81]
Donald E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms. Addison-Wesley, Reading, MA, second edition, 1981.

[LG99]
Steve Lawrence and C. Lee Giles. Searching the web: General and scientific information access. IEEE Communications, 37(1):116-122, 1999.

[LGB99]
Steve Lawrence, C. Lee Giles, and Kurt Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, 32(6):67-71, June 1999.

[LPF+01]
Steve Lawrence, David M. Pennock, Gary William Flake, Frans M. Coetzee, Eric Glover, Finn Årup Nielsen, Andries Kruger, and C. Lee Giles. Persistence of web references in scientific research. IEEE Computer, 34(2):26-31, February 2001.

[Mof92]
Alistair Moffat. Economical inversion of large text files. Computing Systems, 5(2):125-139, Spring 1992.

[Pit99]
James E. Pitkow. Summary of WWW characterizations. World Wide Web, 2(1-2):3-13, January 1999.

[PPGK02]
Seung-Taek Park, David Pennock, Lee Giles, and Robert Krovetz. Analysis of lexical signatures for finding lost or related documents. In Proceedings of the 25th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 11-18, New York, August 2002. ACM, ACM Press.

[PW00]
Thomas A. Phelps and Robert Wilensky. Robust hyperlinks: Cheap, everywhere, now. In Proceedings of Digital Documents and Electronic Publishing (DDEP00), New York, September 2000.

[Sch96]
Bruce Schneier. Applied Cryptography. Wiley, New York, second edition, 1996.

[Spi94]
Diomidis Spinellis. The design and implementation of a legal text database. In Dimitris Karagiannis, editor, DEXA 94: 5th International Conference on Database and Expert Systems Applications, pages 339-348. Springer-Verlag, September 1994. Lecture Notes in Computer Science 856.

[Spi03]
Diomidis Spinellis. The decay and failures of web references. Communications of the ACM, 46(1):71-77, January 2003.

[Tak00]
M. Kobayashi and K. Takeda. Information retrieval on the Web. ACM Computing Surveys, 32(2):144-173, June 2000.

[VA87]
P. J. M. Van Laarhoven and E. H. L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel, Dordrecht, The Netherlands, 1987.

[Wag01]
Mitch Wagner. Google defies dot-com downturn. TechWeb, April 2001. Online http://www.techweb.com/wire/story/TWB20010427S0011 (current June 2002).

[ZHW01]
Justin Zobel, Steffen Heinz, and Hugh E. Williams. In-memory hash tables for accumulating text vocabularies. Information Processing Letters, 80(6):271-277, December 2001.


Footnotes:

1 Address: Patision 76, GR-104 34 Athina, Greece. email: dds@aueb.gr