The Wumpus Information Retrieval System – On-disk index data structures

Author: Stefan Buettcher (stefan@buettcher.org)
Last change: 2006-04-26

Wumpus performs all search operations using a collection of inverted files. It maintains an in-memory inverted file, for buffering documents that were recently added to the collection, and a set of on-disk inverted files. This section gives a brief overview of the layout of the on-disk inverted files.

An inverted file realizes mapping from terms to their occurrences in the text collection for which the index has been created. For each term, the inverted file contains an inverted list (also called the term's posting list). In Wumpus, an inverted list is simply a sequence of integers, called postings. Each posting describes a single occurrence of the term within the text collection. Postings per se are not associated with files or documents, but can be combined with the postings for special terms, such as <doc>, </doc>, <file!>, and </file!>.

Postings for the same term are grouped into posting list segments, where each segment (except for the last, maybe) contains somewhere between 0.65 * β and 1.35 * β postings. β is a configuration parameter that can be chosen arbitrarily. By default, it is 2¹⁶. Postings within a segment are compressed using a byte-aligned compression method [1], which usually leads to a total segment size between 64 KB and 200 KB, depending on how frequent the respective term is.

The general layout of an on-disk inverted file is like this:

term descriptor | postings | term descriptor | postings | term descriptor | ... | search structures

All terms occur in lexicographical order (i.e., alphabetically). Each term descriptor contains the respective term and some meta-information about the posting list segment that follows the descriptor. The search structures at the end of the index are used to find all postings for a given term during query processing.

During Wumpus' startup sequence, these search structures are loaded into memory and transformed into a sorted list of terms and their respective on-disk positions in the index. This sorted list does not contain information about all the terms. This is important, because for very large text collections, with dozens of millions of different terms, this could easily exceed the available amount of memory. Instead, the sorted list contains the position of some terms in the index, with the constraint that the difference between the on-disk locations of two subsequent terms in the sorted list is always within a certain, predefined interval (with a few exceptions). By default, this interval is [2¹⁶, 2¹⁷] bytes.

At query time, for each query term, Wumpus inspects the sorted list (stored in memory) by performing a binary search for the given term. If the term can be found in the index, its postings are loaded from the on-disk index for further processing. If the term cannot be found, this can mean two things: 1. the term does not appear in the index, or 2. it does appear in the index, but not in the sorted list that has been loaded into memory. For example, if the terms "impatient" and "impolite" appear next to each other in the sorted list and the term we are looking for is "implicit", then we know that, if the term appears in the index, it needs to appear between "impatient" and "impolite" (because terms and their postings are stored in lexicographical order). Therefore, it is sufficient to go to the on-disk index position of "impatient" and do a linear scan until "impolite" is encountered. This lets us easily find all postings for "implicit", at the overhead of reading 64 KB of data, which in most cases is outweighed by the disk seek that is necessary to move the read head to the position where the postings are stored.

Index maintenance
When documents are added to the index, Wumpus accumulates term descriptors and postings into memory until a predefined space threshold is reached, at which point postings are transferred from memory to disk, resulting in a new on-disk inverted file. Wumpus can maintain multiple on-disk inverted files, but will try to keep their number small, in order to keep the number of disk seeks necessary to fetch a given term's posting list from the index small.

Index maintenance strategies supported by Wumpus are (plus some variations and hybridizations): No Merge (never combines inverted files), Immediate Merge (only maintains a single on-disk inverted file), Logarithmic Merge, and Sqrt Merge. A description of these index maintenance strategies is given by [2], [3], and [4].

References
[1] F. Scholer, H.E. Williams, J. Yiannis and J. Zobel. Compression of Inverted Indexes for Fast Query Evaluation. In Proceedings of the 25th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002). Tampere, Finland, August 2002.
[2] N. Lester, A. Moffat, and J. Zobel. Fast On-Line Index Construction by Geometric Partitioning. In Proceedings of the 14th ACM Conference on Information and Knowledge Management (CIKM 2005). Bremen, Germany, November 2005.
[3] S. Büttcher and C.L.A. Clarke. Indexing Time vs. Query Time Trade-offs in Dynamic Information Retrieval Systems. University of Waterloo Technical Report CS-2005-31. Waterloo, Canada, October 2005.
[4] S. Büttcher, C.L.A. Clarke, and B. Lushman. Hybrid Index Maintenance for Growing Text Collections. In Proceeding of the 29th ACM Conference on Research and Development in Information Retrieval (SIGIR 2006). Seattle, USA, August 2006.