The Wumpus Information Retrieval System – On-disk index data structures
Author: Stefan Buettcher (stefan@buettcher.org)
Last change: 2006-04-26
Wumpus performs all search operations using a collection of inverted files.
It maintains an in-memory inverted file, for buffering documents that were
recently added to the collection, and a set of on-disk inverted files. This
section gives a brief overview of the layout of the on-disk inverted files.
An inverted file realizes mapping from terms to their occurrences in the text
collection for which the index has been created. For each term, the inverted
file contains an inverted list (also called the term's posting list).
In Wumpus, an inverted list is simply a sequence of integers, called postings.
Each posting describes a single occurrence of the term within the text collection.
Postings per se are not associated with files or documents, but can be combined with
the postings for special terms, such as <doc>, </doc>,
<file!>, and </file!>.
Postings for the same term are grouped into posting list segments, where each segment
(except for the last, maybe) contains somewhere between 0.65 * β and 1.35 * β
postings. β is a configuration parameter that can be chosen arbitrarily. By default,
it is 216. Postings within a segment are compressed using a byte-aligned
compression method [1], which usually leads to a total segment size between 64 KB and
200 KB, depending on how frequent the respective term is.
The general layout of an on-disk inverted file is like this:
term descriptor | postings | term descriptor | postings | term descriptor | ... | search structures
All terms occur in lexicographical order (i.e., alphabetically).
Each term descriptor contains the respective term and some meta-information about
the posting list segment that follows the descriptor. The search structures at the
end of the index are used to find all postings for a given term during query processing.
During Wumpus' startup sequence, these search structures are loaded into memory and
transformed into a sorted list of terms and their respective on-disk positions in the
index. This sorted list does not contain information about all the terms. This is important,
because for very large text collections, with dozens of millions of different terms, this
could easily exceed the available amount of memory. Instead, the sorted list contains
the position of some terms in the index, with the constraint that the difference between
the on-disk locations of two subsequent terms in the sorted list is always within a
certain, predefined interval (with a few exceptions). By default, this interval is
[216, 217] bytes.
At query time, for each query term, Wumpus inspects the sorted list (stored in memory) by
performing a binary search for the given term. If the term can be found in the index,
its postings are loaded from the on-disk index for further processing. If the term cannot
be found, this can mean two things: 1. the term does not appear in the index, or 2. it does
appear in the index, but not in the sorted list that has been loaded into memory. For example,
if the terms "impatient" and "impolite" appear next to each other in the sorted list and the
term we are looking for is "implicit", then we know that, if the term appears in the index,
it needs to appear between "impatient" and "impolite" (because terms and their postings are
stored in lexicographical order). Therefore, it is sufficient to go to the on-disk index
position of "impatient" and do a linear scan until "impolite" is encountered. This lets us
easily find all postings for "implicit", at the overhead of reading 64 KB of data, which
in most cases is outweighed by the disk seek that is necessary to move the read head to
the position where the postings are stored.
Index maintenance
When documents are added to the index, Wumpus accumulates term descriptors and postings
into memory until a predefined space threshold is reached, at which point postings are
transferred from memory to disk, resulting in a new on-disk inverted file. Wumpus can
maintain multiple on-disk inverted files, but will try to keep their number small, in
order to keep the number of disk seeks necessary to fetch a given term's posting list
from the index small.
Index maintenance strategies supported by Wumpus are (plus some variations and hybridizations):
No Merge (never combines inverted files), Immediate Merge (only maintains a single on-disk
inverted file), Logarithmic Merge, and Sqrt Merge.
A description of these index maintenance strategies is given by [2], [3], and [4].
References
[1] F. Scholer, H.E. Williams, J. Yiannis and J. Zobel.
Compression of Inverted Indexes for Fast Query Evaluation.
In Proceedings of the 25th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002).
Tampere, Finland, August 2002.
[2] N. Lester, A. Moffat, and J. Zobel.
Fast On-Line Index Construction by Geometric Partitioning.
In Proceedings of the 14th ACM Conference on Information and Knowledge Management (CIKM 2005).
Bremen, Germany, November 2005.
[3] S. Büttcher and C.L.A. Clarke.
Indexing Time vs. Query Time Trade-offs in Dynamic Information Retrieval Systems.
University of Waterloo Technical Report CS-2005-31.
Waterloo, Canada, October 2005.
[4] S. Büttcher, C.L.A. Clarke, and B. Lushman.
Hybrid Index Maintenance for Growing Text Collections.
In Proceeding of the 29th ACM Conference on Research and Development in Information Retrieval (SIGIR 2006).
Seattle, USA, August 2006.