The Wumpus Information Retrieval System – GCL
Author: Stefan Buettcher (stefan@buettcher.org)
Last change: 2006-04-26
The queries shown in the previous section are all relatively simple. In many
cases, you are searching for something that is more complicated than a simple
term or a phrase – you are looking for a text passage with a certain structure.
GCL can help you here.
The GCL (generalized concordance lists) query language [1] supports a
variety of operators to express structural constraints. All these operators are supported
by Wumpus and can be used in most places to replace simple word queries by more
complicated structural queries. GCL is based on the shortest-substring paradigm.
Shortest-substring requires that if two nested index extents (an extent is an arbitrary
interval in the index address space) satisfy a given expression, only the inner one
is returned by the respective operator. If, for instance, we are looking for all passages
in the text of Shakespeare's "Hamlet" that contain the words "to" and "be", the phrase
"to be" would be returned, but "to be or not to be" would not be returned, because it
contains "to be", which already satisfies the query conditions.
- (A ^ B) returns all extents that match both queries A and B.
- (A + B) returns all extents that match either A or B (or both).
- (A .. B) returns all extents that start with an extent that matches
A and end with an extent that matches B.
- (A > B) returns all extents that match A and contain at least
one extent that matches B.
- (A /> B) returns all extents that match A and do not contain
an extent that matches B.
- (A < B) returns all extents that match A and are contained in an
extent that matches B.
- (A /< B) returns all extents that match A and are not contained
in an extent that matches B.
- [n] returns all index extents of length n.
For example, these operators can be used in order to search for all passages, where
"wumpus" and "search" appear within 5 words from each other:
("wumpus"^"search")<[5]
The most important structure in most information retrieval applications are
documents. In order to access documents in Wumpus, they have to be marked
in some way inside the source files. If, for instance, the source file is an XML
document in which every document starts with a <doc> tag and ends with a
</doc> tag, then you can use GCL to create the set of all documents for you:
"<doc>".."</doc>"
This can easily be extended so that Wumpus returns a list of all documents that contain
a certain word:
("<doc>".."</doc>") > "wumpus"
Or all documents of a given length, say 1000:
(("<doc>".."</doc>")>[1000])<[1000]
The above GCL expression requires that every member of the result set is of the form
"<doc>".."</doc>", contains an interval of length 1000, and is contained
in an interval of length 1000. This is exactly the set of all documents of length 1000.
In many cases, the ^ and + operators can be thought of as boolean
AND and OR, respectively:
("<doc>".."</doc>") > ("man"^"woman")
returns all documents that contain both "man" and "woman", while
("<doc>".."</doc>") > ("dog"+"cat"+"mouse")
returns all documents that contain at least one of the three words "dog", "cat", and "mouse".
More complicated boolean queries can be realized by deeper nested expressions:
(("<doc>".."</doc>") > ("united"^"states")) /> ("america")
would return all documents that contain both "united" and "states", but
do not contain "america".
References
[1] C.L.A. Clarke, G.V. Cormack, and F.J. Burkowski. An Algebra for Structured Text
Search and a Framework for its Implementation. The Computer Journal,
38(1):43-56, 1995.