The Wumpus Information Retrieval System – GCL

Author: Stefan Buettcher (stefan@buettcher.org)
Last change: 2006-04-26

The queries shown in the previous section are all relatively simple. In many cases, you are searching for something that is more complicated than a simple term or a phrase – you are looking for a text passage with a certain structure. GCL can help you here.

The GCL (generalized concordance lists) query language [1] supports a variety of operators to express structural constraints. All these operators are supported by Wumpus and can be used in most places to replace simple word queries by more complicated structural queries. GCL is based on the shortest-substring paradigm. Shortest-substring requires that if two nested index extents (an extent is an arbitrary interval in the index address space) satisfy a given expression, only the inner one is returned by the respective operator. If, for instance, we are looking for all passages in the text of Shakespeare's "Hamlet" that contain the words "to" and "be", the phrase "to be" would be returned, but "to be or not to be" would not be returned, because it contains "to be", which already satisfies the query conditions.
For example, these operators can be used in order to search for all passages, where "wumpus" and "search" appear within 5 words from each other:
("wumpus"^"search")<[5]
The most important structure in most information retrieval applications are documents. In order to access documents in Wumpus, they have to be marked in some way inside the source files. If, for instance, the source file is an XML document in which every document starts with a <doc> tag and ends with a </doc> tag, then you can use GCL to create the set of all documents for you:
"<doc>".."</doc>"
This can easily be extended so that Wumpus returns a list of all documents that contain a certain word:
("<doc>".."</doc>") > "wumpus"
Or all documents of a given length, say 1000:
(("<doc>".."</doc>")>[1000])<[1000]
The above GCL expression requires that every member of the result set is of the form "<doc>".."</doc>", contains an interval of length 1000, and is contained in an interval of length 1000. This is exactly the set of all documents of length 1000.

In many cases, the ^ and + operators can be thought of as boolean AND and OR, respectively:
("<doc>".."</doc>") > ("man"^"woman")
returns all documents that contain both "man" and "woman", while
("<doc>".."</doc>") > ("dog"+"cat"+"mouse")
returns all documents that contain at least one of the three words "dog", "cat", and "mouse". More complicated boolean queries can be realized by deeper nested expressions:
(("<doc>".."</doc>") > ("united"^"states")) /> ("america")
would return all documents that contain both "united" and "states", but do not contain "america".


References
[1] C.L.A. Clarke, G.V. Cormack, and F.J. Burkowski. An Algebra for Structured Text Search and a Framework for its Implementation. The Computer Journal, 38(1):43-56, 1995.