The Wumpus Information Retrieval System

The Wumpus Information Retrieval System – Relevance queries

Author: Stefan Buettcher (stefan@buettcher.org)
Last change: 2005-05-13

Wumpus supports several kinds of relevance queries. The general syntax of a relevance query is:

@rank[method][count=N][id=ID][additional_parameters] what by scorer_1 ... scorer_m

The individual components of the query are:

method: This can be one of OKAPI, QAP, or QAP2. BM25 is an alias for OKAPI.
count: The number of index extents that should be returned by the scoring algorithm. If not specified, N=20 is assumed.
id: This is useful for batch processing. If a query ID is given, it will appear at the start of every result line, making it easier to tell which result line belongs to which query when multiple queries are submitted consecutively.
what: This is an arbitrary GCL expression that defines the list of index extents that are to be scored and whose relevance scores are to be reported to the user. For TREC-like queries, this is usually <doc>..</doc>. If the list of extents to be scored is not specified, the behavior depends on the actual scoring function: BM25 will score entire files, while QAP will just report relevant passages without any information about the extent that contains each passage.
scorer_i: This can again be an arbitrary GCL expression, but is usually a single term (or maybe a phrase) for TREC-like relevance queries.

Shortcuts with the obvious meanings exist: @okapi, @bm25, @qap.

Depending on the actual scoring function used, it is possible to adjust the internal parameters of the function so that they meet the specific requirements of the application. For BM25, for example, the b and k₁ parameters can be modified as shown below.
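The syntax above can be assembled mechanically. The following is a minimal sketch in Python, assuming the caller passes GCL expressions that are already quoted; the function name build_rank_query and its arguments are hypothetical and not part of Wumpus. It only builds the query string and does not talk to a Wumpus server.

```python
def build_rank_query(what, scorers, method="bm25", count=None,
                     query_id=None, extra=None):
    """Assemble a relevance query string in the documented form.

    `what` and each entry of `scorers` are GCL expressions that the
    caller has already quoted. `extra` maps additional parameter names
    to values; a value of None produces a bare flag such as [filename].
    All names here are illustrative, not part of Wumpus itself.
    """
    parts = ["@rank", f"[{method}]"]
    if count is not None:
        parts.append(f"[count={count}]")
    if query_id is not None:
        parts.append(f"[id={query_id}]")
    for key, value in (extra or {}).items():
        parts.append(f"[{key}]" if value is None else f"[{key}={value}]")
    return f"{''.join(parts)} {what} by {', '.join(scorers)}"


docs = '"<doc>".."</doc>"'
terms = ['"information"', '"retrieval"']
print(build_rank_query(docs, terms, count=5, query_id=42))
print(build_rank_query(docs, terms, method="okapi", count=8,
                       extra={"k1": 1.5, "b": 0.5}))
```

The second call shows how the BM25 parameters k1 and b would be passed as additional parameters.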

Example queries

@rank[bm25][id=42][count=5] "<doc>".."</doc>" by "information", "retrieval"
42 16.733690 331444 331906
42 16.414286 33999512 34000002
42 16.310987 1788586 1789040
42 16.283293 34968932 34969266
42 16.223730 331907 332390
@0-Ok. (16 ms)

@okapi[k1=1.5][b=0.5][count=8] "<doc>".."</doc>" by "information", "retrieval"
0 18.165737 331444 331906
0 17.831360 33999512 34000002
0 17.577662 1788586 1789040
0 17.548101 331907 332390
0 17.147125 34968932 34969266
0 17.138239 27589177 27589551
0 16.899017 1445358 1445827
0 15.956622 1346297 1346997
@0-Ok. (18 ms)

@qap[count=5][id=32] "<doc>".."</doc>" by "information", "retrieval"
32 19.661619 1788586 1789040 1789004 1789005
32 19.661619 1445358 1445827 1445790 1445791
32 19.661619 27589177 27589551 27589246 27589247
32 19.661619 331444 331906 331869 331870
32 19.661619 331907 332390 332332 332333
@0-Ok. (15 ms)

The format of the result vector is similar for all scoring functions:

queryID score start end

In the case of QAP, each result line additionally carries the start and end positions of the top-scoring passage within the given extent, appended as two further fields.
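A client application might parse these result lines as follows. This is a minimal sketch in Python that handles only the plain four-field format and the six-field QAP format; lines produced with additional parameters such as filename or addget are not covered. The names RankResult and parse_result_line are hypothetical.

```python
from typing import NamedTuple, Optional


class RankResult(NamedTuple):
    query_id: str
    score: float
    start: int
    end: int
    passage_start: Optional[int] = None  # only present for QAP results
    passage_end: Optional[int] = None


def parse_result_line(line):
    """Parse one result line; return None for blank and status lines.

    Status lines such as "@0-Ok. (16 ms)" start with "@" and are skipped.
    """
    line = line.strip()
    if not line or line.startswith("@"):
        return None
    fields = line.split()
    if len(fields) not in (4, 6):
        raise ValueError(f"unexpected number of fields: {line!r}")
    query_id, score = fields[0], float(fields[1])
    positions = [int(f) for f in fields[2:]]
    return RankResult(query_id, score, *positions)


print(parse_result_line("42 16.733690 331444 331906"))
print(parse_result_line("32 19.661619 1788586 1789040 1789004 1789005"))
```

Keeping the query ID as a string avoids assuming that IDs are always numeric.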

The additional parameters, which are in general specific to the scoring algorithm in use, make it possible to attach further information to each result line:

@okapi[filename][count=4] "<doc>".."</doc>" by "information", "retrieval"
0 16.733690 331444 331906 /u1/stefan/corpora/chunks/chunk.00105
0 16.414286 33999512 34000002 /u1/stefan/corpora/chunks/chunk.00888
0 16.310987 1788586 1789040 /u1/stefan/corpora/chunks/chunk.00131
0 16.283293 34968932 34969266 /u1/stefan/corpora/chunks/chunk.00910
@0-Ok. (13 ms)

@okapi[count=5][addget="<docno>".."</docno>"] "<doc>".."</doc>" by "information", "retrieval"
0 16.733690 331444 331906 "<DOCNO>12318783</DOCNO>"
0 16.414286 33999512 34000002 "<DOCNO>10131461</DOCNO>"
0 16.310987 1788586 1789040 "<DOCNO>12295796</DOCNO>"
0 16.283293 34968932 34969266 "<DOCNO>8307719</DOCNO>"
0 16.223730 331907 332390 "<DOCNO>12318782</DOCNO>"
@0-Ok. (36 ms)

The addget parameter is especially useful if you don't want to submit a bunch of @get queries after your initial relevance query to fetch all the document IDs. Please note, however, that since every text fetch operation usually causes a disk seek, addget is not particularly fast. If you need maximum performance, you should cache the information you want to extract from the document (document ID in the above case) inside your application.
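The caching suggested above could look like the following minimal sketch in Python. It assumes that the extent positions (start, end) remain stable between queries, which may not hold if the index is updated, and it extracts document IDs from result lines of an earlier addget query. The names DOCNO_RE and remember_docnos are hypothetical.

```python
import re

# Matches the document ID inside the field returned by addget.
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.IGNORECASE)


def remember_docnos(lines, cache):
    """Store document IDs from addget result lines in `cache`.

    The cache is keyed by (start, end) of the scored extent, so later
    queries issued without addget can look the ID up locally.
    """
    for line in lines:
        if line.startswith("@"):
            continue  # status line such as "@0-Ok. (36 ms)"
        fields = line.split(None, 4)  # id, score, start, end, rest
        if len(fields) < 5:
            continue  # no addget field on this line
        match = DOCNO_RE.search(fields[4])
        if match:
            cache[(int(fields[2]), int(fields[3]))] = match.group(1).strip()
    return cache


cache = remember_docnos([
    '0 16.733690 331444 331906 "<DOCNO>12318783</DOCNO>"',
    '0 16.414286 33999512 34000002 "<DOCNO>10131461</DOCNO>"',
    "@0-Ok. (36 ms)",
], {})
print(cache.get((331444, 331906)))
```

Once the cache is populated, subsequent relevance queries can omit addget and resolve document IDs from the dictionary instead of triggering a text fetch per result.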

The full power of GCL can be used to impose structural constraints on both the extents that are to be scored and on the scorers:

@okapi[count=5][addget="<title>".."</title>"] ((("<doc>".."</doc>")<[500])>"medicine") by ("information"^"retrieval"), ("relevance"+"similarity")
0 12.392150 1788586 1789040 "<Title>Informed populations around the world.</Title>"
0 8.173442 3156607 3156862 "<Title>The National Center for Biotechnology Information.</Title>"
0 7.408664 22884815 22885148 "<Title>Academic confessions of high school students: an analysis of adolescents' developmental concerns.</Title>"
0 7.077546 22378559 22378931 "<Title>Effects of instruction on the decoding skills of children with phonological-processing problems.</Title>"
0 6.435469 8900179 8900340 "<Title>Social and preventive medicine: a scientific approach to questions of practical relevance.</Title>"
@0-Ok. (48 ms)

This query, for example, computes relevance scores for all documents that are at most 500 tokens long and contain the word "medicine". All such documents are scored using a conjunction of "information" and "retrieval" and a disjunction of "relevance" and "similarity". For all documents in the result set, the text of the Title field is returned.
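To illustrate how such a query is composed from its parts, here is a minimal sketch in Python that rebuilds the query string above from small helpers. The helper names are hypothetical, and the meaning attributed to each operator is taken only from the description of this example; the code manipulates strings and does not evaluate GCL.

```python
def contained_in_length(expr, max_tokens):
    """Extents of `expr` that fit within `max_tokens` tokens."""
    return f"(({expr})<[{max_tokens}])"


def containing(expr, inner):
    """Extents of `expr` that contain a match of `inner`."""
    return f"({expr}>{inner})"


def both_of(a, b):
    """Conjunction of two scorers."""
    return f"({a}^{b})"


def one_of(a, b):
    """Disjunction of two scorers."""
    return f"({a}+{b})"


docs = '"<doc>".."</doc>"'
what = containing(contained_in_length(docs, 500), '"medicine"')
scorers = [both_of('"information"', '"retrieval"'),
           one_of('"relevance"', '"similarity"')]
query = (f'@okapi[count=5][addget="<title>".."</title>"] '
         f'{what} by {", ".join(scorers)}')
print(query)
```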