TREC Terabyte – Baseline Systems

This page explains how to conduct the performance comparison baseline runs for the TREC 2006 Terabyte track. All participating groups are encouraged to conduct at least one such run (more if time permits), using one of the publicly available retrieval systems listed below. Each run is conducted on a single computer. We hope the results will help us arrive at a satisfactory methodology for reliable performance comparisons between information retrieval systems.

For each system, building the index and running the efficiency queries should each take somewhere between 5 and 30 hours, depending on the characteristics of the hardware you are using. Please make sure that the disk cache of your operating system is empty before you start a new run; otherwise, the times reported by the time command might not reflect the true performance of your system. One way to empty the cache is to create a file that is larger than the available amount of memory and run md5sum on that file. Another way is to reboot the machine.
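
The first cache-clearing method could be scripted roughly as follows. This is only a sketch: the function names mem_total_mb and flush_cache, the scratch path, and the size margin are illustrative assumptions, and mem_total_mb relies on the Linux-specific /proc/meminfo.

```shell
# Print the amount of physical memory in MB.
# Assumption: Linux, where /proc/meminfo reports MemTotal in kB.
mem_total_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

# Write a scratch file of the given size (in MB), read it back with md5sum,
# and delete it. When the file is larger than RAM, reading it should push
# previously cached data out of the disk cache.
flush_cache() {
    size_mb="$1"
    scratch="$2"    # hypothetical scratch path; needs enough free disk space
    dd if=/dev/zero of="$scratch" bs=1M count="$size_mb" 2>/dev/null
    sync
    md5sum "$scratch" > /dev/null
    rm -f "$scratch"
}
```

For example, flush_cache "$(( $(mem_total_mb) + 1024 ))" /tmp/cacheflush.bin would write and re-read a file about 1 GB larger than physical memory.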

Clarification: When building an index with any of the three systems, you need to use uncompressed input files instead of the gzip-compressed format that GOV2 is shipped in.
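
One possible way to prepare uncompressed input is to mirror the collection into a second directory, decompressing each file along the way. The sketch below assumes that the compressed files carry a .gz suffix and that the target disk has room for the uncompressed data. The function name decompress_tree and both directory arguments are placeholders.

```shell
# Mirror every *.gz file under $1 into $2, decompressed, keeping the
# relative directory layout. The source directory is left untouched.
decompress_tree() {
    src="${1%/}"    # strip a trailing slash so the prefix removal below works
    dst="${2%/}"
    find "$src" -type f -name '*.gz' | while IFS= read -r gz; do
        rel="${gz#"$src"/}"        # path relative to the source directory
        out="$dst/${rel%.gz}"      # same relative path without the .gz suffix
        mkdir -p "$(dirname "$out")"
        gzip -dc "$gz" > "$out"
    done
}
```

A call such as decompress_tree /data/gov2 /data/gov2-plain (both paths hypothetical) would leave the uncompressed copy in the second directory, which can then be used as the input for index construction.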

Systems available for performance baseline runs:

Indri
Versions: Indri 2.2, Indri 2.2-TREC-TB
Download: Indri 2.2 for Linux, Indri 2.2 for Windows, Indri 2.2-TREC-TB for Linux
Compilers tested: gcc 3.4.2, 4.0.2 (32-bit and 64-bit Linux)
Indri should run on every Linux system with gcc ≥ 3.0 and Windows systems with Visual Studio installed. It supports both 32-bit and 64-bit CPUs. In order to get a performance baseline using Indri, you need about 250 GB of free disk space and about 2 GB of RAM. If you do not have enough disk space to index GOV2 with the stock version of Indri, you can use the modified version, Indri 2.2-TREC-TB, which only requires about 120 MB of disk space. It differs from Indri 2.2 in that it does not store copies of the text found in the documents.
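
Before starting a run, it may be worth checking that the target disk actually has the required amount of free space. The helper below is a sketch only; the name check_free_gb is an assumption, and it relies on the POSIX output format of df -P.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 GB free.
check_free_gb() {
    dir="$1"
    required_gb="$2"
    # df -Pk prints one header line and one data line; column 4 is the
    # available space in 1024-byte blocks.
    free_gb="$(df -Pk "$dir" | awk 'NR == 2 { print int($4 / 1048576) }')"
    if [ "$free_gb" -lt "$required_gb" ]; then
        echo "only $free_gb GB free in $dir, need $required_gb GB" >&2
        return 1
    fi
    echo "$free_gb GB free in $dir"
}
```

For the stock version of Indri, this would be called as check_free_gb . 250 from the directory that will hold the index.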

Downloading and building Indri (instructions for Linux)

Download Indri 2.2 or Indri 2.2-TREC-TB. For Indri 2.2, execute the following command sequence:
tar xzf indri-2.2.tar.gz
cd indri-2.2
./configure --prefix=$PWD
make
For Indri 2.2-TREC-TB, execute the following command sequence:
tar xzf indri-2.2-TREC-TB.tar.gz
cd indri-2.2-TREC-TB
./configure --prefix=$PWD
make
Constructing an index for GOV2

Edit the file parameters/build_param. Change the content of the field corpus.path to the base directory of your local copy of the GOV2 collection. Then, execute
time ./buildindex/buildindex parameters/build_param parameters/stop_param
and report the real and user times for index construction efficiency. Please note that the parameter files in the parameters directory are only available if you downloaded Indri from this website, not if you downloaded the software from the Lemur project page.
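
If the output of time has been saved to a file, the real and user times can be converted to seconds with a short awk filter. This sketch assumes the default output format of the bash time keyword (lines such as "real 312m45.678s", separated by a tab). The function name report_times and the file name in the usage note are illustrative.

```shell
# Print the real and user times from saved `time` output, in seconds.
# Assumption: bash's default format, e.g. "real<TAB>312m45.678s".
report_times() {
    awk '$1 == "real" || $1 == "user" {
        split($2, part, /[ms]/)    # "312m45.678s" -> part[1]=312, part[2]=45.678
        printf "%s %.3f s\n", $1, part[1] * 60 + part[2]
    }' "$1"
}
```

In bash, the timing output can be captured with { time some_command; } 2> build.time, which also captures whatever the command itself writes to standard error. Afterwards, report_times build.time prints the two values to report.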

Running the efficiency queries

From within the Indri directory, run
time ./runquery/runquery -index=indexdir -count=20 -trecFormat=1 -queryOffset=1 06.efficiency_topics.10k.indri > indri.out
to run the efficiency queries. Report the real and user times, as printed by the time command. The file 06.efficiency_topics.10k.indri contains the 10,000 efficiency queries from the file 06.efficiency_topics.10k in Indri-compatible query format. It can be generated from the official NIST query stream by using the nist2indri.perl Perl script.
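
A quick sanity check on the result file can catch a run that stopped early. The sketch below assumes that the output uses the usual six-column TREC run format with the topic ID in the first column, which the -trecFormat=1 option suggests but which you should verify against your own output. The function name summarize_run is an assumption.

```shell
# Count topics, total lines, and the largest number of results for any topic.
# Assumption: one result per line, topic ID in the first column.
summarize_run() {
    awk '{ n[$1]++ } END {
        topics = 0; max = 0
        for (q in n) { topics++; if (n[q] > max) max = n[q] }
        printf "topics=%d max_per_topic=%d lines=%d\n", topics, max, NR
    }' "$1"
}
```

With 10,000 queries and -count=20, summarize_run indri.out should report at most 10,000 topics and at most 20 results per topic. Queries that return no documents would not appear in the count.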
Wumpus
Version: Wumpus 2006-05-03-TREC
Download: Wumpus 2006-05-03-TREC for Linux
Compilers tested: gcc 3.4.2, 4.0.2 (32-bit and 64-bit Linux)
Wumpus should run on every Intel-based Linux system with gcc ≥ 3.0. It supports both 32-bit and 64-bit CPUs. In order to get a performance baseline using Wumpus, you need to have 40 GB of free disk space and about 512 MB of RAM.

Downloading and building Wumpus (instructions for Linux)

Download Wumpus 2006-05-03-TREC. Then execute the following command sequence:
tar xzf wumpus-2006-05-03-TREC.tgz
cd wumpus-2006-05-03-TREC
make
Constructing an index for GOV2

From within the Wumpus directory, run
time ./bin/trec INDEX inputFile wumpus.out wumpus.log
and report the real and user times for index construction efficiency. Here, inputFile is the name of a file containing a list of all files the GOV2 collection consists of. It can be created by running
find path_to_gov2 -type f | sort > inputFile
where path_to_gov2 is the base directory of the GOV2 document data. The inputFile should contain 27,204 lines, with one filename per line. When the index construction process is done, the database/ subdirectory contains around 13 GB of data.
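
Creating the file list and checking its line count can be combined in one step. The following is a sketch under the assumption that the list is written outside the GOV2 directory, since it would otherwise be counted as one of the files. The function name make_file_list is illustrative.

```shell
# Build a sorted list of all files under $1, write it to $2, and fail
# if the number of entries differs from the expected count $3.
make_file_list() {
    gov2_dir="$1"
    list="$2"        # should live outside gov2_dir
    expected="$3"
    find "$gov2_dir" -type f | sort > "$list"
    actual="$(wc -l < "$list" | tr -d ' ')"
    if [ "$actual" -ne "$expected" ]; then
        echo "expected $expected files, found $actual" >&2
        return 1
    fi
}
```

For GOV2, the call would be make_file_list path_to_gov2 inputFile 27204.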

Running the efficiency queries

From within the Wumpus directory, run
time ./bin/trec QUERY 06.efficiency_topics.10k wumpus.out wumpus.log
to run the efficiency queries. Report the real and user times, as printed by the time command. The file 06.efficiency_topics.10k contains the 10,000 efficiency queries in Wumpus-compatible format, as supplied by NIST. No conversion is necessary.
Zettair
Version: Zettair 0.6.1
Download: Zettair 0.6.1 (for Linux, Windows, Mac OS)
Compilers tested: gcc 3.4.2, 4.0.2 (32-bit and 64-bit Linux)
Zettair should run on every Linux or Windows system with gcc ≥ 3.0. It supports both 32-bit and 64-bit CPUs. In order to get a performance baseline using Zettair, you need to have 60 GB of free disk space and about 2 GB of RAM.

Downloading and building Zettair (instructions for Linux and Cygwin)

Download Zettair 0.6.1. Then execute the following command sequence:
tar xzf zettair-0.6.1.tar.gz
cd zettair-0.6.1
./configure --prefix=$PWD
make
make install
Please note that Zettair will crash if you compile it on a 64-bit system by following the above procedure. In order to make it run under a 64-bit operating system, you need to make the following changes to Makefile after running ./configure:
change the CFLAGS line to: CFLAGS = -g -O2 -m32
change the CPPFLAGS line to: CPPFLAGS = -m32
change the LDFLAGS line to: LDFLAGS = -m32
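
The three Makefile changes above can also be applied with sed instead of by hand. This sketch assumes that the generated Makefile contains lines beginning with "CFLAGS =", "CPPFLAGS =", and "LDFLAGS =". It is worth inspecting the file afterwards, and the function name force_32bit is an assumption.

```shell
# Rewrite the CFLAGS, CPPFLAGS and LDFLAGS lines of a Makefile so that
# the build produces 32-bit code.
force_32bit() {
    makefile="$1"
    tmp="$makefile.tmp"
    # The ^ anchor keeps variables such as AM_CFLAGS untouched.
    sed -e 's/^CFLAGS[[:space:]]*=.*/CFLAGS = -g -O2 -m32/' \
        -e 's/^CPPFLAGS[[:space:]]*=.*/CPPFLAGS = -m32/' \
        -e 's/^LDFLAGS[[:space:]]*=.*/LDFLAGS = -m32/' \
        "$makefile" > "$tmp" && mv "$tmp" "$makefile"
}
```

The function would be run as force_32bit Makefile between ./configure and make.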
Constructing an index for GOV2

From within the Zettair directory, run
time ./bin/zet -i -t TREC < inputFile
and report the real and user times for index construction efficiency. Here, inputFile is the name of a file containing a list of all files the GOV2 collection consists of. It can be created by running
find path_to_gov2 -type f | sort > inputFile
where path_to_gov2 is the base directory of the GOV2 document data. The inputFile should contain 27,204 lines, with one filename per line. When the index construction process is done, the Zettair directory contains around 46 GB of data.

Running the efficiency queries

From within the Zettair directory, run
time ./bin/zet_trec -f 06.efficiency_topics.10k.zettair -r zettair -n 20 -t index > zettair.out
to run the efficiency queries. Report the real and user times, as printed by the time command. The file 06.efficiency_topics.10k.zettair contains the 10,000 efficiency queries from the file 06.efficiency_topics.10k in Zettair-compatible query format. It can be generated from the official NIST query stream by using the nist2zettair.perl Perl script.

Stefan Buettcher, 2006-09-01
To report bugs on this page, please send an email to sbuettch@plg.uwaterloo.ca.