TREC Terabyte – Baseline Systems
This page explains how to conduct the performance comparison baseline
runs for the TREC 2006 Terabyte track. All participating groups are encouraged
to conduct at least one such run (more if time permits), using one of the
publicly available retrieval systems listed below. Each run is conducted
on a single computer. We hope the results will help us develop a sound
methodology for reliable inter-system performance comparisons of information
retrieval systems.
For each system, building the index and running the efficiency queries should
each take somewhere between 5 and 30 hours, depending on the characteristics of
the hardware you are using.
Please make sure that the disk cache of your operating system is empty before you
start a new run. Otherwise, the times reported by the time command might not
reflect the true performance of your system. One way to do this is to create
a file that is larger than the available amount of memory and run md5sum
on that file. Another is to reboot the machine.
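The file-and-md5sum approach can be sketched as a small shell function. This is only an illustration: the function name flush_disk_cache and the scratch file path are assumptions, and the size you pass should exceed the physical memory of your machine.

```shell
# Sketch: push cached data out of memory by writing a large scratch file
# and reading it back through md5sum.
# $1 = size in MB (should exceed physical RAM), $2 = scratch file path.
flush_disk_cache() {
    size_mb=$1
    scratch=$2
    # Write size_mb blocks of 1 MB each.
    dd if=/dev/zero of="$scratch" bs=1048576 count="$size_mb" 2>/dev/null || return 1
    # Read the whole file back; the checksum itself is discarded.
    md5sum "$scratch" > /dev/null
    status=$?
    rm -f "$scratch"
    return $status
}

# Example for a machine with 2 GB of RAM (hypothetical path):
#   flush_disk_cache 3072 /tmp/cache_filler.bin
```

Make sure the scratch file is placed on a disk with enough free space, and not on a RAM-backed file system.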
Clarification: When building an index with any of the three
systems, you need to use uncompressed input files instead of the gzip-compressed
format that GOV2 is shipped in.
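One possible way to decompress a local copy of the collection in place is sketched below. It assumes that the compressed files carry a .gz suffix and that gunzip is installed; the function name decompress_gov2 is illustrative. Keep in mind that the uncompressed collection needs considerably more disk space than the compressed one.

```shell
# Sketch: decompress every .gz file below the given directory in place.
# $1 = base directory of the local GOV2 copy.
decompress_gov2() {
    gov2_dir=$1
    # gunzip replaces each file.gz with the uncompressed file.
    find "$gov2_dir" -type f -name '*.gz' -exec gunzip {} +
}

# Example (path_to_gov2 is a placeholder):
#   decompress_gov2 path_to_gov2
```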
Systems available for performance baseline runs:
- Indri
Versions: Indri 2.2, Indri 2.2-TREC-TB
Download: Indri 2.2 for Linux, Indri 2.2 for Windows, Indri 2.2-TREC-TB for Linux
Compilers tested: gcc 3.4.2, 4.0.2 (32-bit and 64-bit Linux)
Indri should run on every Linux system with gcc ≥ 3.0 and Windows systems
with Visual Studio installed. It supports both 32-bit and 64-bit CPUs. In
order to get a performance baseline using Indri, you need about 250 GB of
free disk space and about 2 GB of RAM.
If you do not have enough disk space to index GOV2 with the stock version of Indri,
you can use the modified version, Indri 2.2-TREC-TB, which only requires about 120 GB
of disk space. It differs from Indri 2.2 in that it does not store copies of the
text found in the documents.
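Before starting, it may be worth checking that the target disk really has the required free space. The following is a minimal sketch that assumes a POSIX df; the function name has_free_gb and the example directory are illustrative and not part of Indri.

```shell
# Sketch: succeed if the file system holding a directory has at least
# the requested number of gigabytes available.
# $1 = directory, $2 = required free space in GB.
has_free_gb() {
    dir=$1
    needed_gb=$2
    # df -Pk prints one header line, then one line per file system;
    # column 4 is the available space in 1 KB blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    needed_kb=$((needed_gb * 1024 * 1024))
    [ "$avail_kb" -ge "$needed_kb" ]
}

# Example for the stock version of Indri:
#   has_free_gb . 250 || echo "not enough disk space for Indri 2.2" >&2
```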
Downloading and building Indri (instructions for Linux)
Download Indri 2.2 or Indri 2.2-TREC-TB.
For Indri 2.2, execute the following command sequence:
tar xzf indri-2.2.tar.gz
cd indri-2.2
./configure --prefix=$PWD
make
For Indri 2.2-TREC-TB, execute the following command sequence:
tar xzf indri-2.2-TREC-TB.tar.gz
cd indri-2.2-TREC-TB
./configure --prefix=$PWD
make
Constructing an index for GOV2
Edit the file parameters/build_param. Change the content of the field
corpus.path to the base directory of your local copy of the GOV2 collection.
Then, execute
time ./buildindex/buildindex parameters/build_param parameters/stop_param
and report the real and user times for index construction
efficiency.
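If you save the output of the time command to a file, the real and user times can be converted to plain seconds for reporting. The sketch below assumes the default output format of the bash time keyword (for example "real 83m12.345s"); other shells and /usr/bin/time print a different format. The function name times_to_seconds is illustrative.

```shell
# Sketch: read bash-style `time` output on stdin and print the real and
# user times in seconds.
times_to_seconds() {
    awk '$1 == "real" || $1 == "user" {
        # $2 looks like 83m12.345s: minutes, "m", seconds, "s".
        split($2, part, "m")
        sub(/s$/, "", part[2])
        printf "%s %.3f\n", $1, part[1] * 60 + part[2]
    }'
}

# Example in bash (build.time is a hypothetical file name; it also
# receives anything the command itself writes to stderr):
#   { time ./buildindex/buildindex parameters/build_param parameters/stop_param ; } 2> build.time
#   times_to_seconds < build.time
```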
Please note that the parameter files in the parameters directory are
only available if you downloaded Indri from this website, not if you downloaded
the software from the Lemur project page.
Running the efficiency queries
From within the Indri directory, run
time ./runquery/runquery -index=indexdir -count=20 -trecFormat=1 -queryOffset=1 06.efficiency_topics.10k.indri > indri.out
to run the efficiency queries. Report the real and user
times, as printed by the time command. The file
06.efficiency_topics.10k.indri contains the 10,000 efficiency queries
from the file 06.efficiency_topics.10k
in Indri-compatible query format. It can be generated from the official NIST
query stream by using the nist2indri.perl Perl script.
- Wumpus
Version: Wumpus 2006-05-03-TREC
Download: Wumpus 2006-05-03-TREC for Linux
Compilers tested: gcc 3.4.2, 4.0.2 (32-bit and 64-bit Linux)
Wumpus should run on every Intel-based Linux system with gcc ≥ 3.0.
It supports both 32-bit and 64-bit CPUs. In order to get a performance
baseline using Wumpus, you need to have 40 GB of free disk space and
about 512 MB of RAM.
Downloading and building Wumpus (instructions for Linux)
Download Wumpus 2006-05-03-TREC.
Then execute the following command sequence:
tar xzf wumpus-2006-05-03-TREC.tgz
cd wumpus-2006-05-03-TREC
make
Constructing an index for GOV2
From within the Wumpus directory, run
time ./bin/trec INDEX inputFile wumpus.out wumpus.log
and report the real and user times for index construction
efficiency. Here, inputFile is the name of a file that lists every file
in the GOV2 collection. It can be created by running
find path_to_gov2 -type f | sort > inputFile
where path_to_gov2 is the base directory of the GOV2 document
data. The inputFile should contain 27,204 lines, with one
filename per line.
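The file list can be checked before indexing starts. The following sketch verifies the line count and that every listed path exists; the function name check_file_list is illustrative, and the default of 27204 is simply the line count stated above.

```shell
# Sketch: verify that a file list has the expected number of lines and
# that every listed file exists.
# $1 = list file, $2 = expected line count (defaults to 27204).
check_file_list() {
    list=$1
    expected=${2:-27204}
    actual=$(wc -l < "$list" | tr -d ' ')
    if [ "$actual" -ne "$expected" ]; then
        echo "expected $expected lines, found $actual" >&2
        return 1
    fi
    while IFS= read -r f; do
        [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
    done < "$list"
}

# Example:
#   check_file_list inputFile
```

Note that inputFile itself should not be located below path_to_gov2, or find will list it as well.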
When the index construction process is done, the database/
subdirectory contains around 13 GB of data.
Running the efficiency queries
From within the Wumpus directory, run
time ./bin/trec QUERY 06.efficiency_topics.10k wumpus.out wumpus.log
to run the efficiency queries. Report the real and user
times, as printed by the time command. The file
06.efficiency_topics.10k contains the 10,000 efficiency queries
in Wumpus-compatible format,
as supplied by NIST. No conversion is necessary.
- Zettair
Version: Zettair 0.6.1
Download: Zettair 0.6.1 (for Linux, Windows, Mac OS)
Compilers tested: gcc 3.4.2, 4.0.2 (32-bit and 64-bit Linux)
Zettair should run on every Linux or Windows system with gcc ≥ 3.0.
It supports both 32-bit and 64-bit CPUs. In order to get a performance
baseline using Zettair, you need to have 60 GB of free disk space and
about 2 GB of RAM.
Downloading and building Zettair (instructions for Linux and Cygwin)
Download Zettair 0.6.1.
Then execute the following command sequence:
tar xzf zettair-0.6.1.tar.gz
cd zettair-0.6.1
./configure --prefix=$PWD
make
make install
Please note that Zettair will crash if you compile it on a 64-bit system
by following the above procedure. In order to make it run under a 64-bit
operating system, you need to make the following changes to Makefile
after running ./configure:
change the CFLAGS line to: CFLAGS = -g -O2 -m32
change the CPPFLAGS line to: CPPFLAGS = -m32
change the LDFLAGS line to: LDFLAGS = -m32
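The three Makefile changes can also be applied with sed instead of a text editor. This sketch assumes that each of the three variables is assigned on a single line starting at the beginning of the line, as is typical for a Makefile generated by configure; the function name force_32bit_flags is illustrative. Check the resulting Makefile before running make.

```shell
# Sketch: rewrite the CFLAGS, CPPFLAGS and LDFLAGS lines of a Makefile
# so that the build targets 32-bit.
# $1 = path to the Makefile generated by ./configure.
force_32bit_flags() {
    makefile=$1
    tmp="$makefile.tmp.$$"
    # Each pattern is anchored at the line start, so the CFLAGS rule
    # does not touch the CPPFLAGS line.
    sed -e 's/^CFLAGS[[:space:]]*=.*/CFLAGS = -g -O2 -m32/' \
        -e 's/^CPPFLAGS[[:space:]]*=.*/CPPFLAGS = -m32/' \
        -e 's/^LDFLAGS[[:space:]]*=.*/LDFLAGS = -m32/' \
        "$makefile" > "$tmp" && mv "$tmp" "$makefile"
}

# Example, run from within the Zettair directory after ./configure:
#   force_32bit_flags Makefile
```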
Constructing an index for GOV2
From within the Zettair directory, run
time ./bin/zet -i -t TREC < inputFile
and report the real and user times for index construction
efficiency. Here, inputFile is the name of a file that lists every file
in the GOV2 collection. It can be created by running
find path_to_gov2 -type f | sort > inputFile
where path_to_gov2 is the base directory of the GOV2 document
data. The inputFile should contain 27,204 lines, with one
filename per line. When the index construction process is done, the
Zettair directory contains around 46 GB of data.
Running the efficiency queries
From within the Zettair directory, run
time ./bin/zet_trec -f 06.efficiency_topics.10k.zettair -r zettair -n 20 -t index > zettair.out
to run the efficiency queries. Report the real and user
times, as printed by the time command. The file
06.efficiency_topics.10k.zettair contains the 10,000 efficiency queries
from the file 06.efficiency_topics.10k
in Zettair-compatible query format. It can be generated from the official NIST
query stream by using the nist2zettair.perl Perl script.
Stefan Buettcher, 2006-09-01
To report bugs on this page, please send an email to sbuettch@plg.uwaterloo.ca.