This software package learns gradient boosted regression tree ensembles with training performed in parallel with MPI. See "DOWNLOAD" to download the package, then refer to "REQUIREMENTS" and "INSTALLATION" for instructions to compile a C++ executable for training. Section "EXAMPLE" provides a simple usage example. See "USAGE" and "SCRIPTS" for a description of command line arguments and helpful Python scripts.
Version 0.9, September 2011. Address problems or report bugs to <swtyree at wustl.edu>.
The source distribution for this package may be downloaded here.
We publish our code under the updated BSD license. However, if you use it for scientific work please cite:
Stephen Tyree, Kilian Q. Weinberger, Kunal Agrawal, and Jennifer Paykin. Parallel Boosted Regression Trees for Web Search Ranking. Proceedings of the 20th international conference on World Wide Web (WWW), pages 387-396, ACM, 2011.
Parallel communication in pgbrt is supported by MPI, so this package requires a recent distribution of MPI for compilation and execution. Both OpenMPI (http://www.open-mpi.org) and MPICH2 (http://www.mcs.anl.gov/research/projects/mpich2) are supported. If MPI is not already installed on your system, please refer to either distribution for installation instructions. Either distribution will provide the two binaries required below: a compiler wrapper (e.g. mpicxx) and a parallel execution handler (e.g. mpirun).
To compile with GCC options, run 'make' in the source directory. Call 'make intel' to compile with options for the Intel ICC compiler. The resulting binary will be written to the bin/ directory.
In many cases it may be necessary to modify the makefile to point to the desired MPI compiler wrapper (typically called mpicxx and located in the MPI installation bin/ directory).
Here is a simple usage example. Navigate to the example/ directory in the distribution. The following commands will train a model, cross-validate on validation set ERR, produce a test executable, and evaluate a test set using that executable. Depending on your system, replace mpirun with the appropriate MPI execution handler.
    mpirun -np 2 ../bin/pgbrt train.dat 70 700 4 100 0.1 \
        -V valid.dat -v 30 -m > out.log
    cat out.log | python ../scripts/crossval.py 5 -r \
        | python ../scripts/compiletest.py test
    cat test.dat | ./test > test.pred
Here are the required and optional arguments to the pgbrt executable.
    mpirun -np N_PROCS pgbrt TRAIN_FILE TRAIN_SIZE N_FEATURES DEPTH N_TREES RATE [OPTIONS]

      TRAIN_FILE    training file
      TRAIN_SIZE    number of instances in training file
      N_FEATURES    number of features in data sets
      DEPTH         maximum regression tree depth
      N_TREES       number of regression trees to learn
      RATE          learning rate/stepsize

      -V, -E        validation/testing files
      -v, -e        number of instances in validation/testing files
      -m            compute and print ranking-specific metrics, ERR and NDCG@10
      -t            print timing information ("#timer EVENT ELAPSED_TIME")
      -h            show help message
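To illustrate how the N_TREES and RATE arguments interact, here is a minimal serial sketch of gradient boosting with squared loss in Python. It is not the pgbrt implementation: it fits only depth-1 trees (stumps), runs on a single process, and all function names (fit_stump, predict_stump, boost, predict) are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree by exhaustive search over splits."""
    best = None
    for f in range(len(xs[0])):
        for threshold in sorted({x[f] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[f] <= threshold]
            right = [r for x, r in zip(xs, residuals) if x[f] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lmean, rmean)
    if best is None:
        # No valid split exists: fall back to predicting the mean residual.
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    f, threshold, left_value, right_value = stump
    return left_value if x[f] <= threshold else right_value


def boost(xs, ys, n_trees, rate):
    """Learn n_trees stumps, each fit to the current residuals."""
    preds = [0.0] * len(ys)
    trees = []
    for _ in range(n_trees):
        # For squared loss the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        # Each tree's contribution is shrunk by the learning rate.
        preds = [p + rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return trees


def predict(trees, rate, x):
    return sum(rate * predict_stump(t, x) for t in trees)


if __name__ == "__main__":
    xs = [[0.0], [1.0], [2.0], [3.0]]
    ys = [0.0, 0.0, 1.0, 1.0]
    model = boost(xs, ys, n_trees=50, rate=0.5)
    print([round(predict(model, 0.5, x), 3) for x in xs])
```

A smaller rate makes each tree contribute less, so more trees are generally needed to reach the same training fit.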
Required input includes a training data set (filename, size, number of features) in SVM^rank or SVM^light format and gradient boosting parameters (tree depth, number of trees, learning rate).
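For readers unfamiliar with these formats, each line holds a label, an optional qid:N query identifier (used by SVM^rank), and sparse index:value feature pairs. The Python sketch below parses one such line into a dense feature list. The helper name parse_line is hypothetical and not part of this package, and the sketch assumes the conventional 1-based feature indices.

```python
def parse_line(line, n_features):
    """Parse one SVM^light / SVM^rank line into (label, qid, features).

    Illustrative helper only; not part of pgbrt.
    """
    # Drop a trailing "# comment", which the format permits.
    tokens = line.split("#", 1)[0].split()
    label = float(tokens[0])
    qid = None
    features = [0.0] * n_features  # features not listed default to zero
    for token in tokens[1:]:
        key, value = token.split(":", 1)
        if key == "qid":
            qid = value
        else:
            features[int(key) - 1] = float(value)  # assumed 1-based indices
    return label, qid, features


if __name__ == "__main__":
    print(parse_line("2 qid:10 1:0.5 3:1.25 # doc-7", 4))
```

Here n_features plays the same role as the N_FEATURES argument: it fixes the width of every row regardless of which features appear on a given line.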
Output alternates by line between a depth first traversal of the current regression tree and current metrics computed on data sets. Metrics lines may contain the following items (or fewer depending on command line options).
The run.py script provides a template for calling pgbrt with the required arguments and capturing the output. As provided, it is configured to train on either Yahoo! LTRC Set 1 or Microsoft LETOR Fold 1. Use one of the following calls (for Yahoo or Microsoft data, respectively) or modify the script as needed.
    python run.py PATH_TO_EXEC PATH_TO_DATA y NUM_PROCS NUM_TREES
    python run.py PATH_TO_EXEC PATH_TO_DATA m NUM_PROCS NUM_TREES
The crossval.py script supports cross-validation on validation metrics computed by pgbrt. It reads from stdin the trees and metrics output by pgbrt and prints the set of trees selected by cross-validation. Cross-validation is performed on the metric indicated by VAL_METRIC_INDEX, e.g. 3 will cross-validate on the fourth metric in the comma-separated list printed by pgbrt. Smaller metric values are assumed to be better unless the -r option is specified.
cat LOG | python crossval.py VAL_METRIC_INDEX [-r]
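The selection step can be sketched in Python as follows. This is an assumption about the general idea, not a copy of crossval.py: the helper name select_num_trees is hypothetical, it takes the metric values already extracted from the log (one per boosting iteration), and details such as tie-breaking may differ in the actual script.

```python
def select_num_trees(metric_values, larger_is_better=False):
    """Return how many trees to keep, given one validation metric per iteration.

    metric_values[i] is the validation metric after tree i + 1 was added.
    larger_is_better mirrors the effect described for the -r option.
    """
    if not metric_values:
        raise ValueError("no metric values supplied")
    best_index = 0
    for i, value in enumerate(metric_values):
        if larger_is_better:
            better = value > metric_values[best_index]
        else:
            better = value < metric_values[best_index]
        if better:
            # Strict comparison keeps the smallest ensemble on ties.
            best_index = i
    return best_index + 1


if __name__ == "__main__":
    # An error-style metric, where smaller is better.
    print(select_num_trees([1.2, 1.0, 1.1]))
    # A ranking-style metric, where larger is better.
    print(select_num_trees([0.40, 0.45, 0.47, 0.46], larger_is_better=True))
```

The selected model is then the first K trees of the log, where K is the returned value.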
The compiletest.py script compiles a lightweight C++ test executable for trees read from stdin, e.g. the output of crossval.py. The executable is written to EXEC_FILE. The default compiler command is "gcc -O3", but an alternative may be supplied as COMPILER.
    cat TREES | python compiletest.py EXEC_FILE [COMPILER]
    cat TEST_FILE | ./EXEC_FILE > PRED_FILE
The evaluate.py script computes RMSE, ERR and NDCG metrics on predictions provided in PRED_FILE using labels and queries found in TEST_FILE.
python evaluate.py TEST_FILE PRED_FILE
This script was provided by Ananth Mohan in his package rt-rank, which may be found at https://sites.google.com/site/rtranking. The script incorporates a metrics computation script originally provided by Yahoo! for the Learning to Rank Challenge. The original may be found at http://learningtorankchallenge.yahoo.com/evaluate.py.txt.
Here is the provided description: "Script to compute the NDCG and ERR scores of a submission. Labels is a list of lines containing the relevance labels (one line per query). Ranks is a list of lines with the predicted ranks (again one line per query). The first integer in a line is the rank --in the predicted ranking-- of the first document, where first refers to the order of the data file. k is the truncation level for NDCG. It returns the mean ERR and mean NDCG."
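As a rough illustration of the two ranking metrics, the Python sketch below computes ERR and NDCG@k for a single query using their common definitions. It is not the provided script: the function names are hypothetical, it takes relevance labels already sorted by predicted rank, it assumes integer grades from 0 to 4, and its handling of queries with no relevant documents (returning 0.0 for NDCG) is an assumption that may differ from evaluate.py.

```python
import math


def err(labels_in_ranked_order, max_grade=4):
    """Expected Reciprocal Rank for one query."""
    score = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(labels_in_ranked_order, start=1):
        p_satisfied = (2 ** grade - 1) / 2 ** max_grade
        score += p_reach * p_satisfied / rank
        p_reach *= 1.0 - p_satisfied
    return score


def dcg(labels, k):
    """Discounted cumulative gain over the first k positions."""
    return sum((2 ** grade - 1) / math.log2(position + 1)
               for position, grade in enumerate(labels[:k], start=1))


def ndcg(labels_in_ranked_order, k=10):
    """DCG@k divided by the DCG@k of the ideal ordering."""
    ideal = dcg(sorted(labels_in_ranked_order, reverse=True), k)
    if ideal == 0:
        return 0.0  # assumed convention when no document is relevant
    return dcg(labels_in_ranked_order, k) / ideal


if __name__ == "__main__":
    ranked_labels = [1, 3, 0, 2]  # labels in the order the model ranked them
    print(err(ranked_labels), ndcg(ranked_labels, k=10))
```

Averaging these per-query values over all queries gives the mean ERR and mean NDCG mentioned in the description.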