
pGBRT: Parallel Gradient Boosted Regression Trees

Main.Pgbrt History


July 23, 2014, at 11:35 AM by 128.252.19.138 -
Changed lines 14-19 from:
to:
'''LICENSE'''

We publish our code under the updated BSD license. However, if you use it for scientific work, please cite:

''Stephen Tyree, Kilian Q. Weinberger, Kunal Agrawal, and Jennifer Paykin. Parallel Boosted Regression Trees for Web Search Ranking. Proceedings of the 20th international conference on World Wide Web (WWW), pages 387-396, ACM, 2011.'' [[http://research.engineering.wustl.edu/~tyrees/bibtex/tyree2011parallel.bib | BIBTEX]]
Changed line 95 from:
"Script to compute the NDCG and ERR scores of a submission. Labels is a list of lines containing the relevance labels (one line per query). Ranks is a list of lines with the predicted ranks (again one line per query). The first integer in a line is the rank --in the predicted ranking-- of the first document, where first refers to the order of the data file. k is the truncation level for NDCG. It returns the mean ERR and mean NDCG."
to:
"Script to compute the NDCG and ERR scores of a submission. Labels is a list of lines containing the relevance labels (one line per query). Ranks is a list of lines with the predicted ranks (again one line per query). The first integer in a line is the rank --in the predicted ranking-- of the first document, where first refers to the order of the data file. k is the truncation level for NDCG. It returns the mean ERR and mean NDCG."
Changed line 7 from:
Version 0.9, August 2011. Address problems or report bugs to <swtyree at wustl.edu>.
to:
Version 0.9, September 2011. Address problems or report bugs to <swtyree at wustl.edu>.
Changed line 76 from:
This script compiles a lightweight C++ test executable for trees read from stdin, e.g. the output of crossval.py. The executable is written to EXEC_FILE. The default compiler command is 'gcc -O3', but an alternative may be supplied as COMPILER.
to:
This script compiles a lightweight C++ test executable for trees read from stdin, e.g. the output of crossval.py. The executable is written to EXEC_FILE. The default compiler command is "gcc -O3", but an alternative may be supplied as COMPILER.
August 30, 2011, at 07:41 PM by 172.16.20.31 -
Changed line 66 from:
This script provides a template for calling pgbrt with the required arguments and capturing the output. As provided, this script is configured to train on either Yahoo! [[http://learningtorankchallenge.yahoo.com/ | LTRC]] Set 1 or Microsoft [[http://research.microsoft.com/en-us/projects/mslr/ | LETOR]] Fold 1. Use one of the following calls (for Yahoo or Microsoft data, respectively) or modify the script as needed.
to:
This script provides a template for calling pgbrt with the required arguments and capturing the output. As provided, this script is configured to train on either [[http://learningtorankchallenge.yahoo.com/ | Yahoo! LTRC]] Set 1 or [[http://research.microsoft.com/en-us/projects/mslr/ | Microsoft LETOR]] Fold 1. Use one of the following calls (for Yahoo or Microsoft data, respectively) or modify the script as needed.
August 30, 2011, at 07:41 PM by 172.16.20.31 -
Changed line 66 from:
This script provides a template for calling pgbrt with the required arguments and capturing the output. As provided, this script is configured to train on either Yahoo! LTRC Set 1 or Microsoft LETOR Fold 1. Use one of the following calls (for Yahoo or Microsoft data, respectively) or modify the script as needed.
to:
This script provides a template for calling pgbrt with the required arguments and capturing the output. As provided, this script is configured to train on either Yahoo! [[http://learningtorankchallenge.yahoo.com/ | LTRC]] Set 1 or Microsoft [[http://research.microsoft.com/en-us/projects/mslr/ | LETOR]] Fold 1. Use one of the following calls (for Yahoo or Microsoft data, respectively) or modify the script as needed.
August 30, 2011, at 06:59 PM by 172.16.20.31 -
Changed lines 59-62 from:
[@iteration,train_rmse,train_err,train_ndcg,valid_rmse,valid_err,valid_ndcg,
test_rmse,test_err,test_ndcg@]
to:
[@iteration,train_rmse,train_err,train_ndcg,valid_rmse,valid_err,valid_ndcg,test_rmse,test_err,test_ndcg
@]
Changed lines 72-73 from:
[@cat LOG | python crossval.py VAL_METRIC_INDEX [-r]@]
to:
[@cat LOG | python crossval.py VAL_METRIC_INDEX [-r]
@]
Changed lines 82-83 from:
[@python evaluate.py TEST_FILE PRED_FILE@]
to:
[@python evaluate.py TEST_FILE PRED_FILE
@]
August 30, 2011, at 02:38 PM by 172.16.20.31 -
Added lines 1-2:
(:title pGBRT: Parallel Gradient Boosted Regression Trees :)
August 30, 2011, at 02:35 PM by 172.16.20.31 -
Changed lines 3-6 from:
This software package supports machine learning with gradient boosted regression tree ensembles. See "DOWNLOAD" for the source distribution and "INSTALLATION" for instructions to compile a C++ executable for training. Review "USAGE" and "SCRIPTS" for a description of command line arguments and helpful Python scripts. See "EXAMPLE" for a simple example usage.

Version 0.9, August 2011. Report bugs to <swtyree at wustl.edu>.
to:
This software package trains gradient boosted regression tree ensembles in parallel using MPI. See "DOWNLOAD" to download the package, then refer to "REQUIREMENTS" and "INSTALLATION" for instructions to compile a C++ executable for training. Section "EXAMPLE" provides a simple usage example. See "USAGE" and "SCRIPTS" for a description of command line arguments and helpful Python scripts.

Version 0.9, August 2011. Address problems or report bugs to <swtyree at wustl.edu>.
Added lines 12-17:

'''REQUIREMENTS'''

Parallel communication in pgbrt is supported by MPI, so this package requires a recent distribution of MPI for compilation and execution. Both OpenMPI (http://www.open-mpi.org) and MPICH2 (http://www.mcs.anl.gov/research/projects/mpich2) are supported. If MPI is not already installed on your system, please refer to either distribution for installation instructions. Either distribution provides the two binaries required below: a compiler wrapper (e.g. mpicxx) and a parallel execution handler (e.g. mpirun).
Added lines 25-34:
'''EXAMPLE'''

Here is a simple usage example. Navigate to the example/ directory in the distribution. The following commands will train a model, cross-validate on validation set ERR, produce a test executable, and evaluate a test set using that executable. Depending on your system, replace mpirun with the appropriate MPI execution handler.
[@mpirun -np 2 ../bin/pgbrt train.dat 70 700 4 100 0.1 \
-V valid.dat -v 30 -m > out.log
cat out.log | python ../scripts/crossval.py 5 -r \
| python ../scripts/compiletest.py test
cat test.dat | ./test > test.pred@]
Changed lines 39-41 from:
[@pgbrt TRAIN_FILE TRAIN_SIZE N_FEATURES DEPTH N_TREES RATE [OPTIONS]@]

[@
TRAIN_FILE training file
to:
[@mpirun -np N_PROCS pgbrt TRAIN_FILE TRAIN_SIZE N_FEATURES DEPTH N_TREES RATE [OPTIONS]

TRAIN_FILE training file
Deleted lines 86-95:


'''EXAMPLE'''

Here is a simple usage example. Navigate to the example/ directory in the distribution. The following commands will train a model, cross-validate on validation set ERR, produce a test executable, and evaluate a test set using that executable. (Alternatively the first command may be replaced by executing run.py.)
[@mpirun -np 2 ../bin/pgbrt train.dat 70 700 4 100 0.1 \
-V valid.dat -v 30 -m > out.log
cat out.log | python ../scripts/crossval.py 5 -r \
| python ../scripts/compiletest.py test
cat test.dat | ./test > test.pred@]
August 30, 2011, at 12:29 PM by 172.16.20.31 -
Changed line 3 from:
This software package supports machine learning with gradient boosted regression tree ensembles. See "INSTALLATION" for instructions to compile a C++ executable for training. See "USAGE" and "SCRIPTS" for a description of command line arguments and helpful Python scripts. See "EXAMPLE" for a simple example usage.
to:
This software package supports machine learning with gradient boosted regression tree ensembles. See "DOWNLOAD" for the source distribution and "INSTALLATION" for instructions to compile a C++ executable for training. Review "USAGE" and "SCRIPTS" for a description of command line arguments and helpful Python scripts. See "EXAMPLE" for a simple example usage.
August 30, 2011, at 12:27 PM by 172.16.20.31 -
Changed lines 40-41 from:
[@iteration,train_rmse,train_err,train_ndcg,valid_rmse,valid_err,valid_ndcg,test_rmse,test_err,test_ndcg@]
to:
[@iteration,train_rmse,train_err,train_ndcg,valid_rmse,valid_err,valid_ndcg,
test_rmse,test_err,test_ndcg@]
August 30, 2011, at 12:26 PM by 172.16.20.31 -
Changed line 9 from:
The source distribution for this package may be downloaded here.
to:
The source distribution for this package may be downloaded [[Attach:pgbrt.tar.gz | here]].
August 30, 2011, at 11:58 AM by 172.16.20.31 -
Changed line 5 from:
Version 0.9, August 2011. For additional documentation and the latest release, please visit <http://machinelearning.wustl.edu/Main/Pgbrt>. Report bugs to <swtyree at wustl.edu>.
to:
Version 0.9, August 2011. Report bugs to <swtyree at wustl.edu>.
August 30, 2011, at 11:57 AM by 172.16.20.31 -
Added lines 1-78:
'''ABOUT'''

This software package supports machine learning with gradient boosted regression tree ensembles. See "INSTALLATION" for instructions to compile a C++ executable for training. See "USAGE" and "SCRIPTS" for a description of command line arguments and helpful Python scripts. See "EXAMPLE" for a simple usage example.

Version 0.9, August 2011. For additional documentation and the latest release, please visit <http://machinelearning.wustl.edu/Main/Pgbrt>. Report bugs to <swtyree at wustl.edu>.

'''DOWNLOAD'''

The source distribution for this package may be downloaded here.

'''INSTALLATION'''

To compile with GCC options, run 'make' in the source directory. Call 'make intel' to compile with options for the Intel ICC compiler. The resulting binary will be written to the bin/ directory.

In many cases it may be necessary to modify the makefile to point to the desired MPI compiler wrapper (typically called mpicxx and located in the MPI installation bin/ directory).


'''USAGE'''

Here are the required and optional arguments to the pgbrt executable.

[@pgbrt TRAIN_FILE TRAIN_SIZE N_FEATURES DEPTH N_TREES RATE [OPTIONS]@]

[@TRAIN_FILE training file
TRAIN_SIZE number of instances in training file
N_FEATURES number of features in data sets
DEPTH maximum regression tree depth
N_TREES number of regression trees to learn
RATE learning rate/stepsize

-V, -E validation/testing files
-v, -e number of instances in validation/testing files
-m compute and print ranking-specific metrics, ERR and NDCG@10
-t print timing information ("#timer EVENT ELAPSED_TIME")
-h show help message@]

Required input includes a training data set (filename, size, number of features) in SVM^rank or SVM^light format and gradient boosting parameters (tree depth, number of trees, learning rate).
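The data format is not spelled out here beyond the SVM^rank/SVM^light reference. As a rough sketch under the usual layout for that format (one instance per line: label, then `qid:ID`, then `featureid:value` pairs — the parser and example line below are illustrative, not part of pgbrt):

```python
# Illustrative parser for one line of SVM^rank/SVM^light-style data, under the
# assumed layout "label qid:ID featureid:value ..." -- not part of the package.
def parse_line(line):
    tokens = line.split()
    label = float(tokens[0])                # relevance label
    qid = int(tokens[1].split(":")[1])      # "qid:3" -> 3
    features = {}
    for tok in tokens[2:]:
        if tok.startswith("#"):             # optional trailing comment
            break
        fid, val = tok.split(":")
        features[int(fid)] = float(val)
    return label, qid, features

row = parse_line("2 qid:1 1:0.5 3:0.25")    # (2.0, 1, {1: 0.5, 3: 0.25})
```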

Output alternates by line between a depth first traversal of the current regression tree and current metrics computed on data sets. Metrics lines may contain the following items (or fewer depending on command line options).
[@iteration,train_rmse,train_err,train_ndcg,valid_rmse,valid_err,valid_ndcg,test_rmse,test_err,test_ndcg@]
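For post-processing a log outside the provided scripts, a metrics line can be split into named fields. A minimal sketch, assuming the full column order shown above (runs without -V, -E, or -m print fewer columns, so the column list would need trimming accordingly):

```python
# Split one pgbrt metrics line into named fields, assuming the full column
# order documented above (fewer columns appear without -V/-E/-m).
COLUMNS = ["iteration",
           "train_rmse", "train_err", "train_ndcg",
           "valid_rmse", "valid_err", "valid_ndcg",
           "test_rmse", "test_err", "test_ndcg"]

def parse_metrics(line):
    values = line.strip().split(",")
    return dict(zip(COLUMNS, values))

m = parse_metrics("10,0.71,0.45,0.78,0.74,0.43,0.76,0.75,0.42,0.75")
# m["iteration"] == "10", m["valid_err"] == "0.43"
```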


'''SCRIPTS'''

''scripts/run.py'' \\
This script provides a template for calling pgbrt with the required arguments and capturing the output. As provided, this script is configured to train on either Yahoo! LTRC Set 1 or Microsoft LETOR Fold 1. Use one of the following calls (for Yahoo or Microsoft data, respectively) or modify the script as needed.
[@python run.py PATH_TO_EXEC PATH_TO_DATA y NUM_PROCS NUM_TREES
python run.py PATH_TO_EXEC PATH_TO_DATA m NUM_PROCS NUM_TREES@]

''scripts/crossval.py'' \\
This script supports cross-validation on validation metrics computed by pgbrt. The script reads from stdin the trees and metrics output by pgbrt. It prints the set of trees selected by cross-validation. Cross-validation is performed on the metric indicated by VAL_METRIC_INDEX, e.g. 3 will cross-validate on the fourth metric in the comma-separated list printed by pgbrt. It is assumed that smaller metric values are better unless the -r option is specified.
[@cat LOG | python crossval.py VAL_METRIC_INDEX [-r]@]
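The selection rule described above can be sketched as follows. This is a hypothetical reimplementation for illustration, not crossval.py itself; it assumes comma-separated metrics lines with the iteration number first and VAL_METRIC_INDEX indexing the line zero-based, as in the description.

```python
# Hypothetical sketch of the crossval.py selection rule: scan the metrics
# lines and return the iteration whose VAL_METRIC_INDEX-th value is best
# (smallest by default, largest when -r is given).
def best_iteration(metric_lines, val_metric_index, higher_is_better=False):
    best_iter, best_val = None, None
    for line in metric_lines:
        fields = line.strip().split(",")
        iteration = int(fields[0])
        value = float(fields[val_metric_index])
        if (best_val is None
                or (higher_is_better and value > best_val)
                or (not higher_is_better and value < best_val)):
            best_iter, best_val = iteration, value
    return best_iter

lines = ["1,0.9,0.40", "2,0.8,0.46", "3,0.7,0.44"]
best = best_iteration(lines, 2, higher_is_better=True)  # iteration 2
```

In the example elsewhere in this document, `crossval.py 5 -r` would, under this indexing and the column order printed with -V and -m, select on valid_err with larger values treated as better.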

''scripts/compiletest.py'' \\
This script compiles a lightweight C++ test executable for trees read from stdin, e.g. the output of crossval.py. The executable is written to EXEC_FILE. The default compiler command is 'gcc -O3', but an alternative may be supplied as COMPILER.
[@cat TREES | python compiletest.py EXEC_FILE [COMPILER]
cat TEST_FILE | ./EXEC_FILE > PRED_FILE@]

''scripts/evaluate.py'' \\
This script computes RMSE, ERR and NDCG metrics on predictions provided in PRED_FILE using labels and queries found in TEST_FILE.
[@python evaluate.py TEST_FILE PRED_FILE@]
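Of those metrics, RMSE is simple to state. A minimal sketch of the computation (illustrative, not the evaluate.py implementation), given the true labels and one prediction per line of PRED_FILE:

```python
import math

# Illustrative RMSE between true labels and predictions -- not evaluate.py.
def rmse(labels, preds):
    assert len(labels) == len(preds) and labels
    total = sum((l - p) ** 2 for l, p in zip(labels, preds))
    return math.sqrt(total / len(labels))
```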

''scripts/evtools.py'' \\
This script was provided by Ananth Mohan in his package rt-rank, which may be found at https://sites.google.com/site/rtranking.
The script incorporates a metrics computation script originally provided by Yahoo! for the Learning to Rank Challenge. The original may be found at http://learningtorankchallenge.yahoo.com/evaluate.py.txt.

Here is the provided description:
"Script to compute the NDCG and ERR scores of a submission. Labels is a list of lines containing the relevance labels (one line per query). Ranks is a list of lines with the predicted ranks (again one line per query). The first integer in a line is the rank --in the predicted ranking-- of the first document, where first refers to the order of the data file. k is the truncation level for NDCG. It returns the mean ERR and mean NDCG."


'''EXAMPLE'''

Here is a simple usage example. Navigate to the example/ directory in the distribution. The following commands will train a model, cross-validate on validation set ERR, produce a test executable, and evaluate a test set using that executable. (Alternatively, the first command may be replaced by executing run.py.)
[@mpirun -np 2 ../bin/pgbrt train.dat 70 700 4 100 0.1 \
-V valid.dat -v 30 -m > out.log
cat out.log | python ../scripts/crossval.py 5 -r \
| python ../scripts/compiletest.py test
cat test.dat | ./test > test.pred@]