Biowulf High Performance Computing at the NIH
CANDLE on Biowulf

CANDLE (CANcer Distributed Learning Environment) is an open-source software platform providing deep learning methodologies that scale very efficiently on the world’s fastest supercomputers. Developed initially to address three top challenges facing the cancer community, CANDLE can increasingly be used to tackle problems in other application areas. The SDSI team at the Frederick National Laboratory for Cancer Research, sponsored by the National Cancer Institute, has recently installed CANDLE on NIH’s Biowulf supercomputer for all to use.

One of CANDLE's strongest attributes is its functionality for performing hyperparameter optimization (HPO). In a machine/deep learning model, "hyperparameters" refer to any variables that define the model aside from the model’s "weights." For a given set of hyperparameters (typically 5-20), the corresponding model’s weights (typically tens of thousands) are iteratively optimized using algorithms such as gradient descent. Such optimization of the model’s weights – a process called "training" – is typically run very efficiently on graphics processing units (GPUs) and typically takes 30 minutes to a couple of days.

If a measure of loss is assigned to each model trained on the same set of data, we would ultimately like to choose the model (i.e., the set of hyperparameters) that best fits that dataset by minimizing the loss. HPO is the process of choosing this best set of hyperparameters. The most common way of determining the optimal set of hyperparameters is to run one training job for every desired combination of hyperparameters and choose the combination that produces the lowest loss. In CANDLE, such a workflow is labeled "grid" (in other contexts it is called "grid search"). Another way of determining the optimal set of hyperparameters is to use a Bayesian approach, in which information about how well prior sets of hyperparameters performed is used to select the next sets of hyperparameters to try. In CANDLE, this type of workflow is labeled "bayesian".

HPO need not be used only for machine/deep learning applications; it can be applied to any computational pipeline that can be parametrized by a number of settings. With ever-increasing amounts of data, applications like these, in addition to machine/deep learning applications, are growing at NCI and in the greater NIH community. Performing HPO yields better models for describing relationships in data, and the better the model, the more accurate the predictions that can be made on new sets of data. CANDLE is here to help with this, and this webpage serves as a complete guide to running CANDLE on Biowulf.

Link: Presentation on how to use CANDLE on Biowulf on 7/18/19 as part of the NIH.AI seminar series

Why Use CANDLE?

Why use CANDLE in the first place? For example, why not just submit a swarm of jobs, each using a different set of hyperparameters? Unlike an ordinary job swarm, CANDLE can run adaptive Bayesian searches (in which each new set of hyperparameters is chosen based on how well prior sets performed), aggregate the results of all runs into a single file, and restart incomplete experiments; these features are detailed in the sections below.

Quick Start

These steps will get you running a sample CANDLE job on Biowulf right away!

Step 1: Set up your environment

Once logged in to Biowulf, set up your environment by creating and entering a working directory in your /data/$USER (not /home/$USER) directory and loading the candle module:

[user@biowulf]$ mkdir /data/$USER/candle
[user@biowulf]$ cd /data/$USER/candle
[user@biowulf]$ module load candle

Step 2: Copy a template submission script to the working directory

Copy one of the three CANDLE templates (sets of files) to the working directory:

[user@biowulf]$ candle import-template <TEMPLATE>

Possible values of <TEMPLATE> are:

grid
Grid search using a Python model (simple deep neural network on the MNIST dataset; ~3 min. total)
bayesian
Bayesian search using a Python model (one of the JDACS4C Pilot 1 models, a 1D convolutional network for classifying RNA-seq gene expression profiles into normal or tumor tissue categories; ~24 min. total)
r
Grid search using an R model (feature reduction on the TNBC dataset; ~6 min. total)

Step 3: Run the job

Submit the job by running:

[user@biowulf]$ candle submit-job submit_candle_job.sh

Summary of How to Use CANDLE

This section contains a summary of steps for running your own CANDLE job, which are detailed in the following sections.

Adapting Your Model to Work With CANDLE

You can run a CANDLE job (i.e., a hyperparameter optimization) on your own machine/deep learning model or general workflow (call it generally a "model script") by performing two minimal modifications to your model script. Note: Currently CANDLE accepts model scripts written only in Python or R.

Note: Prior to adapting your model script for use with CANDLE, you must ensure it runs standalone on a Biowulf compute node. This can be tested by requesting an interactive GPU node (e.g., sinteractive --gres=gpu:k20x:1 --mem=60G --cpus-per-task=16) and then running the model directly, e.g., python my_model.py or Rscript my_model.R; don’t forget to use the correct version of Python or R, if required!

Once you have confirmed your model script runs on Biowulf, modify it in two simple ways:

Step 1: Specify the hyperparameters

Specify the hyperparameters in your code using a variable named hyperparams, which should be a dictionary in Python or a data.frame in R. E.g., in Python, if your model script my_model.py contains

n_convolutional_layers = 4
batch_size = 128

but these are parameters that you'd like to change during the CANDLE workflow, you should change those lines to

n_convolutional_layers = hyperparams['nconv_layers']
batch_size = hyperparams['batch_size']

Note: The "keys" in the hyperparams dictionary should match the variable names in the DEFAULT_PARAMS_FILE and WORKFLOW_SETTINGS_FILE files (see the following sections), whereas the variables to which they are assigned should match the names used in the rest of the model script.

Likewise, in R, if your model script my_model.R contains

n_convolutional_layers <- 4
batch_size <- 128

you should change those lines to

n_convolutional_layers <- hyperparams[["nconv_layers"]]
batch_size <- hyperparams[["batch_size"]]

Step 2: Define the metric you would like to minimize

If your model is written in Python, either define a Keras history object named history (such as the return value of a model.fit() call; the validation loss will be minimized), e.g.,

history = model.fit(x_train, y_train, ...)

or define a single number named val_to_return that contains the metric you would like to minimize, e.g.,

score = model.evaluate(x_test, y_test)
val_to_return = score[0]

If your model is written in R, define a single number named val_to_return that contains the metric you would like to minimize, e.g.,

val_to_return <- my_validation_loss
Note on minimization metric

Only the bayesian workflow actually uses the minimization metric: by definition, it needs a measure of how well prior sets of hyperparameters performed in order to determine the next sets of hyperparameters to try. The grid workflow, by contrast, runs training on all sets of hyperparameters regardless of how well prior sets performed, so it never actually uses the minimization metric. However, the val_to_return variable (or the history object in Python) is always required, so when running the grid workflow and you don't care to return any particular result from your model script, simply set it to a dummy value such as -7.

Typical physical values assigned to val_to_return include the training, testing, or validation loss (for a machine/deep learning model) or the workflow runtime (for optimizing workflow runtimes as in, e.g., benchmarking).
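To make these two steps concrete, below is a minimal sketch of a fully adapted Python model script (a hypothetical my_model.py; it assumes Keras/TensorFlow is available in the Python environment used to run the model). Note that the hyperparams dictionary is not defined in the script itself; CANDLE supplies it at runtime.

# my_model.py: a hypothetical, minimal model script adapted for CANDLE.
# The hyperparams dictionary is not defined here; CANDLE supplies it at runtime.
from tensorflow import keras

# Step 1: pull the tunable settings from the hyperparams dictionary
batch_size = hyperparams['batch_size']
epochs = hyperparams['epochs']

# Build and train a small dense network on the MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

# Step 2, option A: keep a Keras history object named history;
# the validation loss it records will be minimized
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                    validation_split=0.1)

# Step 2, option B: alternatively, set a single number named val_to_return
# val_to_return = model.evaluate(x_test, y_test)  # test loss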

Creating Settings Files

There are three files you need to create in order to use CANDLE to run your own model script:

  1. A file containing the default hyperparameter settings, e.g., default_params.txt
  2. A file specifying how you would like to vary the hyperparameters in order to run your model script with different settings, e.g., grid_workflow-1.txt
  3. A CANDLE "submission" script, e.g., submit_candle_job.sh

However, rather than creating these files from scratch, the easiest way to use CANDLE for your own work is to adapt one of the templates above to your use case. Feel free to run candle import-template <TEMPLATE> with different <TEMPLATE> settings and examine the files that are copied over in order to better understand what they do and the types of settings they can contain.

File 1: Default hyperparameters file

This is a .txt file containing the default hyperparameter settings, some or all of whose values will be overwritten by those specified in the workflow settings file, below. It has the format of a Python configparser configuration file, e.g.:

[Global Params]
epochs = 20
batch_size = 128
activation = 'relu'
optimizer = 'rmsprop'
num_filters = 32

Every hyperparameter specified in the model script must have a default setting specified here.
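Since the file follows the configparser format, a quick way to sanity-check its syntax is to parse it yourself, e.g., with the following minimal Python sketch (illustrative only; it assumes the file is named default_params.txt in the current directory):

# check_params.py: sanity-check a default hyperparameters file using
# Python's standard configparser module (illustrative only)
import configparser

config = configparser.ConfigParser()
config.read('default_params.txt')

# Print every default hyperparameter in the [Global Params] section
for key, value in config['Global Params'].items():
    print(key, '=', value)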

File 2: Workflow settings file

This is a .txt file (grid workflow) or a .R file (bayesian workflow) specifying how some or all of the hyperparameters defined in the model script (and in the default hyperparameters file) are to be varied during the HPO workflow. The filename MUST begin with <WORKFLOW_TYPE>_workflow-, where <WORKFLOW_TYPE> is grid or bayesian.

Note: Python’s False, True, and None should be replaced by JSON’s false, true, and null inside this file.

grid workflow

In the workflow settings file for the grid workflow, each line must be a JSON string specifying the values of the hyperparameters to use in each job, and each string must contain an id key containing a unique name for the hyperparameter set, e.g.:

{"id": "hpset_01", "epochs": 15, "activation": "tanh"}
{"id": "hpset_02", "epochs": 30, "activation": "tanh"}
{"id": "hpset_03", "epochs": 15, "activation": "relu"}
{"id": "hpset_04", "epochs": 30, "activation": "relu"}
{"id": "hpset_05", "epochs": 10, "batch_size": 128}
{"id": "hpset_06", "epochs": 10, "batch_size": 256}
{"id": "hpset_07", "epochs": 10, "batch_size": 512}

Note: This example implies that the epochs, activation, and batch_size hyperparameters must be defined in the default hyperparameters file. It further shows that the full "grid" of values need not be run in the grid workflow; in fact, you can customize by hand every set of hyperparameter values that you'd like to run.

Alternatively, you can use the generate-grid candle command to create a file called grid_workflow-XXXX.txt containing a full "grid" of hyperparameters. The usage is candle generate-grid <PYTHON-LIST-1> <PYTHON-LIST-2> ..., where each <PYTHON-LIST> is a Python list whose first element is a string containing the hyperparameter name and whose second element is an iterable of hyperparameter values (numpy functions can be accessed using the np variable). For example, running

[user@biowulf]$ candle generate-grid "['nlayers',np.arange(5,15,2)]" "['dir',['x','y','z']]"

will create a file called grid_workflow-XXXX.txt with the contents

{"id": "hpset_00001", "nlayers": 5, "dir": "x"}
{"id": "hpset_00002", "nlayers": 5, "dir": "y"}
{"id": "hpset_00003", "nlayers": 5, "dir": "z"}
{"id": "hpset_00004", "nlayers": 7, "dir": "x"}
{"id": "hpset_00005", "nlayers": 7, "dir": "y"}
{"id": "hpset_00006", "nlayers": 7, "dir": "z"}
{"id": "hpset_00007", "nlayers": 9, "dir": "x"}
{"id": "hpset_00008", "nlayers": 9, "dir": "y"}
{"id": "hpset_00009", "nlayers": 9, "dir": "z"}
{"id": "hpset_00010", "nlayers": 11, "dir": "x"}
{"id": "hpset_00011", "nlayers": 11, "dir": "y"}
{"id": "hpset_00012", "nlayers": 11, "dir": "z"}
{"id": "hpset_00013", "nlayers": 13, "dir": "x"}
{"id": "hpset_00014", "nlayers": 13, "dir": "y"}
{"id": "hpset_00015", "nlayers": 13, "dir": "z"}

Note: The candle module must be loaded in order to run any of the candle commands such as generate-grid. Also, feel free to overwrite the "XXXX" in the filename that's generated.

A more complete example producing a 600-line file (600 sets of hyperparameters) is

[user@biowulf]$ candle generate-grid "['john',np.arange(5,15,2)]" "['single_num',[4]]" "['letter',['x','y','z']]" "['arr',[[2,2],None,[2,2,2],[2,2,2,2]]]" "['smith',np.arange(-1,1,0.2)]"

No spaces can be present in any of the arguments to the generate-grid command.
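If you'd like to see conceptually what generate-grid does, or to generate a workflow settings file with your own script, the following minimal Python sketch (an illustration, not the actual implementation of generate-grid) produces the same kind of JSON-lines output as the first example above:

# make_grid.py: an illustrative equivalent of candle generate-grid
# (not the actual implementation), producing a JSON-lines grid file
import itertools
import json

import numpy as np

# Hyperparameter names and the values each can take on
hyperparams = [
    ('nlayers', np.arange(5, 15, 2)),
    ('dir', ['x', 'y', 'z']),
]

names = [name for name, _ in hyperparams]
value_lists = [values for _, values in hyperparams]

with open('grid_workflow-custom.txt', 'w') as f:
    # itertools.product varies the last hyperparameter fastest,
    # matching the ordering shown in the example above
    for i, combo in enumerate(itertools.product(*value_lists), start=1):
        line = {'id': 'hpset_%05d' % i}
        for name, value in zip(names, combo):
            # np.arange yields numpy scalars; convert them for JSON serialization
            line[name] = value.item() if isinstance(value, np.generic) else value
        f.write(json.dumps(line) + '\n')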

bayesian workflow

The workflow settings file for the bayesian workflow must contain a variable named param.set that uses the makeParamSet function in the ParamHelpers R package to return a list of parameters. Each argument to makeParamSet() is a special constructor function defining the values that each hyperparameter can take on during the HPO workflow. For example, the bayesian template contains a file called bayesian_workflow-nt3_nightly.R with the contents

param.set <- makeParamSet(
  makeDiscreteParam("batch_size", values = c(16, 32)),
  makeIntegerParam("epochs", lower = 2, upper = 5),
  makeDiscreteParam("optimizer", values = c("adam", "sgd", "rmsprop", "adagrad", "adadelta")),
  makeNumericParam("drop", lower = 0, upper = 0.9),
  makeNumericParam("learning_rate", lower = 0.00001, upper = 0.1)
)

which defines the possible values that the hyperparameters batch_size, epochs, optimizer, drop, and learning_rate can take on during the running of the bayesian workflow. Please see the Param help page for individual usage of each type of constructor function.

Please see the mlrMBO package documentation for more information on how the bayesian workflow works in CANDLE.

File 3: CANDLE "submission" script

You only need to modify six settings inside the submission script. All variables should be preceded by an export command. Please use full pathnames (e.g., /path/to/file.py instead of file.py) and examine the sample settings below to better understand their meaning.

Required variables
MODEL_SCRIPT
This should point to the Python or R script that you would like to run. E.g., export MODEL_SCRIPT="/data/$USER/candle/mnist.py". This script must have been adapted to work with CANDLE (see the previous section). The filename extension will automatically determine whether Python or R will be used to run the model.
DEFAULT_PARAMS_FILE
Default settings for the hyperparameters defined in the model. E.g., export DEFAULT_PARAMS_FILE="/data/$USER/candle/mnist_default_params.txt".
WORKFLOW_SETTINGS_FILE
This file contains the settings parametrizing the workflow you would like to run. E.g., export WORKFLOW_SETTINGS_FILE="/data/$USER/candle/grid_workflow-mnist.txt". Again, the filename MUST begin with <WORKFLOW_TYPE>_workflow-, where <WORKFLOW_TYPE> is grid (.txt file) or bayesian (.R file).
NGPUS
Number of GPUs you would like to use for the CANDLE job. E.g., export NGPUS=2. Note: One (grid workflow) or two (bayesian workflow) extra GPUs will be allocated in order to run background processes.
GPU_TYPE
Type of GPU you would like to use. E.g., export GPU_TYPE="k80". The choices on Biowulf are k20x, k80, p100, and v100.
WALLTIME
How long you would like your job to run (the wall time of your entire job including all hyperparameter sets). E.g., export WALLTIME="00:20:00". Format is HH:MM:SS. When in doubt, round up so that the job is most likely to complete (if it doesn't, use the RESTART_FROM_EXP setting, below).
Optional variables

Python models only

PYTHON_BIN_PATH
If you don’t want to use the Python version with which CANDLE was built (currently python/3.6), you can set this to the location of the Python binary you would like to use. Examples:
export PYTHON_BIN_PATH="$CONDA_PREFIX/envs/<YOUR_CONDA_ENVIRONMENT_NAME>/bin"
export PYTHON_BIN_PATH="/data/BIDS-HPC/public/software/conda/envs/main3.6/bin"
If set, it will override the setting of EXEC_PYTHON_MODULE, below.
EXEC_PYTHON_MODULE
If you’d prefer loading a module rather than specifying the path to the Python binary (above), set this to the name of the Python module you would like to load. E.g., export EXEC_PYTHON_MODULE="python/2.7". This setting will have no effect if PYTHON_BIN_PATH (above) is set. If neither PYTHON_BIN_PATH nor EXEC_PYTHON_MODULE is set, then the version of Python with which CANDLE was built (currently python/3.6) will be used.
SUPP_PYTHONPATH
This is a supplementary setting of the PYTHONPATH variable that will be searched for libraries that can’t otherwise be found. Examples:
export SUPP_PYTHONPATH="/data/BIDS-HPC/public/software/conda/envs/main3.6/lib/python3.6/site-packages"
export SUPP_PYTHONPATH="/data/$USER/conda/envs/my_conda_env/lib/python3.6/site-packages"

Note: Multiple paths can be set by separating them with a colon.

R models only

EXEC_R_MODULE
If you don’t want to use the R version with which CANDLE was built (currently R/3.5.0), set this to the name of the R module you would like to load. E.g., export EXEC_R_MODULE="R/3.6".
SUPP_R_LIBS
This is a supplementary setting of the R_LIBS variable that will be searched for libraries that can’t otherwise be found. E.g., export SUPP_R_LIBS="/data/BIDS-HPC/public/software/R/3.6/library". Note: R will search your standard library location on Biowulf (~/R/%v/library), so feel free to just install your own R libraries there.

Models written in either language

SUPP_MODULES
Modules you would like to have loaded while your model is run. E.g., export SUPP_MODULES="CUDA/10.0 cuDNN/7.5/CUDA-10.0" (these particular example settings are necessary for running TensorFlow when using a custom Conda installation).
EXTRA_SCRIPT_ARGS
Command-line arguments you’d like to include when invoking python or Rscript. E.g., for R model scripts, export EXTRA_SCRIPT_ARGS="--max-ppsize=100000". In other words, the model will ultimately be run like python $EXTRA_SCRIPT_ARGS my_model.py or Rscript $EXTRA_SCRIPT_ARGS my_model.R.
RESTART_FROM_EXP
If a grid workflow was run previously but for whatever reason did not complete (such as a too-low setting of WALLTIME), here you can specify the name of the experiment from which to resume. E.g., export RESTART_FROM_EXP="X002".
USE_CANDLE
Whether to use CANDLE to run a workflow (1, default) or to simply run the model on the default set of hyperparameters specified by DEFAULT_PARAMS_FILE (0). E.g., export USE_CANDLE=0. If set to 0, use an interactive node (e.g., sinteractive --gres=gpu:k20x:1 --mem=60G --cpus-per-task=16); rather than submitting a job to the batch queue, the job will run on the current node.

bayesian workflow only

DESIGN_SIZE
Total number of points to sample within the hyperparameter space prior to running the mlrMBO algorithm. E.g., export DESIGN_SIZE=9 (default 10). Note that this must be greater than or equal to the largest number of possible values for any discrete hyperparameter specified in the workflow settings file.
PROPOSE_POINTS
Number of points proposed and evaluated in each MBO iteration. E.g., export PROPOSE_POINTS=9 (default 10).
MAX_BUDGET
Maximum total number of function evaluations for all iterations combined. E.g., export MAX_BUDGET=180 (default 110).
MAX_ITERATIONS
Maximum number of sequential optimization steps. E.g., export MAX_ITERATIONS=3 (default 10).


The submission script should begin with #!/bin/bash, end with $CANDLE/wrappers/templates/scripts/run_workflows.sh, and should be called like

[user@biowulf]$ candle submit-job <SUBMISSION-SCRIPT>

Note: When USE_CANDLE=1 (the default behavior), the submission script will automatically request an sbatch job; the script should not be called using sbatch and should always be run from the Biowulf login node (as opposed to an interactive [compute] node). When USE_CANDLE=0, the submission script should be called the same way, but from an interactive node; this is the best way to test your job (with the default hyperparameter settings) without actually running a CANDLE workflow.
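Putting the pieces together, a minimal submission script (a sketch assembled from the sample settings above; adjust the paths and values for your own job) might look like:

#!/bin/bash

# Required settings (see the descriptions above)
export MODEL_SCRIPT="/data/$USER/candle/mnist.py"
export DEFAULT_PARAMS_FILE="/data/$USER/candle/mnist_default_params.txt"
export WORKFLOW_SETTINGS_FILE="/data/$USER/candle/grid_workflow-mnist.txt"
export NGPUS=2
export GPU_TYPE="k80"
export WALLTIME="00:20:00"

# Required last line: hand off to the CANDLE workflow runner
$CANDLE/wrappers/templates/scripts/run_workflows.sh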

Aggregating CANDLE Job Results

After a CANDLE job is complete, the results of all jobs run on each set of hyperparameters will be placed in a subdirectory of the experiments directory, which is created in the directory from which the job was submitted. A symbolic link called last-exp at the same level as the experiments directory will point to the latest experiment that was run.

Inside each experiment subdirectory is a run directory, which contains one subdirectory per hyperparameter set holding the results of the model script run with that hyperparameter set.

For example, here is a sample directory structure expanding one of the CANDLE experiments directories (X002):

.
├── experiments
│   ├── X000
│   ├── X001
│   └── X002
│       ├── cfg-sys-biowulf.sh
│       ├── grid_workflow-mnist.txt
│       ├── jobid.txt
│       ├── metadata.json
│       ├── output.txt
│       ├── run
│       │   ├── hpset_01
│       │   ├── hpset_02
│       │   ├── hpset_03
│       │   ├── hpset_04
│       │   ├── hpset_05
│       │   ├── hpset_06
│       │   └── hpset_07
│       ├── submit.sh
│       ├── turbine.log
│       ├── turbine-slurm.sh
│       ├── workflow.sh.log
│       └── workflow.tic
├── last-exp -> /data/doeja/candle/experiments/X002
└── submit_candle_job.sh

In order to collect the values of all the hyperparameter sets as well as the resulting metric for each set, run the aggregate-results candle command:

[user@biowulf]$ candle aggregate-results <EXP-DIR> [<RESULT-FORMAT>]

where <EXP-DIR> is the experiment directory, i.e., the one containing the run directory, and <RESULT-FORMAT> is an optional printf()-style format string specifying the output format for the metric. For example, if the r template/example were run inside the /data/$USER/candle directory, then running

[user@biowulf]$ candle aggregate-results /data/$USER/candle/last-exp

would produce a file called candle_results.csv in the current directory containing the data from all the jobs, sorted by increasing metric value, e.g.,

result,dirname,id,mincorr,maxcorr,number_cv,extfolds
000.796,hpset_00001,hpset_00001,0.200000,0.80,2,5
000.796,hpset_00004,hpset_00004,0.200000,0.80,5,5
000.837,hpset_00002,hpset_00002,0.200000,0.80,3,5
000.878,hpset_00003,hpset_00003,0.200000,0.80,4,5
000.905,hpset_00007,hpset_00007,0.200000,0.80,8,5
000.964,hpset_00005,hpset_00005,0.200000,0.80,6,5
000.964,hpset_00006,hpset_00006,0.200000,0.80,7,5
001.000,hpset_00008,hpset_00008,0.200000,0.80,9,5

This file can be further processed using Excel or any other method in order to study the results of the HPO.
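For example, here is a minimal Python sketch (assuming the pandas library is available in your environment) that loads the aggregated results and prints the best-performing hyperparameter set:

# inspect_results.py: a minimal sketch for examining candle_results.csv
# (assumes the pandas library is available)
import pandas as pd

df = pd.read_csv('candle_results.csv')

# The file is already sorted by increasing metric value, so the first
# row holds the best (lowest-metric) hyperparameter set
best = df.iloc[0]
print('Best hyperparameter set:', best['id'])
print(best)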

Note: Since the field names (the header line in the CSV output above) are extracted only once, the results of the aggregate-results command will not make sense if the hyperparameters that are varied are not the same for every hyperparameter set run using the grid workflow. For example, running this command on the results of the grid template/example will produce

result,dirname,id,epochs,activation
000.064,hpset_01,hpset_01,15,tanh
000.066,hpset_07,hpset_07,10,512
000.074,hpset_06,hpset_06,10,256
000.080,hpset_02,hpset_02,30,tanh
000.081,hpset_05,hpset_05,10,128
000.098,hpset_03,hpset_03,15,relu
000.121,hpset_04,hpset_04,30,relu

As usual, the full pathname must be used for <EXP-DIR>.

Summary of candle Commands

As long as the candle module is loaded (module load candle), the commands available to the candle program (in the format candle <COMMAND> <COMMAND-ARG-1> <COMMAND-ARG-2> ...) are as follows:

candle import-template <TEMPLATE>
Copy CANDLE template files to the current directory
candle generate-grid <PYTHON-LIST-1> <PYTHON-LIST-2> ...
Generate a hyperparameter grid for the grid search workflow
candle submit-job <SUBMISSION-SCRIPT>
Submit a CANDLE submission script
candle aggregate-results <EXP-DIR> [<RESULT-FORMAT>]
Create a CSV file called candle_results.csv containing the hyperparameters and corresponding evaluation metrics

Note: Leaving <COMMAND> blank or setting it to help will display this usage menu.

Promoting CANDLE and Your Work

If you've successfully used CANDLE to advance your work and you're willing to tell us about it, please email the SDSI team to tell us what you've done! We'd love to learn how users are using CANDLE to address their needs so that we can continue to improve CANDLE and its implementation on Biowulf.

Further, if you're willing to have your work promoted online, please include a representative graphic of your work, and upon review we'll post it here as an exemplar CANDLE success story. More exposure for you, more exposure for us!

Or, if you've unsuccessfully used CANDLE to advance your work, we'd love to help you out; please let us know what didn't work for you!

Contact Information

Feel free to email the SDSI team with any questions, comments, or suggestions.

For FAQ (coming soon), notices, links, and updates, please go to https://cbiit.github.com/sdsi/candle.

Finally, our team has expertise in building machine/deep learning models for a variety of situations (e.g., image segmentation, classification from RNA-seq data, etc.) and would be happy to help you build a model (independent of CANDLE) or point you in the right direction. (And, we are happy to collaborate!)