Biowulf High Performance Computing at the NIH
CANDLE on Biowulf

CANDLE (CANcer Distributed Learning Environment) is an open-source software platform providing deep learning methodologies that scale very efficiently on the world’s fastest supercomputers. Developed initially to address three top challenges facing the cancer community, CANDLE can increasingly be used to tackle problems in other application areas. The SDSI team at the Frederick National Laboratory for Cancer Research, sponsored by the National Cancer Institute, has recently installed CANDLE on NIH’s Biowulf supercomputer for all to use.

One of CANDLE's strongest attributes is its functionality for performing hyperparameter optimization (HPO). In a machine/deep learning model, "hyperparameters" refer to any variables that define the model aside from the model’s "weights." For a given set of hyperparameters (typically 5-20), the corresponding model’s weights (typically tens of thousands) are iteratively optimized using algorithms such as gradient descent. Such optimization of the model’s weights – a process called "training" – is typically run very efficiently on graphics processing units (GPUs) and typically takes 30 minutes to a couple of days.

If a measure of loss is assigned to each model trained on the same set of data, we would ultimately like to choose the model (i.e., the set of hyperparameters) that best fits that dataset by minimizing the loss. HPO is the process of choosing this best set of hyperparameters. The most common way of determining the optimal set of hyperparameters is to run one training job for every desired combination of hyperparameters and choose the combination that produces the lowest loss. In CANDLE, such a workflow is labeled "grid" (in other contexts it is called "grid search"). Another way of determining the optimal set of hyperparameters is to use a Bayesian approach, in which information about how well prior sets of hyperparameters performed is used to select the next sets of hyperparameters to try. In CANDLE, this type of workflow is labeled "bayesian".

HPO need not be used only for machine/deep learning applications; it can be applied to any computational pipeline that can be parametrized by a number of settings. With ever-increasing amounts of data, applications like these, in addition to machine/deep learning applications, are growing at NCI and in the greater NIH community. Performing HPO yields better models for describing relationships in data, and the better the model, the more accurate the predictions that can be made on new sets of data. CANDLE is here to help with this, and this webpage serves as a complete guide to running CANDLE on Biowulf.

Link: Presentation on how to use CANDLE on Biowulf on 7/18/19 as part of the NIH.AI seminar series

Why Use CANDLE?

Why use CANDLE in the first place? For example, why not just submit a swarm of jobs, each using a different set of hyperparameters? Unlike an ordinary job swarm, CANDLE can run adaptive Bayesian searches (in which each new set of hyperparameters is chosen based on how well prior sets performed), aggregate the results of all runs into a single file, and restart incomplete experiments; these features are detailed in the sections below.

Quick Start

These steps will get you running a sample CANDLE job on Biowulf right away!

Step 1: Set up your environment

Once logged in to Biowulf, set up your environment by creating and entering a working directory in your /data/$USER (not /home/$USER) directory and loading the candle module:

[user@biowulf]$ mkdir /data/$USER/candle
[user@biowulf]$ cd /data/$USER/candle
[user@biowulf]$ module load candle

Step 2: Copy a template submission script to the working directory

Copy one of the three CANDLE templates (sets of files) to the working directory:

[user@biowulf]$ candle import-template <TEMPLATE>

Possible values of <TEMPLATE> are:

grid
Grid search using a Python model (simple deep neural network on the MNIST dataset; ~3 min. total)
bayesian
Bayesian search using a Python model (one of the JDACS4C Pilot 1 models, a 1D convolutional network for classifying RNA-seq gene expression profiles into normal or tumor tissue categories; ~24 min. total)
r
Grid search using an R model (feature reduction on the TNBC dataset; ~6 min. total)

Step 3: Run the job

Submit the job by running:

[user@biowulf]$ candle submit-job submit_candle_job.sh

Summary of How to Use CANDLE

This section contains a summary of steps for running your own CANDLE job, which are detailed in the following sections.

Adapting Your Model to Work With CANDLE

You can run a CANDLE job (i.e., a hyperparameter optimization) on your own machine/deep learning model or general workflow (call it generally a "model script") by performing two minimal modifications to your model script. Note: Currently CANDLE accepts model scripts written only in Python or R.

Note: Prior to adapting your model script for use with CANDLE, you must ensure it runs standalone on a Biowulf compute node. This can be tested by requesting an interactive GPU node (e.g., sinteractive --gres=gpu:k20x:1 --mem=60G --cpus-per-task=16) and then running the model directly, e.g., python my_model.py or Rscript my_model.R; don’t forget to use the correct version of Python or R, if required!

Once you have confirmed your model script runs on Biowulf, modify it in two simple ways:

Step 1: Specify the hyperparameters

Specify the hyperparameters in your code using a variable named hyperparams, which should be a dictionary in Python or a data.frame in R. E.g., in Python, if your model script my_model.py contains

n_convolutional_layers = 4
batch_size = 128

but these are parameters that you'd like to change during the CANDLE workflow, you should change those lines to

n_convolutional_layers = hyperparams['nconv_layers']
batch_size = hyperparams['batch_size']

Note: The "keys" in the hyperparams dictionary should match the variable names in the DEFAULT_PARAMS_FILE and WORKFLOW_SETTINGS_FILE files (see the following sections), whereas the variables to which they are assigned should match the names used in the rest of the model script.

Likewise, in R, if your model script my_model.R contains

n_convolutional_layers <- 4
batch_size <- 128

you should change those lines to

n_convolutional_layers <- hyperparams[["nconv_layers"]]
batch_size <- hyperparams[["batch_size"]]

Step 2: Define the metric you would like to minimize

If your model is written in Python, either define a Keras history object named history (such as the return value of a model.fit() call; the validation loss will be minimized), e.g.,

history = model.fit(x_train, y_train, ...)

or define a single number named val_to_return that contains the metric you would like to minimize, e.g.,

score = model.evaluate(x_test, y_test)
val_to_return = score[0]

If your model is written in R, define a single number named val_to_return that contains the metric you would like to minimize, e.g.,

val_to_return <- my_validation_loss
Note on minimization metric

Only the bayesian workflow actually uses the minimization metric: by definition, it needs a measure of how well prior sets of hyperparameters performed in order to determine the next sets of hyperparameters to try. The grid workflow, by contrast, runs training on all sets of hyperparameters regardless of how well prior sets performed, so it never actually uses the minimization metric. However, the val_to_return variable (or the history object in Python) is always required, so when running the grid workflow and you don't care to return any particular result from your model script, simply set it to a dummy value such as -7.

Typical physical values assigned to val_to_return include the training, testing, or validation loss (for a machine/deep learning model) or the workflow runtime (for optimizing workflow runtimes as in, e.g., benchmarking).
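To make these two steps concrete, below is a minimal sketch of a fully adapted Python model script (a hypothetical my_model.py; it assumes Keras/TensorFlow is available in the Python environment used to run the model). Note that the hyperparams dictionary is not defined in the script itself; CANDLE supplies it at runtime.

# my_model.py: a hypothetical, minimal model script adapted for CANDLE.
# The hyperparams dictionary is not defined here; CANDLE supplies it at runtime.
from tensorflow import keras

# Step 1: pull the tunable settings from the hyperparams dictionary
batch_size = hyperparams['batch_size']
epochs = hyperparams['epochs']

# Build and train a small dense network on the MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

# Step 2, option A: keep a Keras history object named history;
# the validation loss it records will be minimized
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                    validation_split=0.1)

# Step 2, option B: alternatively, set a single number named val_to_return
# val_to_return = model.evaluate(x_test, y_test)  # test loss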

Creating Settings Files

There are three files you need to create in order to use CANDLE to run your own model script:

  1. A file containing the default hyperparameter settings, e.g., default_params.txt
  2. A file specifying how you would like to vary the hyperparameters in order to run your model script with different settings, e.g., grid_workflow-1.txt
  3. A CANDLE "submission" script, e.g., submit_candle_job.sh

However, rather than creating these files from scratch, the easiest way to use CANDLE for your own work is to adapt one of the templates above to your use case. Feel free to run candle import-template <TEMPLATE> with different <TEMPLATE> settings and examine the files that are copied over in order to better understand what they do and the types of settings they can contain.

File 1: Default hyperparameters file

This is a .txt file containing the default hyperparameter settings, some or all of whose values will be overwritten by those specified in the workflow settings file, below. It has the format of a Python configparser configuration file, e.g.:

[Global Params]
epochs = 20
batch_size = 128
activation = 'relu'
optimizer = 'rmsprop'
num_filters = 32

Every hyperparameter specified in the model script must have a default setting specified here.
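Since the file follows the configparser format, a quick way to sanity-check its syntax is to parse it yourself, e.g., with the following minimal Python sketch (illustrative only; it assumes the file is named default_params.txt in the current directory):

# check_params.py: sanity-check a default hyperparameters file using
# Python's standard configparser module (illustrative only)
import configparser

config = configparser.ConfigParser()
config.read('default_params.txt')

# Print every default hyperparameter in the [Global Params] section
for key, value in config['Global Params'].items():
    print(key, '=', value)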

File 2: Workflow settings file

This is a .txt file (grid workflow) or a .R file (bayesian workflow) specifying how some or all of the hyperparameters defined in the model script (and in the default hyperparameters file) are to be varied during the HPO workflow. The filename MUST begin with <WORKFLOW_TYPE>_workflow-, where <WORKFLOW_TYPE> is grid or bayesian.

Note: Python’s False, True, and None should be replaced by JSON’s false, true, and null inside this file.

grid workflow

In the workflow settings file for the grid workflow, each line must be a JSON string specifying the values of the hyperparameters to use in each job, and each string must contain an id key containing a unique name for the hyperparameter set, e.g.:

{"id": "hpset_01", "epochs": 15, "activation": "tanh"}
{"id": "hpset_02", "epochs": 30, "activation": "tanh"}
{"id": "hpset_03", "epochs": 15, "activation": "relu"}
{"id": "hpset_04", "epochs": 30, "activation": "relu"}
{"id": "hpset_05", "epochs": 10, "batch_size": 128}
{"id": "hpset_06", "epochs": 10, "batch_size": 256}
{"id": "hpset_07", "epochs": 10, "batch_size": 512}

Note: This example implies that the epochs, activation, and batch_size hyperparameters must be defined in the default hyperparameters file. It further shows that the full "grid" of values need not be run in the grid workflow; in fact, you can customize by hand every set of hyperparameter values that you'd like to run.

Alternatively, you can use the generate-grid candle command to create a file called grid_workflow-XXXX.txt containing a full "grid" of hyperparameters. The usage is candle generate-grid <PYTHON-LIST-1> <PYTHON-LIST-2> ..., where each <PYTHON-LIST> is a Python list whose first element is a string containing the hyperparameter name and whose second element is an iterable of hyperparameter values (numpy functions can be accessed using the np variable). For example, running

[user@biowulf]$ candle generate-grid "['nlayers',np.arange(5,15,2)]" "['dir',['x','y','z']]"

will create a file called grid_workflow-XXXX.txt with the contents

{"id": "hpset_00001", "nlayers": 5, "dir": "x"}
{"id": "hpset_00002", "nlayers": 5, "dir": "y"}
{"id": "hpset_00003", "nlayers": 5, "dir": "z"}
{"id": "hpset_00004", "nlayers": 7, "dir": "x"}
{"id": "hpset_00005", "nlayers": 7, "dir": "y"}
{"id": "hpset_00006", "nlayers": 7, "dir": "z"}
{"id": "hpset_00007", "nlayers": 9, "dir": "x"}
{"id": "hpset_00008", "nlayers": 9, "dir": "y"}
{"id": "hpset_00009", "nlayers": 9, "dir": "z"}
{"id": "hpset_00010", "nlayers": 11, "dir": "x"}
{"id": "hpset_00011", "nlayers": 11, "dir": "y"}
{"id": "hpset_00012", "nlayers": 11, "dir": "z"}
{"id": "hpset_00013", "nlayers": 13, "dir": "x"}
{"id": "hpset_00014", "nlayers": 13, "dir": "y"}
{"id": "hpset_00015", "nlayers": 13, "dir": "z"}

Note: The candle module must be loaded in order to run any of the candle commands such as generate-grid. Also, feel free to overwrite the "XXXX" in the filename that's generated.

A more complete example producing a 600-line file (600 sets of hyperparameters) is

[user@biowulf]$ candle generate-grid "['john',np.arange(5,15,2)]" "['single_num',[4]]" "['letter',['x','y','z']]" "['arr',[[2,2],None,[2,2,2],[2,2,2,2]]]" "['smith',np.arange(-1,1,0.2)]"

No spaces can be present in any of the arguments to the generate-grid command.
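If you'd like to see conceptually what generate-grid does, or to generate a workflow settings file with your own script, the following minimal Python sketch (an illustration, not the actual implementation of generate-grid) produces the same kind of JSON-lines output as the first example above:

# make_grid.py: an illustrative equivalent of candle generate-grid
# (not the actual implementation), producing a JSON-lines grid file
import itertools
import json

import numpy as np

# Hyperparameter names and the values each can take on
hyperparams = [
    ('nlayers', np.arange(5, 15, 2)),
    ('dir', ['x', 'y', 'z']),
]

names = [name for name, _ in hyperparams]
value_lists = [values for _, values in hyperparams]

with open('grid_workflow-custom.txt', 'w') as f:
    # itertools.product varies the last hyperparameter fastest,
    # matching the ordering shown in the example above
    for i, combo in enumerate(itertools.product(*value_lists), start=1):
        line = {'id': 'hpset_%05d' % i}
        for name, value in zip(names, combo):
            # np.arange yields numpy scalars; convert them for JSON serialization
            line[name] = value.item() if isinstance(value, np.generic) else value
        f.write(json.dumps(line) + '\n')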

bayesian workflow

The workflow settings file for the bayesian workflow must contain a variable named param.set that uses the makeParamSet function in the ParamHelpers R package to return a list of parameters. Each argument to makeParamSet() is a special constructor function defining the values that each hyperparameter can take on during the HPO workflow. For example, the bayesian template contains a file called bayesian_workflow-nt3_nightly.R with the contents

param.set <- makeParamSet(
  makeDiscreteParam("batch_size", values = c(16, 32)),
  makeIntegerParam("epochs", lower = 2, upper = 5),
  makeDiscreteParam("optimizer", values = c("adam", "sgd", "rmsprop", "adagrad", "adadelta")),
  makeNumericParam("drop", lower = 0, upper = 0.9),
  makeNumericParam("learning_rate", lower = 0.00001, upper = 0.1)
)

which defines the possible values that the hyperparameters batch_size, epochs, optimizer, drop, and learning_rate can take on during the running of the bayesian workflow. Please see the Param help page for individual usage of each type of constructor function.

Please see the mlrMBO package documentation for more information on how the bayesian workflow works in CANDLE.

File 3: CANDLE "submission" script

You only need to modify six settings inside the submission script. All variables should be preceded by an export command. Please use full pathnames (e.g., /path/to/file.py instead of file.py) and examine the sample settings below to better understand their meaning.

Required variables
MODEL_SCRIPT
This should point to the Python or R script that you would like to run. E.g., export MODEL_SCRIPT="/data/$USER/candle/mnist.py". This script must have been adapted to work with CANDLE (see the previous section). The filename extension will automatically determine whether Python or R will be used to run the model.
DEFAULT_PARAMS_FILE
Default settings for the hyperparameters defined in the model. E.g., export DEFAULT_PARAMS_FILE="/data/$USER/candle/mnist_default_params.txt".
WORKFLOW_SETTINGS_FILE
This file contains the settings parametrizing the workflow you would like to run. E.g., export WORKFLOW_SETTINGS_FILE="/data/$USER/candle/grid_workflow-mnist.txt". Again, the filename MUST begin with <WORKFLOW_TYPE>_workflow-, where <WORKFLOW_TYPE> is grid (.txt file) or bayesian (.R file).
NGPUS
Number of GPUs you would like to use for the CANDLE job. E.g., export NGPUS=2. Note: One (grid workflow) or two (bayesian workflow) extra GPUs will be allocated in order to run background processes.
GPU_TYPE
Type of GPU you would like to use. E.g., export GPU_TYPE="k80". The choices on Biowulf are k20x, k80, p100, and v100.
WALLTIME
How long you would like your job to run (the wall time of your entire job including all hyperparameter sets). E.g., export WALLTIME="00:20:00". Format is HH:MM:SS. When in doubt, round up so that the job is most likely to complete (if it doesn't, use the RESTART_FROM_EXP setting, below).
Optional variables

Python models only

PYTHON_BIN_PATH
If you don’t want to use the Python version with which CANDLE was built (currently python/3.6), you can set this to the location of the Python binary you would like to use. Examples:
export PYTHON_BIN_PATH="$CONDA_PREFIX/envs/<YOUR_CONDA_ENVIRONMENT_NAME>/bin"
export PYTHON_BIN_PATH="/data/BIDS-HPC/public/software/conda/envs/main3.6/bin"
If set, it will override the setting of EXEC_PYTHON_MODULE, below.
EXEC_PYTHON_MODULE
If you’d prefer loading a module rather than specifying the path to the Python binary (above), set this to the name of the Python module you would like to load. E.g., export EXEC_PYTHON_MODULE="python/2.7". This setting will have no effect if PYTHON_BIN_PATH (above) is set. If neither PYTHON_BIN_PATH nor EXEC_PYTHON_MODULE is set, then the version of Python with which CANDLE was built (currently python/3.6) will be used.
SUPP_PYTHONPATH
This is a supplementary setting of the PYTHONPATH variable that will be searched for libraries that can’t otherwise be found. Examples:
export SUPP_PYTHONPATH="/data/BIDS-HPC/public/software/conda/envs/main3.6/lib/python3.6/site-packages"
export SUPP_PYTHONPATH="/data/$USER/conda/envs/my_conda_env/lib/python3.6/site-packages"

Note: Multiple paths can be set by separating them with a colon.

R models only

EXEC_R_MODULE
If you don’t want to use the R version with which CANDLE was built (currently R/3.5.0), set this to the name of the R module you would like to load. E.g., export EXEC_R_MODULE="R/3.6".
SUPP_R_LIBS
This is a supplementary setting of the R_LIBS variable that will be searched for libraries that can’t otherwise be found. E.g., export SUPP_R_LIBS="/data/BIDS-HPC/public/software/R/3.6/library". Note: R will search your standard library location on Biowulf (~/R/%v/library), so feel free to just install your own R libraries there.

Models written in either language

SUPP_MODULES
Modules you would like to have loaded while your model is run. E.g., export SUPP_MODULES="CUDA/10.0 cuDNN/7.5/CUDA-10.0" (these particular example settings are necessary for running TensorFlow when using a custom Conda installation).
EXTRA_SCRIPT_ARGS
Command-line arguments you’d like to include when invoking python or Rscript. E.g., for R model scripts, export EXTRA_SCRIPT_ARGS="--max-ppsize=100000". In other words, the model will ultimately be run like python $EXTRA_SCRIPT_ARGS my_model.py or Rscript $EXTRA_SCRIPT_ARGS my_model.R.
RESTART_FROM_EXP
If a grid workflow was run previously but for whatever reason did not complete (such as a too-low setting of WALLTIME), here you can specify the name of the experiment from which to resume. E.g., export RESTART_FROM_EXP="X002".
USE_CANDLE
Whether to use CANDLE to run a workflow (1, default) or to simply run the model on the default set of hyperparameters specified by DEFAULT_PARAMS_FILE (0). E.g., export USE_CANDLE=0. If set to 0, use an interactive node (e.g., sinteractive --gres=gpu:k20x:1 --mem=60G --cpus-per-task=16); rather than submitting a job to the batch queue, the job will run on the current node.

bayesian workflow only

DESIGN_SIZE
Total number of points to sample within the hyperparameter space prior to running the mlrMBO algorithm. E.g., export DESIGN_SIZE=9 (default 10). Note that this must be greater than or equal to the largest number of possible values for any discrete hyperparameter specified in the workflow settings file.
PROPOSE_POINTS
Number of points proposed and evaluated in each MBO iteration. E.g., export PROPOSE_POINTS=9 (default 10).
MAX_BUDGET
Maximum total number of function evaluations for all iterations combined. E.g., export MAX_BUDGET=180 (default 110).
MAX_ITERATIONS
Maximum number of sequential optimization steps. E.g., export MAX_ITERATIONS=3 (default 10).


The submission script should begin with #!/bin/bash, end with $CANDLE/wrappers/templates/scripts/run_workflows.sh, and should be called like

[user@biowulf]$ candle submit-job <SUBMISSION-SCRIPT>

Note: When USE_CANDLE=1 (the default behavior), the submission script will automatically request an sbatch job; the script should not be called using sbatch and should always be run from the Biowulf login node (as opposed to an interactive [compute] node). When USE_CANDLE=0, the submission script should be called the same way, but from an interactive node; this is the best way to test your job (with the default hyperparameter settings) without actually running a CANDLE workflow.
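Putting the pieces together, a minimal submission script (a sketch assembled from the sample settings above; adjust the paths and values for your own job) might look like:

#!/bin/bash

# Required settings (see the descriptions above)
export MODEL_SCRIPT="/data/$USER/candle/mnist.py"
export DEFAULT_PARAMS_FILE="/data/$USER/candle/mnist_default_params.txt"
export WORKFLOW_SETTINGS_FILE="/data/$USER/candle/grid_workflow-mnist.txt"
export NGPUS=2
export GPU_TYPE="k80"
export WALLTIME="00:20:00"

# Required last line: hand off to the CANDLE workflow runner
$CANDLE/wrappers/templates/scripts/run_workflows.sh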

Aggregating CANDLE Job Results

After a CANDLE job is complete, the results of all jobs run on each set of hyperparameters will be placed in a subdirectory of the experiments directory, which is created in the directory from which the job was submitted. A symbolic link called last-exp at the same level as the experiments directory will point to the latest experiment that was run.

Inside each experiment subdirectory is a run directory, which contains one subdirectory per hyperparameter set holding the results of the model script run with that hyperparameter set.

For example, here is a sample directory structure expanding one of the CANDLE experiments directories (X002):

.
├── experiments
│   ├── X000
│   ├── X001
│   └── X002
│       ├── cfg-sys-biowulf.sh
│       ├── grid_workflow-mnist.txt
│       ├── jobid.txt
│       ├── metadata.json
│       ├── output.txt
│       ├── run
│       │   ├── hpset_01
│       │   ├── hpset_02
│       │   ├── hpset_03
│       │   ├── hpset_04
│       │   ├── hpset_05
│       │   ├── hpset_06
│       │   └── hpset_07
│       ├── submit.sh
│       ├── turbine.log
│       ├── turbine-slurm.sh
│       ├── workflow.sh.log
│       └── workflow.tic
├── last-exp -> /data/doeja/candle/experiments/X002
└── submit_candle_job.sh

In order to collect the values of all the hyperparameter sets as well as the resulting metric for each set, run the aggregate-results candle command:

[user@biowulf]$ candle aggregate-results <EXP-DIR> [<RESULT-FORMAT>]

where <EXP-DIR> is the experiment directory, i.e., the one containing the run directory, and <RESULT-FORMAT> is an optional printf()-style format string specifying the output format for the metric. For example, if the r template/example were run inside the /data/$USER/candle directory, then running

[user@biowulf]$ candle aggregate-results /data/$USER/candle/last-exp

would produce a file called candle_results.csv in the current directory containing the data from all the jobs, sorted by increasing metric value, e.g.,

result,dirname,id,mincorr,maxcorr,number_cv,extfolds
000.796,hpset_00001,hpset_00001,0.200000,0.80,2,5
000.796,hpset_00004,hpset_00004,0.200000,0.80,5,5
000.837,hpset_00002,hpset_00002,0.200000,0.80,3,5
000.878,hpset_00003,hpset_00003,0.200000,0.80,4,5
000.905,hpset_00007,hpset_00007,0.200000,0.80,8,5
000.964,hpset_00005,hpset_00005,0.200000,0.80,6,5
000.964,hpset_00006,hpset_00006,0.200000,0.80,7,5
001.000,hpset_00008,hpset_00008,0.200000,0.80,9,5

This file can be further processed using Excel or any other method in order to study the results of the HPO.
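For example, here is a minimal Python sketch (assuming the pandas library is available in your environment) that loads the aggregated results and prints the best-performing hyperparameter set:

# inspect_results.py: a minimal sketch for examining candle_results.csv
# (assumes the pandas library is available)
import pandas as pd

df = pd.read_csv('candle_results.csv')

# The file is already sorted by increasing metric value, so the first
# row holds the best (lowest-metric) hyperparameter set
best = df.iloc[0]
print('Best hyperparameter set:', best['id'])
print(best)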

Note: Since the field names (the header line in the CSV output above) are extracted only once, the results of the aggregate-results command will not make sense if the hyperparameters that are varied are not the same for every hyperparameter set run using the grid workflow. For example, running this command on the results of the grid template/example will produce

result,dirname,id,epochs,activation
000.064,hpset_01,hpset_01,15,tanh
000.066,hpset_07,hpset_07,10,512
000.074,hpset_06,hpset_06,10,256
000.080,hpset_02,hpset_02,30,tanh
000.081,hpset_05,hpset_05,10,128
000.098,hpset_03,hpset_03,15,relu
000.121,hpset_04,hpset_04,30,relu

As usual, the full pathname must be used for <EXP-DIR>.

Summary of candle Commands

As long as the candle module is loaded (module load candle), the commands available to the candle program (in the format candle <COMMAND> <COMMAND-ARG-1> <COMMAND-ARG-2> ...) are as follows:

candle import-template <TEMPLATE>
Copy CANDLE template files to the current directory
candle generate-grid <PYTHON-LIST-1> <PYTHON-LIST-2> ...
Generate a hyperparameter grid for the grid search workflow
candle submit-job <SUBMISSION-SCRIPT>
Submit a CANDLE submission script
candle aggregate-results <EXP-DIR> [<RESULT-FORMAT>]
Create a CSV file called candle_results.csv containing the hyperparameters and corresponding evaluation metrics

Note: Leaving <COMMAND> blank or setting it to help will display this usage menu.

Promoting CANDLE and Your Work

If you've successfully used CANDLE to advance your work and you're willing to tell us about it, please email the SDSI team to tell us what you've done! We'd love to learn how users are using CANDLE to address their needs so that we can continue to improve CANDLE and its implementation on Biowulf.

Further, if you're willing to have your work promoted online, please include a representative graphic of your work, and upon review we'll post it here as an exemplar CANDLE success story. More exposure for you, more exposure for us!

Or, if you've unsuccessfully used CANDLE to advance your work, we'd love to help you out; please let us know what didn't work for you!

Contact Information

Feel free to email the SDSI team with any questions, comments, or suggestions.

For FAQ (coming soon), notices, links, and updates, please go to https://cbiit.github.com/sdsi/candle.

Finally, our team has expertise in building machine/deep learning models for a variety of situations (e.g., image segmentation, classification from RNA-seq data, etc.) and would be happy to help you build a model (independent of CANDLE) or point you in the right direction. (And, we are happy to collaborate!)