CANDLE (CANcer Distributed Learning Environment) is an open-source software platform providing deep learning methodologies that scale very efficiently on the world's fastest supercomputers. Initially developed to address three top challenges facing the cancer community, CANDLE can increasingly be used to tackle problems in other application areas. The SDSI team at the Frederick National Laboratory for Cancer Research, sponsored by the National Cancer Institute, has installed CANDLE on NIH's Biowulf supercomputer for anyone with Biowulf access to use.
One of CANDLE's primary attributes is its functionality for performing hyperparameter optimization (HPO). In a machine/deep learning model, "hyperparameters" refer to any variables that define the model aside from the model’s "weights." For a given set of hyperparameters (typically 5-20), the corresponding model’s weights (typically tens of thousands) are iteratively optimized using algorithms such as gradient descent. Such optimization of the model’s weights – a process called "training" – is typically run very efficiently on graphics processing units (GPUs) when the model is a neural network (deep learning) and typically takes 30 minutes to a couple of days.
If a measure of loss is assigned to each model trained on the same dataset, we ultimately would like to choose the model (i.e., the set of hyperparameters) that best fits that dataset by minimizing the loss. HPO is the process of choosing this best set of hyperparameters. The most common way of determining the optimal set of hyperparameters is to run one training job for every desired combination of hyperparameters and choose the one that produces the lowest loss. Such a workflow is labeled "grid" in CANDLE (in other contexts it is called a "grid search"). Another way of determining the optimal set of hyperparameters is to use a Bayesian approach, in which information about how well prior sets of hyperparameters performed is used to select the next sets of hyperparameters to try. This type of workflow is labeled "bayesian" in CANDLE.
Finally, HPO need not be used for only machine/deep learning
applications; it can be applied to any computational pipeline
that can be parametrized by a number of settings. This web
page serves as a complete guide to running CANDLE on Biowulf.
Slides and a recording are available for the talk Hyperparameter Optimization Using CANDLE on Biowulf, presented on 1/19/21 by the NCI Data Science Learning Exchange.
Why use CANDLE in the first place? For example, why not just
submit a swarm
of jobs, each using a different set of hyperparameters?
- Load balancing. CANDLE uses a program called Swift/T to ensure the resources (CPUs or GPUs) allocated to you by Biowulf's batch system (SLURM) are used as efficiently as possible with minimal downtime. Since often the jobs will not take the same or even similar amounts of time, using a utility like Swarm could lead to significant downtime on some of the allocated resources. CANDLE, and Swift/T in particular, will submit a job to a resource only if the resource is ready to take another job. This way, there will be minimal downtime on any of the resources even if the individual jobs take different amounts of time.
- Intelligent hyperparameter selection. Further, if the bayesian workflow is selected, the sets of hyperparameters to run need not be known beforehand; only the space of hyperparameters need be specified, and only sets of hyperparameters that the Bayesian algorithm determines will most likely minimize a particular metric will be run. In other words, CANDLE allows you to intelligently generate the best sets of hyperparameters to try on-the-fly.
- Ready-to-use framework. Our implementation of CANDLE on Biowulf requires that you do the absolute bare minimum needed to run hyperparameter optimizations. CANDLE has been, and continues to be, actively developed and perfected to perform at a high level on large HPC systems such as Biowulf.
These steps will get you running a sample CANDLE job on Biowulf right away!
Step 1: Set up your environment
Once logged in to Biowulf, set up your environment by creating and entering a working directory in your /data/$USER (not /home/$USER) directory and loading the candle module (user input in bold):
[user@biowulf]$ mkdir -p /data/$USER/candle
[user@biowulf]$ cd /data/$USER/candle
[user@biowulf]$ module load candle
Step 2: Copy a template submission script to the working directory
Copy one of the four CANDLE templates to the working directory:
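[user@biowulf]$ candle import-template <TEMPLATE>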
Possible values of <TEMPLATE> are:
- grid: Grid search using a Python model (a simple deep neural network on the MNIST dataset; ~5 min. total runtime)
- bayesian: Bayesian search using a Python model (one of the JDACS4C Pilot 1 models, a 1D convolutional network for classifying RNA-Seq gene expression profiles into normal or tumor tissue categories; ~40 min. total runtime)
- r: Grid search using an R model (feature reduction on the TNBC dataset; ~5 min. total runtime)
- bash: Grid search using a bash model (a bash wrapper around the grid example above, i.e., the simple deep neural network on the MNIST dataset; ~5 min. total runtime)
Step 3: Run the job
Submit the job by running:
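[user@biowulf]$ candle submit-job <INPUT-FILE>
where <INPUT-FILE> is the input file (ending in .in) that import-template copied into the working directory.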
This section contains a summary of steps for running your own
CANDLE job, which are detailed in the following sections.
- Ensure your model script already works on a Biowulf compute node. The model must be written in Python, R, or bash.
- Adapt your model script to work with CANDLE. Only two minor modifications need be made:
  - Specify the hyperparameters using the candle_params dictionary (Python) or data.frame (R).
  - Specify a return value using the candle_value_to_return variable (or a Keras history object if using Python+Keras).
  - The bash example is a bit more involved; see the example or email us for assistance.
- Load the candle module: module load candle. Among other things, this sets the value of the $CANDLE environment variable.
- Create a single CANDLE input file. This is most easily done by modifying one of the template input files, which can be imported using candle import-template {grid,bayesian,r,bash}.
- Confirm your model runs using CANDLE without running a full workflow. Request an interactive node from SLURM, set run_workflow=0 in the &control section of the input file, and run the model script using candle submit-job <INPUT-FILE> as usual.
- Submit the CANDLE job using candle submit-job <INPUT-FILE>. Ensure this is done from the /data/$USER directory (as opposed to /home/$USER).
- Collect the job results using candle aggregate-results <EXP-DIR> [<RESULT-FORMAT>].
Prior to adapting your own model script (i.e., machine/deep
learning model or general workflow) for use with CANDLE, you
must ensure it runs standalone on a Biowulf compute node. Skipping
this step is the most common error new CANDLE users make.
If your model does not work outside of CANDLE, you cannot
expect it to work using CANDLE!
See the Biowulf
user guide for information on running scripts on
Biowulf. For example, you can test a model that utilizes a GPU
by requesting an interactive GPU node (e.g., sinteractive
--gres=gpu:k80:1 --mem=20G
) and then running the
model like, e.g., python my_model_script.py
or Rscript
my_model_script.R
; don’t forget to use the correct
version of Python or R, if required!
Once you have confirmed that your model script runs as-is on
Biowulf, modify it in two simple ways. Note that while CANDLE
accepts model scripts written in Python, R, or bash, the
following addresses model scripts written in only Python or R;
as bash model scripts are more involved, see the bash example or email us for
assistance.
Step 1: Specify the hyperparameters
Specify the hyperparameters in your code using a variable
named candle_params
of the dictionary (Python)
or data.frame (R) datatypes. E.g., in Python, if your model
script my_model_script.py
contains
batch_size = 128
and this is a parameter whose value you'd like to vary during the CANDLE workflow, you should change that line to
batch_size = candle_params['batch_size']
Note: The "key" in the candle_params
dictionary should match the variable names in the CANDLE input
file (following section),
whereas the variables to which they are assigned in the model
script should obviously match the names used in the rest of
the script.
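For instance, a minimal sketch of the top of a Python model script (the hyperparameter names here are illustrative) might look like:
# Pull hyperparameters from the candle_params dictionary supplied by CANDLE;
# each key must match a hyperparameter name defined in the CANDLE input file.
batch_size = candle_params['batch_size']
epochs = candle_params['epochs']
optimizer = candle_params['optimizer']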
Likewise, in R, if your model script my_model_script.R
contains
batch_size <- 128
you should change that line to
batch_size <- candle_params[["batch_size"]]
Step 2: Define the metric you would like to minimize
If your model is written in Python, either define a Keras
history object named history
(as in, e.g., the
return value of a model.fit()
method; validation
loss will be minimized), e.g. (a minimal sketch in the spirit of the MNIST template; the data variables x_train, y_train, x_test, and y_test are illustrative):
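history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))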
or define a single number named candle_value_to_return
that contains the value you would like to minimize, e.g.,
candle_value_to_return = score[0]
Note: Assuming you have named your Keras model model
as in the example above, if you are using as your minimization
metric the return value from model.fit()
(as
opposed to using candle_value_to_return
), you
must specify the validation_data
keyword in the
call to model.fit()
as shown above. This way the
history
attribute of model.fit()
's
return value will contain a key called val_loss
,
which is the metric that CANDLE will use to evaluate the
current set of hyperparameters. (Choosing the best set of
hyperparameters based on a holdout dataset such as a
validation dataset is good practice anyway!) (If your model
still doesn't seem to generate a val_loss
key,
try adding metrics=['accuracy']
in the call to model.compile()
.)
If your model is written in R, define a single number named candle_value_to_return
that contains the metric you would like to minimize, e.g. (validation_loss below is an illustrative placeholder for whatever single number your script computes):
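candle_value_to_return <- validation_loss  # validation_loss is an illustrative variable name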
Note on minimization metric
Only the bayesian
workflow actually uses
the minimization metric since by definition in order for it to
determine the next sets of hyperparameters to try it needs a
measure of how "well" prior sets of hyperparameters
performed. Since the grid
workflow by
definition runs training on all sets of
hyperparameters regardless of any measure of how "well" prior
sets performed, it never actually uses the
minimization metric. However, the candle_value_to_return
variable (or the history
object in Python) is always
required, so when running the grid
workflow and
you don't care to return any particular result from your model
script, simply set it to a dummy value such as -7
.
That said, please keep in mind that the minimization metric will be used if you run the aggregate-results command to candle to automatically inspect your CANDLE job; it is therefore still good practice to define a minimization metric in your model.
Typical physical values assigned to candle_value_to_return
include the training, testing, or validation loss (for a
machine/deep learning model) or the workflow runtime (for
optimizing workflow runtimes as in, e.g., benchmarking).
In order to use CANDLE to run your own model script, you need
to create an input file containing three sections:
- A &control section containing general settings
- A &default_model section containing the default hyperparameter values
- A &param_space section specifying the space of possible values of the hyperparameters
The input file should have a .in
extension,
though it is not required. A typical input file looks like:
&control
model_script="$(pwd)/mnist_mlp.py"
workflow="grid"
ngpus=2
gpu_type="k80" # this is a sample inline comment
walltime="00:20:00"
run_workflow = 1
/
&default_model
epochs=20
batch_size=128
activation='relu'
optimizer='rmsprop'
num_filters=32
/
# This is a sample full-line comment
¶m_space
{"id": "hpset_01", "epochs": 15, "activation": "tanh"}
{"id": "hpset_02", "epochs": 30, "activation": "tanh"}
{"id": "hpset_03", "epochs": 15, "activation": "relu"}
{"id": "hpset_04", "epochs": 30, "activation": "relu"}
{"id": "hpset_05", "epochs": 10, "batch_size": 128}
{"id": "hpset_06", "epochs": 10, "batch_size": 256}
{"id": "hpset_07", "epochs": 10, "batch_size": 512}
/
Each section must be preceded by the section name (preceded
with an ampersand symbol &
) on a separate
line and followed by a forward slash /
on a
separate line. The sections can appear in any order and their
names must be one of control
, default_model
,
or param_space
.
Comments preceded by pound signs #
are allowed,
either on part of a line or on the entire line, just as in the
bash programming language; see the sample input file above.
The three sections of the input file are explained in more
detail below. The first two (&control
and &default_model
)
consist of settings of the format left-hand-side =
right-hand-side
. Spaces on either side of the equals
sign =
do not matter. The third section (&param_space
)
has a different format depending on whether the grid
or bayesian
workflows are specified by the workflow
setting in the &control
section.
In general, files should always use absolute paths,
e.g., /path/to/myfile.ext
instead of myfile.ext
.
In order to enforce this for files present in the same
directory from which candle submit-job
<INPUT-FILE>
is called, you can use $(pwd)
as the path in the &control
section, e.g., $(pwd)/myfile.ext
.
While not necessary, strings may be quoted (err on the side
of double quotes "
). Finally, whitespace at the
beginnings of the lines making up the section bodies does not
have any effect aside from making the input file easier to
read.
Tip: A useful way to remember the section names,
format, and typical settings is to adapt any of the templates
above (i.e., grid
,
bayesian
, r
, or bash
)
to your use case. Feel free to run candle
import-template <TEMPLATE>
with different <TEMPLATE>
settings and examine the
input file that is copied over in order to better understand
what it does and the types of settings it can contain.
Section 1: &control
Notes: String settings in this section (and only in this section) can include calls to bash, e.g., model_script = "/data/$USER/candle/mnist.py" or model_script="$(pwd)/mnist_mlp.py".
Internally, all settings in this section are converted to
uppercase bash variables prepended by "CANDLE_", e.g., the
value assigned to the model_script
setting
actually gets assigned to the bash variable $CANDLE_MODEL_SCRIPT
.
Models written in either language
- model_script (Required): The Python, R, or bash script that you would like to run. E.g., model_script = "/data/$USER/candle/mnist.py". This script must have been adapted to work with CANDLE (see the previous section). The filename extension automatically determines whether Python, R, or bash will be used to run the model.
- workflow (Required): Which CANDLE workflow to use. Currently supported are the grid and bayesian workflows. E.g., workflow = "grid".
- worker_type: Either cpu or the type of GPU (k20x, k80, p100, v100, or v100x) you would like to use to run your model_script on each set of hyperparameters; the number of workers is specified by the nworkers keyword below. Default is k80.
- nworkers: Number of workers (CPUs or GPUs, depending on the setting of worker_type above) you would like to use for the CANDLE job. E.g., nworkers = 2. Default is 1. Note: One (grid workflow) or two (bayesian workflow) extra CPU processes will be allocated in order to run background processes.
- nthreads: The number of CPUs to use per MPI task. Default is 1.
- walltime: How long you would like your job to run (the wall time of your entire job, including all hyperparameter sets). E.g., walltime = "00:20:00". Format is HH:MM:SS. When in doubt, round up so that the job is most likely to complete. Default is 00:05:00.
- custom_sbatch_args: Custom arguments to SLURM's batch processor, sbatch, such as an lscratch setting. Default is empty.
- mem_per_cpu: Memory in GB to request from SLURM per CPU process. Default is 7.
- supp_modules: Modules you would like to have loaded while your model is run. E.g., supp_modules = "CUDA/10.0 cuDNN/7.5/CUDA-10.0" (these particular example settings may be necessary for running TensorFlow when using a custom Conda installation). Default is empty.
- extra_script_args: Command-line arguments you'd like to include when invoking python or Rscript. E.g., for R model scripts, extra_script_args = "--max-ppsize=100000". In other words, the model will ultimately be run like python $EXTRA_SCRIPT_ARGS my_model_script.py or Rscript $EXTRA_SCRIPT_ARGS my_model_script.R. Default is empty.
- run_workflow: Whether (1, default) or not (0) to run the actual workflow specified by the workflow keyword. If set to 0, a single run of the model using the hyperparameters specified in the &default_model section of the input file will be performed on the current machine, so run_workflow=0 must be used only on an interactive node. Testing your model first with run_workflow=0 is a crucial part of the CANDLE testing procedure; see below.
- dry_run: Whether (1) or not (0, default) to simply set up the CANDLE job without actually submitting it to SLURM. This allows you to study the automatically generated files and settings if you suspect something in CANDLE proper is going awry.
bayesian workflow only
- design_size: Total number of points to sample within the hyperparameter space prior to running the mlrMBO algorithm. E.g., design_size = 9 (default 10). Note that this must be greater than or equal to the largest number of possible values for any discrete hyperparameter specified in the &param_space section. A reasonable value for this (and for propose_points, below) is 15-20.
- propose_points: Number of proposed (really, evaluated) points at each MBO iteration. E.g., propose_points = 9 (default 10). A reasonable value for this (and for design_size, above) is 15-20.
- max_budget: Maximum total number of function evaluations for all iterations combined. E.g., max_budget = 180 (default 110).
- max_iterations: Maximum number of sequential optimization steps. E.g., max_iterations = 3 (default 10).
Python models only
- python_bin_path: If you don't want to use the Python version with which CANDLE was built (currently python/3.7), you can set this to the location of the Python binary you would like to use. Examples:
  python_bin_path = "$CONDA_PREFIX/envs/<YOUR_CONDA_ENVIRONMENT_NAME>/bin"
  python_bin_path = "/data/BIDS-HPC/public/software/conda/envs/main3.6/bin"
  If set, it will override the setting of exec_python_module, below. Default is empty.
- exec_python_module: If you'd prefer loading a module rather than specifying the path to the Python binary (above), set this to the name of the Python module you would like to load. E.g., exec_python_module = "python/3.8". This setting will have no effect if python_bin_path (above) is set. If neither python_bin_path nor exec_python_module is set, then the version of Python with which CANDLE was built (currently python/3.7) will be used. Default is empty.
- supp_pythonpath: A supplementary setting of the $PYTHONPATH environment variable that will be searched for libraries that can't otherwise be found. Examples:
  supp_pythonpath = "/data/BIDS-HPC/public/software/conda/envs/main3.6/lib/python3.6/site-packages"
  supp_pythonpath = "/data/$USER/conda/envs/my_conda_env/lib/python3.6/site-packages"
  Default is empty. Tip: Multiple paths can be set by separating them with a colon.
- dl_backend: Deep learning library to use. E.g., dl_backend = "pytorch". Should be either keras (default) or pytorch. Only required if deep learning using Keras or PyTorch is requested; e.g., this is not used if only machine learning using scikit-learn is employed. Note: This keyword is probably unnecessary in all cases, e.g., even setting it to keras when PyTorch is actually used does not seem to matter.
R models only
- exec_r_module: If you don't want to use the R version with which CANDLE was built (currently R/4.0.0), set this to the name of the R module you would like to load. E.g., exec_r_module = "R/3.6.0". Default is empty.
- supp_r_libs: A supplementary setting of the $R_LIBS environment variable that will be searched for libraries that can't otherwise be found. E.g., supp_r_libs = "/data/BIDS-HPC/public/software/R/3.6/library". Default is empty. Tip: R will search your standard library location on Biowulf (~/R/%v/library), so feel free to just install your own R libraries there.
Section 2: &default_model
This section contains the default settings of the
hyperparameters defined in the model script, some or all of
whose values will be overwritten by those specified in the &param_space
section, below. Every hyperparameter specified in the model
script must have a default setting specified here.
This section should be otherwise self-explanatory from the
sample &default_model
section above.
Tip: This is a great place to define constants in your model script (such as a URL from where the training data should be downloaded), rather than hardcoding them into the model script. For example (using a hypothetical data_url setting), you can replace a line like
data_url = "https://example.com/training_data.csv"
in your Python model script with
data_url = candle_params["data_url"]
and place the line
data_url = "https://example.com/training_data.csv"
in the &default_model section of the CANDLE input file. This way, all settings can be changed from a single input file.
Note: If you wish, you can replace the contents of
this section with the full path to a file as a value to the candle_default_model_file
keyword, e.g., candle_default_model_file =
$CANDLE/Benchmarks/Pilot1/NT3/nt3_default_model.txt
.
The contents of this file should start with a line containing
[Global_Params]
.
Section 3: &param_space
This section specifies how some or all of the hyperparameters defined in the model script (and the &default_model section) are to be varied during a hyperparameter optimization workflow.
Note: If you wish, you can replace the contents of
this section with the full path to a file as a value to the candle_param_space_file
keyword, e.g., candle_param_space_file =
$CANDLE/Supervisor/workflows/mlrMBO/data/nt3_nightly.R
.
grid
workflow
The grid
workflow refers to a "grid
search" hyperparameter optimization in which generally
the hyperparameters are varied evenly throughout a specified
parameter space.
In the &param_space
section for this
workflow, each line must be a JSON string specifying the
values of the hyperparameters to use in each job, and each
string must contain an id
key containing a
unique name for the hyperparameter set, e.g.:
{"id": "hpset_02", "epochs": 30, "activation": "tanh"}
{"id": "hpset_03", "epochs": 15, "activation": "relu"}
{"id": "hpset_04", "epochs": 30, "activation": "relu"}
{"id": "hpset_05", "epochs": 10, "batch_size": 128}
{"id": "hpset_06", "epochs": 10, "batch_size": 256}
{"id": "hpset_07", "epochs": 10, "batch_size": 512}
Note: This example implies that the epochs
,
activation
, and batch_size
hyperparameters must be defined in the &default_model
section. It further shows that the full "grid" of values need
not be run in the grid
workflow; in fact, you
can customize by hand every set of hyperparameter values that
you'd like to run.
Note: Python's False, True, and None should be replaced by JSON's false, true, and null in the &param_space section for the grid workflow.
Alternatively, you can use the generate-grid
candle
command to create a file called hyperparameter_grid.txt
(inside a directory called candle_generated_files
)
containing a full "grid" of hyperparameters. The usage is candle
generate-grid <PYTHON-LIST-1> <PYTHON-LIST-2>
...
, where each <PYTHON-LIST>
is
a Python list
whose first element is a string
containing the hyperparameter name and the second argument is
an iterable of hyperparameter values (numpy
functions can be accessed using the np
variable). For example, running a command along the following lines (a reconstruction consistent with the grid shown below)
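candle generate-grid "['nlayers',range(5,14,2)]" "['dir',['x','y','z']]"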
will create a file called hyperparameter_grid.txt
with the contents
{"id": "hpset_00002", "nlayers": 5, "dir": "y"}
{"id": "hpset_00003", "nlayers": 5, "dir": "z"}
{"id": "hpset_00004", "nlayers": 7, "dir": "x"}
{"id": "hpset_00005", "nlayers": 7, "dir": "y"}
{"id": "hpset_00006", "nlayers": 7, "dir": "z"}
{"id": "hpset_00007", "nlayers": 9, "dir": "x"}
{"id": "hpset_00008", "nlayers": 9, "dir": "y"}
{"id": "hpset_00009", "nlayers": 9, "dir": "z"}
{"id": "hpset_00010", "nlayers": 11, "dir": "x"}
{"id": "hpset_00011", "nlayers": 11, "dir": "y"}
{"id": "hpset_00012", "nlayers": 11, "dir": "z"}
{"id": "hpset_00013", "nlayers": 13, "dir": "x"}
{"id": "hpset_00014", "nlayers": 13, "dir": "y"}
{"id": "hpset_00015", "nlayers": 13, "dir": "z"}
Note: The candle
module must be loaded
in order to run any of the candle
commands such
as generate-grid
.
The contents of the file hyperparameter_grid.txt
should then be placed in the body of the &param_space
section of the input file, or, as mentioned above, the file
could be pointed to by a candle_param_space_file
keyword setting in the body of this section.
A more complete example producing a 600-line file (600 sets
of hyperparameters) is
No spaces can be present in any of the arguments to the generate-grid
command.
Note: Use Python’s False
, True
,
and None
if using the generate-grid
command; the output in hyperparameter_grid.txt
will replace these with JSON’s false
, true
,
and null
, respectively, as required in this
section of the input file.
bayesian
workflow
The bayesian
workflow refers to a Bayesian-based
hyperparameter optimization in which information about
how well prior sets of hyperparameters performed is used to
determine the next sets of hyperparameters to try. In this way
the HPO algorithm does not sample the full space of
hyperparameter values and instead iteratively homes in on the
best set of hyperparameters. Compared to a full grid search,
this can save significant time when the hyperparameter space
is large and the model takes a long time to run on the
training data. One drawback is that it is more difficult to
observe exactly how each hyperparameter or hyperparameter
combination directly affects the model's performance.
The Bayesian algorithm used in CANDLE is an R package called
mlrMBO. Briefly, after the (hyper)parameter space has been
defined, the algorithm chooses design_size
evenly-spaced points throughout the space and runs the model
on those design_size
sets of hyperparameters. A
random forest model (called a "surrogate model") then fits the
hyperparameters run to their resulting performance metrics
(specified either by the candle_value_to_return
variable or the history
variable as explained above) and produces propose_points
new sets of hyperparameters it believes may minimize the
metric. The model is then run on these new sets of
hyperparameters, after which the algorithm incorporates these
hyperparameters and their resulting performance metrics into
the surrogate model and then proposes propose_points
new sets of hyperparameters to try within the defined
parameter space. This process is repeated until convergence to
the "best" set of hyperparameters or if max_iterations
iterations have been run or max_budget
total
model runs have been performed.
For more details, please see the mlrMBO package documentation.
The &param_space
section for the bayesian
workflow is based on the makeParamSet
function in the ParamHelpers
R package. Each line in this section is what would be an
argument to makeParamSet()
(without the commas
separating the arguments); the formatting for this section
should be based on this argument format. It is relatively
intuitive to understand; e.g., here is the &param_space
section in the bayesian
template input file:
makeIntegerParam("epochs", lower = 2, upper = 5)
makeDiscreteParam("optimizer", values = c("adam", "sgd", "rmsprop", "adagrad", "adadelta"))
makeNumericParam("drop", lower = 0, upper = 0.9)
makeNumericParam("learning_rate", lower = 0.00001, upper = 0.1)
This defines the possible values that the hyperparameters batch_size
,
epochs
, optimizer
, drop
,
and learning_rate
can take on during the running
of the bayesian
workflow. Please see the Param
help page for individual usage of each type of
constructor function.
After adapting your model
script to work with CANDLE and creating a CANDLE input
file, you will almost be ready to run the hyperparameter
optimization (HPO) using CANDLE.
However, even though you should have already ensured that your original
model ran successfully on Biowulf, you should make sure that
you have adapted it to work with CANDLE and created the input
file successfully by running your model script using the
default set of hyperparameters (set in the &default_model
section) without running a full HPO workflow.
To do this, you should request an interactive node on Biowulf
(using e.g. sinteractive --gres=gpu:k80:1 --mem=20G
),
setting the run_workflow
keyword in the &control
section of the input file to 0
, and then running
the model script using candle submit-job
<INPUT-FILE>
. By observing the resulting output
to the screen and the results of the model script in the
generated file subprocess_out_and_err.txt
, you
will be able to confirm whether the model ran correctly.
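For example, a typical test session might look like the following (the compute node name and input-file name are illustrative):
[user@biowulf]$ sinteractive --gres=gpu:k80:1 --mem=20G
[user@cn1234]$ module load candle
[user@cn1234]$ candle submit-job my_input_file.in   # with run_workflow=0 set in the &control section
After the run finishes, inspect the generated subprocess_out_and_err.txt file to confirm that the model behaved as expected.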
It is crucial to perform this last check because it almost
always catches errors in the model itself or in its adaptation
to CANDLE that would show up anyway when running the full
CANDLE job (using run_workflow=1
). However,
catching the mistakes this way (using run_workflow=0
on an interactive node) allows you to correct these issues
quickly and to immediately test your changes. A successful test
using run_workflow=0
on an
interactive node will almost guarantee that the full CANDLE
workflow will run without a hitch.
After you have successfully tested your model script using run_workflow=0
on an interactive node, then in order to run the full CANDLE
workflow, simply exit the interactive session (by typing exit
),
set the run_workflow
keyword in the input file
to 1
, and then submit the full CANDLE job using
the same exact command: candle submit-job
<INPUT-FILE>
.
You will know you have successfully submitted the CANDLE job
to SLURM if, after all the text that is output to the
screen, you see the line
Input file submitted successfully
before the command prompt is returned to you. At this point,
the CANDLE job has been submitted to Biowulf's SLURM scheduler
just like any other Biowulf job, whose progress you can
monitor, e.g., by running squeue -u
<YOUR-BIOWULF-USERNAME>
.
Once your job has completed running, you can check the
results by entering the last-candle-job
directory (a symbolic link) and ensuring the output in the
file output.txt
looks reasonable; namely, it
should end with something like
EXIT CODE: 0
COMPLETE: 2021-01-14 18:56:51
Then, enter the run
subdirectory, where the
results of your model script run using each hyperparameter set
will lie. In particular, you can check the output of the model
run using each hyperparameter set by observing the subprocess_out_and_err.txt
files using less */subprocess_out_and_err.txt
.
(These files contain the model's raw output, i.e., what you'd
expect to be printed to the terminal if you ran the model
completely outside of CANDLE.) Further, for each
hyperparameter set using which your model ran successfully, a
file called result.txt
will also be present
containing the value specified by candle_value_to_return
(or history
).
See the following section for an
automated way of observing the results of your HPO job.
Note: Every time you submit a CANDLE job using candle
submit-job <INPUT-FILE>
, a new subdirectory in
candle_generated_files/experiments
will be
created with the automatically generated name of the job
(e.g., X000
, X001
, X002
,
etc.). Further, the symbolic link last-candle-job
will always be updated to point to the most recently run
CANDLE job in this experiments
directory.
Tip: If your CANDLE job dies, looking inside the subprocess_out_and_err.txt
files will generally indicate why.
In order to collect the
values of all the hyperparameter settings as well as the
resulting metric for each set, run the aggregate-results
command to candle
:
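candle aggregate-results <EXP-DIR> [<RESULT-FORMAT>]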
where <EXP-DIR>
is the experiment
directory (or the symbolic link last-candle-job
),
i.e., that containing the run
directory, and <RESULT-FORMAT> is an optional printf()-style format string specifying the output format for the metric. For
example, if the r
template/example (an older
version of it) were run inside the /data/$USER/candle
directory, then running aggregate-results on the resulting experiment directory would produce a file called candle_results.csv
in the candle_generated_files
directory
containing the data from all the jobs, sorted by increasing
metric value, e.g.,
000.796,hpset_00001,hpset_00001,0.200000,0.80,2,5
000.796,hpset_00004,hpset_00004,0.200000,0.80,5,5
000.837,hpset_00002,hpset_00002,0.200000,0.80,3,5
000.878,hpset_00003,hpset_00003,0.200000,0.80,4,5
000.905,hpset_00007,hpset_00007,0.200000,0.80,8,5
000.964,hpset_00005,hpset_00005,0.200000,0.80,6,5
000.964,hpset_00006,hpset_00006,0.200000,0.80,7,5
001.000,hpset_00008,hpset_00008,0.200000,0.80,9,5
This file can be further processed using Excel or any other
method in order to study the results of the HPO.
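For example, a minimal sketch using Python and pandas (assuming the command was run from the working directory containing candle_generated_files):
import pandas as pd

# The CSV file is already sorted by increasing metric value, so the best
# (lowest-metric) hyperparameter sets appear first.
df = pd.read_csv("candle_generated_files/candle_results.csv")
print(df.head())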
Note: Since the field names (the first line of the CSV file) are extracted from just one of the hyperparameter sets, if the
hyperparameters that are run are not the same for every set of
hyperparameters run using the grid
workflow,
then the results of the aggregate-results
command will not make sense. For example, running this command
on the results of the grid
template/example will
produce
000.064,hpset_01,hpset_01,15,tanh
000.066,hpset_07,hpset_07,10,512
000.074,hpset_06,hpset_06,10,256
000.080,hpset_02,hpset_02,30,tanh
000.081,hpset_05,hpset_05,10,128
000.098,hpset_03,hpset_03,15,relu
000.121,hpset_04,hpset_04,30,relu
Of course, if the generate-grid
command to candle
were used to generate the sets of hyperparameters to run (and
the resulting sets were not further modified by hand), then
the results of the aggregate-results
command to
candle
should always make sense.
As usual, the full pathname must be used for <EXP-DIR>
.
Tip: As usual, use $(pwd)
to
automatically include the full path in front of a relative
file, e.g., candle aggregate-results
$(pwd)/last-candle-job
.
candle Commands
As long as the candle module is loaded (module load candle), the available commands to the candle program (in the format candle <COMMAND> <COMMAND-ARG-1> <COMMAND-ARG-2> ...) are as follows:
- candle import-template <grid|bayesian|r|bash>: Copy a CANDLE template to the current directory
- candle generate-grid <PYTHON-LIST-1> <PYTHON-LIST-2> ...: Generate a hyperparameter grid for the grid search workflow
- candle submit-job <INPUT-FILE>: Submit a CANDLE job
- candle aggregate-results <EXP-DIR> [<RESULT-FORMAT>]: Create a CSV file called candle_results.csv containing the hyperparameters and corresponding performance metrics
Tip: Leaving <COMMAND>
blank or
setting it to help
will display this usage menu.
If you've successfully used CANDLE to advance your work and you're willing to tell us about it, please email the SDSI team to tell us what you've done! We'd love to learn how users are using CANDLE to address their needs so that we can continue to improve CANDLE and its implementation on Biowulf.
Further, if you're willing to have your work promoted online,
please include an exemplary graphic of your work, and upon
review we'll post it here as an exemplar CANDLE success
story. More exposure for you, more exposure for us!
Or, if you've unsuccessfully used CANDLE to advance your work, we'd love to help you out; please let us know what didn't work for you!
Feel free to email the SDSI team with any
questions, comments, or suggestions.