RELION on Biowulf

RELION (for REgularised LIkelihood OptimisatioN, pronounce rely-on) is a stand-alone computer program that employs an empirical Bayesian approach to refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM).


Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input in bold):

[user@biowulf]$ sinteractive --constraint=gpuk80 --gres=gpu:k80:4 --ntasks=4 --nodes=1 --ntasks-per-node=4 --mem-per-cpu=4g --cpus-per-task=2
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4242 are ready for job

[user@cn4242 ~]$ module load RELION
[user@cn4242 ~]$ ln -s /fdb/app_testdata/cryoEM/RELION/tutorials/relion21_tutorial/betagal/Micrographs .
[user@cn4242 ~]$ mkdir -p import
[user@cn4242 ~]$ relion_star_loopheader rlnMicrographMovieName > import/movies.star
[user@cn4242 ~]$ ls Micrographs/*.mrcs >> import/movies.star
[user@cn4242 ~]$ mkdir -p output
[user@cn4242 ~]$ mpirun -n ${SLURM_NTASKS} relion_run_motioncorr_mpi --i import/movies.star --o output/ --save_movies  --first_frame_sum 1 --last_frame_sum 16 --use_motioncor2 --bin_factor 1 --motioncor2_exe $RELION_MOTIONCOR2_EXECUTABLE --bfactor 150 --angpix 3.54 --patch_x 5 --patch_y 5 --gpu "" --dose_weighting --voltage 300 --dose_per_frame 1 --preexposure 0

[user@cn4242 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Please note:

An interactive, command-line driven RELION process will NOT run across more than one node. Multi-node jobs must be submitted to the batch system.

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. RELION.sh). For example:

#!/bin/bash

#SBATCH --constraint=gpup100
#SBATCH --ntasks=20
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=5
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=6g
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:4,lscratch:200
#SBATCH --error=output/run.err
#SBATCH --output=output/run.out
#SBATCH --time=1-00:00:00
#SBATCH --distribution=arbitrary

module load RELION/2.1.0

mkdir -p output
ln -s /fdb/app_testdata/cryoEM/plasmodium_ribosome/Particles .
ln -s /fdb/app_testdata/cryoEM/plasmodium_ribosome/emd_2660.map .

source restrict_tcp_for_mpi.sh
source add_extra_MPI_task.sh

srun --mpi=pmi2 relion_refine_mpi \
  --o output/run \
  --i Particles/shiny_2sets.star \
  --ref emd_2660.map:mrc \
  --firstiter_cc \
  --ini_high 60 \
  --dont_combine_weights_via_disc \
  --scratch_dir /lscratch/${SLURM_JOB_ID} \
  --pool 100 \
  --ctf \
  --ctf_corrected_ref \
  --iter 10 \
  --tau2_fudge 4 \
  --particle_diameter 360 \
  --K 6 \
  --flatten_solvent \
  --zero_mask \
  --oversampling 1 \
  --healpix_order 2 \
  --offset_range 5 \
  --offset_step 2 \
  --sym C1 \
  --norm \
  --scale  \
  --j 1 \
  --random_seed 0 \
  --gpu

Because the batch directives write to output/run.err and output/run.out, create the output directory first, then submit the job using the Slurm sbatch command.

mkdir -p output
sbatch RELION.sh


GUI Interactive jobs

Interactive use of RELION via the GUI requires a graphical X11 connection. NX works well for Windows users, while XQuartz works well for Mac users.

Start an interactive session on the Biowulf cluster. For example, this allocates 16 CPUs, 32GB of memory, 200GB of local scratch space, and 16 hours of time:

sinteractive --cpus-per-task=16 --mem-per-cpu=2g --gres=lscratch:200 --time=16:00:00

Load the RELION module and start up the GUI:

module load RELION
relion

This should start the main GUI window:

[screenshot: main RELION GUI window]

Jobs that are suitable for running on the interactive host can be run directly from the GUI. For example, running CTF:

[screenshot: running a CTF job directly from the GUI]

Once the job parameters are defined, just click 'Run now!'.

If the RELION process running on the local host of an interactive session is MPI-enabled, the number of MPI procs set in the GUI must match the number of tasks allocated for the job.

By default, an interactive session allocates a single task. This means that by default, only a single MPI proc can be run from the GUI. To start an interactive session with the capability of handling multiple MPI procs, add --ntasks and --nodes=1 to the sinteractive command, and adjust --cpus-per-task accordingly:

sinteractive --cpus-per-task=1 --nodes=1 --ntasks=16 --mem-per-cpu=2g --gres=lscratch:200 --time=16:00:00
GUI Batch jobs

Jobs that should run on other host(s) can be run on the batch system by choosing the appropriate parameters. Here is a job that will allocate 513 MPI tasks, each with 4 CPUs per task, for a total of 2052 CPUs. The CPUs will have the x2695 property, meaning they will be Intel E5-2695v3 processors. Each CPU will have access to 10 GB of RAM. Each node will have 200 GB of local scratch space available to the job, and the total time allotted for the job to complete is 5 days.

[screenshot: Running tab configured for batch submission]
Sbatch template files

There is one pre-made sbatch template file, /usr/local/apps/RELION/templates/common.sh, as set by the environment variable $RELION_QSUB_TEMPLATE.

#!/bin/bash
#SBATCH --ntasks=XXXmpinodesXXX
#SBATCH --partition=XXXqueueXXX
#SBATCH --cpus-per-task=XXXthreadsXXX
#SBATCH --error=XXXerrfileXXX
#SBATCH --output=XXXoutfileXXX
#SBATCH --time=XXXextra1XXX
#SBATCH --mem-per-cpu=XXXextra2XXX
#SBATCH --gres=XXXextra3XXX
#SBATCH XXXextra4XXX
#SBATCH XXXextra5XXX
source restrict_tcp_for_mpi.sh
source add_extra_MPI_task.sh
srun --mpi=pmi2 XXXcommandXXX

Because additional SBATCH directives can be supplied from the GUI, all combinations of resources are possible with this single script.
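
For illustration, here is what a rendered script might look like after the GUI substitutes the placeholders (the job paths and values below are hypothetical):

#!/bin/bash
#SBATCH --ntasks=33
#SBATCH --partition=multinode
#SBATCH --cpus-per-task=4
#SBATCH --error=Class3D/job010/run.err
#SBATCH --output=Class3D/job010/run.out
#SBATCH --time=1-00:00:00
#SBATCH --mem-per-cpu=8g
#SBATCH --gres=lscratch:200
#SBATCH --constraint=x2680
#SBATCH
source restrict_tcp_for_mpi.sh
source add_extra_MPI_task.sh
srun --mpi=pmi2 relion_refine_mpi ... RELION options here ...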

User-created template scripts can be substituted into the 'Standard submission script' box under the Running tab.

[screenshot: 'Standard submission script' box under the Running tab]

Alternatively, other templates can be browsed by clicking the 'Browse' button:

[screenshot: browsing for an alternative template script]
Running Modes

To understand how RELION accelerates its computation, two different concepts must be distinguished.

Multi-threading is when an executing process spawns multiple threads, which share a common memory space but occupy independent CPUs.

Task distribution is the coordination of multiple independent processes by a single "master" process via a communication protocol. MPI, the Message Passing Interface, is the protocol by which these independent tasks are coordinated within RELION.

An MPI task can multi-thread, since each task is itself an independent process.

While all RELION job types can run as a single task in single-threaded mode, some can distribute their tasks via MPI, and a subset of those job types can further accelerate their computation by running those MPI tasks in multi-threaded mode.

For example, the Import job type can only run as a single, single-threaded task:

[screenshot: Running tab for an Import job]

The CTF job type can run with multiple distributed tasks, each of which is single-threaded:

[screenshot: Running tab for a CTF job with multiple MPI procs]

The MotionCor2 job type can run with multiple distributed tasks, each of which can run multi-threaded:

[screenshot: Running tab for a MotionCor2 job with MPI procs and threads]

There are separate, distinct executables for running single-task and multi-task mode! If the "Number of MPI procs" value is left as one, then the single-task executable will be used:

 relion_run_motioncorr --i import/movies.star --o output/ ...  

If the value of "Number of MPI procs" is set to a value greater than one, then the MPI-enabled, distributed task executable will be used:

 relion_run_motioncorr_mpi --i import/movies.star --o output/ ...  

MPI-enabled executables must be launched correctly to ensure proper task distribution! When running in batch on the HPC cluster, the MPI-enabled executable should be launched with srun --mpi=pmi2. This allows the executable to discover what CPUs and nodes are available for tasks based on the Slurm environment:

 srun --mpi=pmi2 relion_run_motioncorr_mpi --i import/movies.star --o output/ ...  

When running within an interactive session, the MPI-enabled executable should be launched with mpirun -n ${SLURM_NTASKS}:

 mpirun -n ${SLURM_NTASKS} relion_run_motioncorr_mpi --i import/movies.star --o output/ ...  
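
Putting these concepts together, the allocation must provide enough CPUs for the number of MPI tasks times the threads per task. As a sketch (--j is RELION's threads-per-proc option; the values here are illustrative):

# 8 MPI tasks, each running 4 threads, requires 8 x 4 = 32 CPUs
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
srun --mpi=pmi2 relion_refine_mpi --j 4 ... other RELION options ...
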
Understanding MPI Task Distribution

RELION jobs using MPI-enabled executables can distribute their MPI tasks in one of three modes: heterogeneous, homogeneous, or homogeneous+1.

Certain job types benefit from particular distributions. Refinements run on GPU nodes should use homogeneous+1 distribution, while motion correction using MotionCor2 or GCTF should use homogeneous distribution. Jobs run on CPU-only nodes can use heterogeneous distribution.

The distribution mode is dictated by additional SBATCH directives set in the 'Running' tab.

Heterogeneous distribution has no special requirements and is the default.

Because the number of nodes and the distribution of MPI tasks on those nodes are not known prior to submission, it is best to set the amount of memory allocated as memory per thread, i.e. --mem-per-cpu in the batch script.

[example batch script: heterogeneous distribution]
#!/bin/bash
#SBATCH --ntasks=257
#SBATCH --partition=multinode
#SBATCH --cpus-per-task=4
#SBATCH --error=run.err
#SBATCH --output=run.out
#SBATCH --time=1-00:00:00
#SBATCH --mem-per-cpu=8g
#SBATCH --gres=lscratch:200
#SBATCH
#SBATCH
source restrict_tcp_for_mpi.sh
srun --mpi=pmi2 ... RELION command here ...

Visually, this distribution would look something like this:

[diagram: heterogeneous distribution of MPI tasks across nodes]

The white boxes represent MPI tasks, the yellow dots represent CPUs allocated to the MPI tasks, and the black dots are CPUs not allocated to the job. Because no constraints (such as --ntasks-per-node) are placed on where tasks can be allocated, the MPI tasks distribute themselves wherever the Slurm batch system finds room.

Homogeneous distribution requires setting both --nodes and --ntasks-per-node.

The number of MPI procs MUST equal --nodes times --ntasks-per-node. In this case, 8 nodes with 4 MPI tasks per node gives 32 MPI tasks total.

GPU-only: the GPU devices must also be allocated, here with --gres=gpu:k80:4. Because this allocates all 4 GPUs on each gpu node, it is probably best to allocate all the memory on the node as well, using both --mem-per-cpu and --mem.

[example batch script: homogeneous distribution]
#!/bin/bash
#SBATCH --ntasks=32
#SBATCH --partition=gpu --constraint=gpuk80
#SBATCH --cpus-per-task=2
#SBATCH --error=run.err
#SBATCH --output=run.out
#SBATCH --time=1-00:00:00
#SBATCH --mem-per-cpu=30g --mem=240g
#SBATCH --gres=lscratch:200,gpu:k80:4
#SBATCH --nodes=8 --ntasks-per-node=4
#SBATCH
source restrict_tcp_for_mpi.sh
srun --mpi=pmi2 ... RELION command here ...

A visual representation of this distribution would be:

[diagram: homogeneous distribution of MPI tasks across nodes]

The white boxes represent MPI tasks, the yellow dots represent CPUs allocated to the MPI tasks, and the black dots are CPUs not allocated to the job. In this case, the GPU devices are represented by blue boxes, and each MPI task is explicitly mapped to a given GPU device. When running MotionCor2, only a single CPU of each MPI task actually generates load, so only one of the CPUs allocated to each MPI task is active.

Homogeneous+1 distribution requires setting --nodes and --ntasks-per-node, adding --distribution=arbitrary, and sourcing add_extra_MPI_task.sh before launching the RELION command.

GPU-only: as with homogeneous distribution, the GPU devices must be allocated, preferably along with all the memory on each node.

For homogeneous+1 distribution, the total number of MPI procs allocated is more than the job strictly needs, because it must equal the number of nodes times the number of tasks per node. Here --ntasks-per-node is set to 5 and --nodes is set to 8, so the total number of tasks is set to 40.

[example batch script: homogeneous+1 distribution]

The batch script now sources a special file, add_extra_MPI_task.sh, which creates the hostfile named by $SLURM_HOSTFILE and thereby distributes the MPI tasks in an arbitrary fashion.
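
As a rough, hypothetical sketch of the idea (the actual helper shipped with the module may differ), such a script could write one hostfile line per task slot, so that rank 0, the master, lands on the first node:

# Hypothetical sketch only -- the real add_extra_MPI_task.sh may differ.
# With --distribution=arbitrary, srun places task i on the node named
# on line i of the file pointed to by $SLURM_HOSTFILE.
export SLURM_HOSTFILE=/lscratch/${SLURM_JOB_ID}/hostfile
: > $SLURM_HOSTFILE
for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    # one line per task slot on this node; the first line overall
    # receives rank 0, the master
    for slot in $(seq $SLURM_NTASKS_PER_NODE); do
        echo $node >> $SLURM_HOSTFILE
    done
done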

#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --partition=gpu --constraint=gpuk80
#SBATCH --cpus-per-task=2
#SBATCH --error=run.err
#SBATCH --output=run.out
#SBATCH --time=1-00:00:00
#SBATCH --mem-per-cpu=24g --mem=240g
#SBATCH --gres=lscratch:200,gpu:k80:4
#SBATCH --nodes=8 --ntasks-per-node=5
#SBATCH --distribution=arbitrary
source restrict_tcp_for_mpi.sh
source add_extra_MPI_task.sh
srun --mpi=pmi2 ... RELION command here ...

Visually, this distribution would look something like this:

[diagram: homogeneous+1 distribution with the master MPI task]

The white boxes represent MPI tasks, the yellow dots represent CPUs allocated to the MPI tasks, and the black dots are CPUs not allocated to the job. Again, the GPU devices are represented by blue boxes, and each MPI task is explicitly mapped to a given GPU device. In this case, however, the master MPI process occupies a task slot of its own but does little work and does not utilize a GPU device, so its CPUs are colored red.

Version 2 and GPUs

Certain job types (2D classification, 3D classification, and refinement) can benefit tremendously from using GPUs. Under the Compute tab, set 'Use GPU acceleration?' to 'Yes', and leave 'Which GPUs to use' blank:

[screenshot: Compute tab with 'Use GPU acceleration?' set to 'Yes']

The job must also be configured to allocate GPUs. This can be done by setting the inputs in the 'Running' tab.

[screenshot: Running tab allocating GPUs]

For more information about the GPUs available on the HPC/Biowulf cluster, see https://hpc.nih.gov/systems/.

Motion correction

There are two external applications used for running motion correction within RELION: MotionCor2 and UnBlur.

MotionCor2

By default, RELION uses MotionCor2 from Shawn Zheng of UCSF, which requires GPUs to run. Several steps must be taken to ensure success. If running MotionCor2 within an interactive session, at least one GPU must be allocated; otherwise, GPUs must be allocated within a batch job from the GUI.
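
For example, adapting the interactive session shown earlier, a session with a single GPU might be allocated like this (the task, CPU, and memory values are illustrative and should be adjusted to your data):

sinteractive --constraint=gpuk80 --gres=gpu:k80:1 --ntasks=2 --nodes=1 --cpus-per-task=2 --mem-per-cpu=4g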

Motioncorr tab:

  • The default version of MotionCor2 is v1.0.4.
  • Make sure that the path to MotionCor2 is correct, and that the answer to 'Is this MOTIONCOR2?' is 'Yes'.
  • Make sure that 'Which GPUs to use' is blank under the 'Motioncorr' tab.
  • Set all the other parameters as required.
[screenshot: Motioncorr tab configured for MotionCor2]

UnBlur

To use the non-GPU motion correction software UnBlur from Niko Grigorieff of HHMI/Janelia, click the 'Unblur' tab and set the answer to 'Use UNBLUR instead?' to 'Yes':

[screenshot: Unblur tab with 'Use UNBLUR instead?' set to 'Yes']
CTF estimation

There are multiple applications and versions available for doing CTF estimation.

CTFFIND4.1.x

Under the CTFFIND-4.1 tab, change the answer to 'Use CTFFIND-4.1?' to 'Yes'.

[screenshot: CTFFIND-4.1 tab]

GCTF

Under the Gctf tab, change the answer to 'Use Gctf instead?' to 'Yes'. Keep in mind that GCTF requires GPUs.

[screenshot: Gctf tab]
Local scratch space

Long-running multi-node jobs can benefit from copying input data into local scratch space. The benefits stem from both increased I/O performance and the prevention of disruptions due to unforeseen traffic on shared filesystems. Under the Compute tab, insert /lscratch/$SLURM_JOB_ID into the 'Copy particles to scratch directory' input:

[screenshot: Compute tab with lscratch scratch directory]

Make sure that the total size of your particles can fit within the allocated local scratch space, as set in the 'Local Scratch Disk Space' input under the Running tab.

[screenshot: Running tab with local scratch disk space]
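
Before choosing the allocation, the total size of the particle stack can be checked from the command line, for example (Particles/ here stands for whatever directory holds your particle stacks):

[user@cn4242 ~]$ du -sh Particles/
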
Multinode use

When running RELION across multiple nodes, keep in mind both the partition (queue) and the node types within that partition. Several of the partitions contain more than one node type, and running a large RELION job across different node types may be detrimental to performance. To select a specific node type, include --constraint in the "Additional SBATCH Directives" input. For example, --constraint=x2680 would be a good choice for the multinode partition.

[screenshot: Running tab with --constraint directive]
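
Entered this way, the generated batch script will contain the corresponding directive:

#SBATCH --constraint=x2680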

Please read https://hpc.nih.gov/policies/multinode.html for a discussion of making efficient use of the multinode partition.

MPI tasks versus threads

In benchmarking tests, RELION MPI jobs scale about the same as the number of CPUs increases, regardless of the combination of MPI procs and threads per MPI proc. That is, a 3D classification job with 512 MPI procs and 2 threads per proc runs in about the same time as one with 128 MPI procs and 8 threads per proc; both utilize 1024 CPUs. At present, it is not clear under what circumstances it is beneficial to increase the number of threads per MPI proc beyond 8.
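
As a sketch, these two configurations both allocate 1024 CPUs and, per the benchmarks above, complete in about the same time (--j sets RELION's threads per MPI proc; other options omitted):

# 512 MPI procs, 2 threads each
#SBATCH --ntasks=512
#SBATCH --cpus-per-task=2
srun --mpi=pmi2 relion_refine_mpi --j 2 ...

# 128 MPI procs, 8 threads each
#SBATCH --ntasks=128
#SBATCH --cpus-per-task=8
srun --mpi=pmi2 relion_refine_mpi --j 8 ...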

X11 display

The RELION GUI requires an X11 server to display, as well as X11 forwarding. We recommend using either NX (Windows) or XQuartz (Mac) as X11 servers.

Running on the login node

Running RELION on the login node is not allowed. Please allocate an interactive node instead.

Extra sbatch options

Additional sbatch options can be placed in the 'Additional SBATCH Directives:' text boxes.

[screenshot: additional SBATCH directives boxes]

RELION allows additional options to be added to the command line as well:

[screenshot: additional RELION command-line options]
Pre-reading particles into memory

Under certain circumstances, for example when the total size of the input particles is small, pre-reading the particles into memory can improve performance. The amount of memory required depends on the number of particles (N) and the box size in pixels (box_size):

N * box_size * box_size * 4 / (1024 * 1024 * 1024), in GB per MPI task.

Thus, 100,000 particles with a box size of 350 pixels would need ~46 GB of RAM per MPI task. This would reasonably fit on GPU nodes (240 GB) when running 17 tasks across 4 nodes, as the first node would hold 5 MPI tasks for a total of 5*46, or about 230 GB.
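
The arithmetic for this example can be checked quickly with bc:

[user@biowulf ~]$ echo '100000 * 350 * 350 * 4 / 1024^3' | bc -l    # ~45.6, i.e. ~46 GB per MPI task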

Under the Compute tab, change 'Pre-read all particles into RAM?' to 'Yes':

[screenshot: Compute tab with 'Pre-read all particles into RAM?' set to 'Yes']
Sample files

A few sample sets have been downloaded from https://www.ebi.ac.uk/pdbe/emdb/empiar/ for testing purposes. They are located here:

/fdb/app_testdata/cryoEM/
Known problems

There are several known problems with RELION.