RELION

RELION (for REgularised LIkelihood OptimisatioN, pronounce rely-on) is a stand-alone computer program that employs an empirical Bayesian approach to refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM).

There are multiple versions of RELION available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail RELION

To select a module, type

module load RELION/[ver]

where [ver] is the version of choice.
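
For example, to load a specific version (assuming 2.0.6 appears in the module avail output) and confirm that it is loaded:

module load RELION/2.0.6
module list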

Interactive session on Biowulf

Interactive use of RELION requires a graphical X11 connection. NX is known to have problems. FastX works well for Windows users, while XQuartz works well for Mac users.

First start an interactive session on the Biowulf cluster. For example, the following allocates 12 CPUs, 1 GPU, 60 GB of memory, 200 GB of local scratch space, and 16 hours of walltime:

sinteractive --constraint=gpuk80 --cpus-per-task=12 --mem=60g --gres=lscratch:200,gpu:k80:1 --time=16:00:00

Then load the RELION module and start up the GUI:

module load RELION
relion

This should start the main GUI window:

[Screenshot: RELION main GUI window]

Running local jobs from the GUI

Jobs that are suitable for running on the interactive host can be run directly from the GUI, for example CTF estimation:

[Screenshot: defining and running a job directly from the GUI]

Once the job parameters are defined, just click 'Run now!'.

Submitting batch jobs from the GUI

Jobs that should run on other host(s) can be submitted to the batch system by choosing the appropriate parameters.

[Screenshot: submitting a batch job from the GUI]

Batch job on Biowulf

Create a batch input file (e.g. RELION.sh) and submit via sbatch. For example:

#!/bin/bash

#SBATCH --job-name=class3D_4xK80
#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=240g
#SBATCH --gres=lscratch:800,gpu:k80:4
#SBATCH --time=24:00:00
#SBATCH --partition=gpu
#SBATCH --qos=staff

module load RELION/2.0.6

mkdir output
ln -s /fdb/app_testdata/cryoEM/plasmodium_ribosome/Particles .
ln -s /fdb/app_testdata/cryoEM/plasmodium_ribosome/emd_2660.map .
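# stage temporary files in node-local scratch space allocated to this job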
mkdir /lscratch/${SLURM_JOB_ID}/tmpdir
export TMPDIR=/lscratch/${SLURM_JOB_ID}/tmpdir
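# restrict Open MPI to the self, shared-memory, and TCP transports, with TCP limited to the specified subnet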
export OMPI_MCA_btl="self,sm,tcp"
export OMPI_MCA_btl_tcp_if_include="10.2.0.0/18"
srun --mpi=pmi2 relion_refine_mpi \
  --o output/run \
  --i Particles/shiny_2sets.star \
  --ref emd_2660.map:mrc \
  --firstiter_cc \
  --ini_high 60 \
  --dont_combine_weights_via_disc \
  --scratch_dir /lscratch/${SLURM_JOB_ID} \
  --pool 100 \
  --ctf \
  --ctf_corrected_ref \
  --iter 10 \
  --tau2_fudge 4 \
  --particle_diameter 360 \
  --K 6 \
  --flatten_solvent \
  --zero_mask \
  --oversampling 1 \
  --healpix_order 2 \
  --offset_range 5 \
  --offset_step 2 \
  --sym C1 \
  --norm \
  --scale  \
  --j 1 \
  --random_seed 0 \
  --gpu

Submit this job using the Slurm sbatch command.

sbatch RELION.sh
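
Progress can then be followed with standard Slurm commands; a minimal sketch (replace 12345678 with the job ID reported by sbatch):

squeue -u $USER                                         # state of your queued and running jobs
sacct -j 12345678 --format=JobID,JobName,State,Elapsed  # accounting summary once the job finishes
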
Motion correction

There are three external applications used for running motion correction within RELION: MotionCor2, Motioncorr, and UnBlur.

MotionCor2

By default, RELION uses MotionCor2 from Shawn Zheng of UCSF. MotionCor2 requires GPUs to run, and several steps must be taken to ensure it runs successfully:

  • There must be at least one GPU allocated within the interactive session, or a GPU must be allocated as a batch job from the GUI.
  • Make sure that the path to MotionCor2 is correct, and the answer to 'Is this MOTIONCOR2?' is 'Yes':
  • Choose /usr/local/apps/RELION2.0/2.0.6/motioncor2_gpu.sh as the 'Standard submission script'. This will run MotionCor2 on 16 GPUs, one GPU per node.
  • Make sure that "Which GPUs to use" is blank under the 'Motioncorr' tab.
[Screenshot: MotionCor2 settings on the Motioncorr tab]

NOTE: While the number of tasks (MPI procs) is fixed in the motioncor2_gpu.sh script, the value of 'MPI procs' in the GUI must be set to a value above 1 to force RELION to use an MPI-enabled executable. Do not leave the value of 'MPI procs' at 1.
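
To check how many MPI tasks and GPUs the template requests, the submission script can be viewed directly:

cat /usr/local/apps/RELION2.0/2.0.6/motioncor2_gpu.sh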

MotionCorr

To use the other GPU-dependent motion correction software, MotionCorr from Yifan Cheng of UCSF, take the following steps:

  • Either load the motioncorr module after loading the RELION module, but prior to starting the GUI:
    module load motioncorr/2.1
    or manually change the path to the correct MotionCorr executable (/usr/local/apps/RELION2.0/motioncorr_v2.1/bin/dosefgpu_driftcorr).
  • Make sure that there is at least one GPU allocated within the interactive session, or that a GPU is allocated as a batch job from the GUI.
  • Make sure that the path is correct, and the answer to 'Is this MOTIONCOR2?' is 'No':
[Screenshot: MotionCorr settings, with 'Is this MOTIONCOR2?' set to 'No']

UnBlur

To use the non-GPU motion correction software UnBlur from Niko Grigorieff of HHMI/Janelia, click the 'Unblur' tab and set the answer to 'Use UNBLUR instead?' to 'Yes':

[Screenshot: 'Use UNBLUR instead?' set to 'Yes' on the Unblur tab]

CTF estimation

There are multiple applications and versions available for doing CTF estimation.

CTFFIND3

This is the default.

[Screenshot: CTFFIND3 settings]

CTFFIND4.0.x

This version can be made available by loading the ctffind module after loading the RELION module and prior to running the GUI:

module load ctffind/4.0.16
[Screenshot: CTFFIND 4.0 settings]

CTFFIND4.1.x

This version can be made available by loading the ctffind module after loading the RELION module and prior to running the GUI:

module load ctffind/4.1.5
[Screenshot: CTFFIND 4.1 settings]

In addition, change the answer to 'Is this a CTFFIND 4.1+ executable?' to 'Yes':

[Screenshot: 'Is this a CTFFIND 4.1+ executable?' set to 'Yes']
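
Putting this together, a typical sequence for making CTFFIND 4.1 available to the GUI might be (a sketch; adjust the version to one listed by module avail):

module load RELION
module load ctffind/4.1.5
relion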

GCTF

Under the Gctf tab, change the answer to 'Use Gctf instead of CTFFIND?' to 'Yes'. Keep in mind that GCTF requires GPUs.

[Screenshot: Gctf settings, with 'Use Gctf instead of CTFFIND?' set to 'Yes']

Sbatch template files

There are several different pre-made sbatch template files for allocating resources for a batch job. The choice depends strongly on the type of job to be run.

The template script names are listed below. The path to these scripts depends on the value of the environment variable $RELION_QSUB_TEMPLATE. Once the RELION module is loaded, this variable points to the default template script. For example, the location of the version 2.0.6 default template file is shown here:

echo $RELION_QSUB_TEMPLATE
/usr/local/apps/RELION2.0/2.0.6/single_cpu.sh

In this case, append the name of the desired script to '/usr/local/apps/RELION2.0/2.0.6/' and insert the result into the 'Standard submission script' box under the Running tab.
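
To see which template scripts are available for the loaded RELION version, list the directory that contains the default template (a minimal sketch using the $RELION_QSUB_TEMPLATE variable described above):

module load RELION
ls "$(dirname "$RELION_QSUB_TEMPLATE")"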

[Screenshot: 'Standard submission script' box on the Running tab]

Alternatively, the installation directory for the given RELION version can be browsed by clicking the 'Browse' button:

[Screenshot: browsing the RELION installation directory for template scripts]

Single CPU, non-MPI

Script name: single_cpu.sh

This template is appropriate for single transformation steps that do not parallelize, such as:

  • Mask creation
  • Join star files
  • Particle subtraction
  • Post-processing

Multi-CPU, MPI

Script name: multi_cpu.sh

This template is appropriate for any parallelizable non-GPU job, such as:

  • Motion correction with UnBlur
  • CTF estimation without Gctf
  • Particle extraction
  • Particle sorting
  • Movie refinement
  • Particle polishing

Single GPU, non-MPI

Script name: single_gpu.sh

This template is appropriate for any non-parallelizable GPU-dependent step, such as:

  • Motion correction using MotionCor2 or Motioncorr
  • CTF estimation using Gctf

Multi-CPU, Multi-GPU (K80), MPI

Script names:

These scripts are appropriate for the very heavy lifting steps, such as:

  • Motion correction using MotionCor2 or Motioncorr
  • CTF estimation using Gctf
  • 2D, 3D classification
  • 3D auto-refine

NOTE: While the number of tasks (MPI procs) is fixed in these scripts, the value of 'MPI procs' in the GUI must be set to a value above 1 to force RELION to use an MPI-enabled executable. Do not leave the value of 'MPI procs' at 1.

Multi-CPU, Multi-GPU (K20), MPI

Script names:

The K20 nodes have 2 GPUs with 5 GB of GDDR5 memory per GPU; their capacity is not as great as that of the K80s. These scripts can be used under the same circumstances as the K80 scripts, such as:

  • Motion correction using MotionCor2 or Motioncorr
  • CTF estimation using Gctf
  • 2D, 3D classification
  • 3D auto-refine
Local scratch space

Long-running multi-node jobs can benefit from copying input data into local scratch space. The benefits stem from both increased I/O performance and protection against disruptions due to unforeseen traffic on shared filesystems. Under the Compute tab, insert /lscratch/$SLURM_JOB_ID into the 'Copy particles to scratch directory' input:

[Screenshot: 'Copy particles to scratch directory' set to /lscratch/$SLURM_JOB_ID]

Make sure that the total size of your particles can fit within the allocated local scratch space, as set in the 'Local Scratch Disk Space' input under the Running tab.
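
To check whether the particles will fit, their total size can be compared with the allocation from within the job (a sketch; Particles/ is assumed to be the directory holding the particle images):

du -sh Particles/              # total size of the particle images
df -h /lscratch/$SLURM_JOB_ID  # space remaining in the job's local scratch allocation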

Multinode use

When running RELION across multiple nodes, keep in mind both the partition (queue) and the nodes within that partition. Several of the partitions contain subsets of node types, and running a large RELION job across different node types may be detrimental to performance. To select a specific node type, include --constraint in the extra sbatch options. For example, --constraint=x2680 would be a good choice for the multinode partition.

Please read https://hpc.nih.gov/policies/multinode.html for a discussion of making efficient use of the multinode partition.

NOTE: --constraint has been set in the GPU batch script templates. If --constraint is given in the extra sbatch options box, it will override the value set in the script.
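
For example, to pin a multinode job to x2680 nodes, the 'Queue submit command:' text box might read as follows (a sketch; as noted above, this overrides any constraint set in the template):

sbatch --constraint=x2680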

MPI tasks versus threads

In benchmarking tests, RELION MPI jobs scale about the same as the number of CPUs increases, regardless of the combination of MPI procs and threads per MPI proc. That is, a 3D classification job with 512 MPI procs and 2 threads per proc runs in about the same time as one with 128 MPI procs and 8 threads per proc; both utilize 1024 CPUs. At present, it is not clear under what circumstances it is beneficial to increase the number of threads per MPI proc beyond 8.
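
As an illustration of the two equivalent configurations mentioned above (a sketch; the GUI's 'MPI procs' corresponds to the number of MPI ranks, and the thread count to --cpus-per-task and --j):

# 512 MPI procs x 2 threads per proc = 1024 CPUs
#SBATCH --ntasks=512
#SBATCH --cpus-per-task=2

# 128 MPI procs x 8 threads per proc = 1024 CPUs
#SBATCH --ntasks=128
#SBATCH --cpus-per-task=8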

Version 2 and GPUs

Certain job-types (2D classification, 3D classification, and refinement) can benefit tremendously by using GPUs. Under the Compute tab, set 'Use GPU acceleration?' to 'Yes':

[Screenshot: 'Use GPU acceleration?' set to 'Yes' on the Compute tab]

If you do not understand how MPI procs and threads are distributed across nodes, leave 'Which GPUs to use' blank. See below for guidance.

X11 display

The RELION GUI requires an X11 server to display, as well as X11 forwarding. Unlike other applications, RELION does not work well with NX. Instead, we recommend using either FastX (Windows) or XQuartz (Mac) as the X11 server.

Running on the login node

It is possible, and in some cases desirable, to run the GUI on the Biowulf login node. This is acceptable if the RELION jobs are all run via the batch system. DO NOT run interactive jobs locally on the login node.

Extra sbatch options

Additional sbatch options can be placed after 'sbatch' in the 'Queue submit command:' text box.

[Screenshot: additional sbatch options in the 'Queue submit command:' box]

DO NOT INSERT THE FOLLOWING OPTIONS, as they will interfere with the submission templates:
   --ntasks   --ntasks-per-node   --nodes   --partition   --cpus-per-task
   --mem   --mem-per-cpu   --error   --output   --time   --gres   --constraint

RELION allows additional options to be added to the command line as well:

[Screenshot: additional RELION command-line options]

Pre-reading particles into memory

Under certain circumstances, for example when the total size of the input particles is small (less than 10 GB), pre-reading the particles into memory can improve performance. Under the Compute tab, change 'Pre-read all particles into RAM?' to 'Yes':

[Screenshot: 'Pre-read all particles into RAM?' set to 'Yes']

Benchmarks

2D- and 3D-classification jobs were run with RELION v2.0.2, and the average time per iteration was calculated and plotted versus either the number of CPUs or GPUs.

2D Classification using CPUs only

  • The number of MPI ranks (--ntasks) was increased from 8 to 1536
  • The number of threads (--cpus-per-task) was fixed at 2
  • The --j value matched --cpus-per-task
  • Input images were copied to and read from local scratch
  • Time values for --ntasks < 8 were estimated due to walltime limits
  • Only ibfdr nodes in the multinode partition were used to maintain network homogeneity
#!/bin/bash
#SBATCH --ntasks=NNN
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=12g
#SBATCH --gres=lscratch:100
#SBATCH --partition=multinode
#SBATCH --constraint=ibfdr

module load RELION/2.0.2

mkdir /lscratch/$SLURM_JOB_ID/TMPDIR
export TMPDIR=/lscratch/$SLURM_JOB_ID/TMPDIR

export OMPI_MCA_btl="self,sm,tcp"
export OMPI_MCA_btl_tcp_if_include="10.1.0.0/16,10.2.0.0/18"

srun --mpi=pmi2 `which relion_refine_mpi` --o Class2D/job001/run --i Particles/shiny_2sets.star \
  --dont_combine_weights_via_disc --scratch_dir /lscratch/$SLURM_JOB_ID --pool 100 --ctf --iter 4 \
  --tau2_fudge 2 --particle_diameter 360 --K 200 --flatten_solvent --zero_mask --oversampling 1 \
  --psi_step 6 --offset_range 5 --offset_step 2 --norm --scale --j 2 --random_seed 0
  

Takeaways:

  • Efficiency drops below 60% at 512 CPUs
[Plot: 2D classification, average time per iteration vs. number of CPUs]

2D Classification using GPUs and CPUs

NOTE: Because the master MPI rank does not perform any computation, the first GPU allocated is wasted. Thus, the minimum number of --ntasks for GPU jobs should be 2.

  • The number of MPI ranks (--ntasks) was increased from 2 to 33
  • The number of threads (--cpus-per-task) was fixed at 1
  • The --j value matched --cpus-per-task
  • Input images were copied to and read from local scratch
  • 4xGPU NVIDIA nodes were used on the ccrgpu partition
#!/bin/bash
#SBATCH --ntasks=NNN
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=120g
#SBATCH --gres=lscratch:100,gpu:k80:4
#SBATCH --partition=ccrgpu

module load RELION/2.0.2

mkdir /lscratch/$SLURM_JOB_ID/TMPDIR
export TMPDIR=/lscratch/$SLURM_JOB_ID/TMPDIR

export OMPI_MCA_btl="self,sm,tcp"
export OMPI_MCA_btl_tcp_if_include="10.1.0.0/16,10.2.0.0/18"

srun --mpi=pmi2 `which relion_refine_mpi` --o Class2D/job001/run --i Particles/shiny_2sets.star \
  --dont_combine_weights_via_disc --scratch_dir /lscratch/$SLURM_JOB_ID --pool 100 --ctf --iter 5 \
  --tau2_fudge 2 --particle_diameter 360 --K 200 --flatten_solvent --zero_mask --oversampling 1 \
  --psi_step 6 --offset_range 5 --offset_step 2 --norm --scale --j 1 --random_seed 0 --gpu
  

Takeaways:

  • Efficiency does not drop within the range of GPUs utilized
  • 8 GPUs are equivalent to 1024 CPUs
[Plot: 2D classification, average time per iteration vs. number of GPUs]

3D Classification using CPUs only

  • The number of MPI ranks (--ntasks) was increased from 8 to 512
  • The number of threads (--cpus-per-task) was fixed at 2
  • The --j value matched --cpus-per-task
  • Input images were copied to and read from local scratch
  • Time values for --ntasks < 6 were estimated due to walltime limits
  • Only ibfdr nodes in the multinode partition were used to maintain network homogeneity
#!/bin/bash
#SBATCH --ntasks=NNN
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=12g
#SBATCH --gres=lscratch:100
#SBATCH --partition=multinode
#SBATCH --constraint=ibfdr

module load RELION/2.0.2

mkdir /lscratch/$SLURM_JOB_ID/TMPDIR
export TMPDIR=/lscratch/$SLURM_JOB_ID/TMPDIR

export OMPI_MCA_btl="self,sm,tcp"
export OMPI_MCA_btl_tcp_if_include="10.1.0.0/16,10.2.0.0/18"

srun --mpi=pmi2 `which relion_refine_mpi` --o Class3D/job001/run --i Particles/shiny_2sets.star \
  --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --dont_combine_weights_via_disc \
  --scratch_dir /lscratch/$SLURM_JOB_ID --pool 100 --ctf --ctf_corrected_ref --iter 4 --tau2_fudge 4 \
  --particle_diameter 360 --K 6 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 \
  --offset_range 5 --offset_step 2 --sym C1 --norm --scale --j 2 --random_seed 0
  

Takeaways:

  • Efficiency drops below 60% between 512 and 1024 CPUs
[Plot: 3D classification, average time per iteration vs. number of CPUs]

3D Classification using GPUs and CPUs

NOTE: Because the master MPI rank does not perform any computation, the first GPU allocated is wasted. Thus, the minimum number of --ntasks for GPU jobs should be 2.

  • The number of MPI ranks (--ntasks) was increased from 2 to 33
  • The number of threads (--cpus-per-task) was fixed at 1
  • The --j value matched --cpus-per-task
  • Input images were copied to and read from local scratch
  • 4xGPU NVIDIA nodes were used on the ccrgpu partition
#!/bin/bash
#SBATCH --ntasks=NNN
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=120g
#SBATCH --gres=lscratch:100,gpu:k80:4
#SBATCH --partition=ccrgpu

module load RELION/2.0.2

mkdir /lscratch/$SLURM_JOB_ID/TMPDIR
export TMPDIR=/lscratch/$SLURM_JOB_ID/TMPDIR

export OMPI_MCA_btl="self,sm,tcp"
export OMPI_MCA_btl_tcp_if_include="10.1.0.0/16,10.2.0.0/18"

srun --mpi=pmi2 `which relion_refine_mpi` --o Class3D/job001/run --i Particles/shiny_2sets.star \
  --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --dont_combine_weights_via_disc \
  --scratch_dir /lscratch/$SLURM_JOB_ID --pool 100 --ctf --ctf_corrected_ref --iter 4 --tau2_fudge 4 \
  --particle_diameter 360 --K 6 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 \
  --offset_range 5 --offset_step 2 --sym C1 --norm --scale  --j 1 --random_seed 0 --gpu
  

Takeaways:

  • Efficiency drops below 60% beyond 8 GPUs
  • 2 GPUs are equivalent to 512 CPUs
[Plot: 3D classification, average time per iteration vs. number of GPUs]

Sample files

A few sample sets have been downloaded from https://www.ebi.ac.uk/pdbe/emdb/empiar/ for testing purposes. They are located here:

/fdb/app_testdata/cryoEM/
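
The available sample sets can be listed directly; the plasmodium_ribosome set used in the example scripts above is one of them:

ls /fdb/app_testdata/cryoEM/
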
Batch template scripts

These are the current sbatch template scripts. They are all located within /usr/local/apps/RELION/batch_template_scripts/. Please note these are for running under Slurm.

*** NOTE: While the number of tasks (MPI procs) is fixed in these scripts, the value of 'MPI procs' in the GUI must be set to a value above 1 to force RELION to use an MPI-enabled executable. Do not leave the value of 'MPI procs' at 1.
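
The scripts themselves can be listed and inspected in place, for example:

ls /usr/local/apps/RELION/batch_template_scripts/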

Documentation