Biowulf High Performance Computing at the NIH
RELION on Biowulf

RELION (for REgularised LIkelihood OptimisatioN, pronounce rely-on) is a stand-alone computer program that employs an empirical Bayesian approach to refinement of (multiple) 3D reconstructions or 2D class averages in electron cryo-microscopy (cryo-EM).

References:

Documentation



Important Notes

RELION jobs MUST utilize local scratch in order to prevent filesystem performance degradation. If submitting through the GUI, PLEASE SET THE 'Copy particles to scratch directory' input to /lscratch/$SLURM_JOB_ID in the Compute tab:

compute tab with lscratch

If submitting from a batch script, local scratch can be utilized by including the option --scratch_dir /lscratch/$SLURM_JOB_ID in the command line.

NOTE: Do not include --mem in batch allocations. The Slurm batch system cannot accept both --mem-per-cpu and --mem in submissions. RELION is best run with --mem-per-cpu only.
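
For example, a minimal sketch of an allocation that follows this rule (values are illustrative):

sbatch --ntasks=16 --mem-per-cpu=4g RELION.sh     # correct: per-CPU memory only
# sbatch --ntasks=16 --mem=64g RELION.sh          # avoid: --mem conflicts with the --mem-per-cpu set by RELION's template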

Dependencies

GUI Interactive jobs

Interactive use of RELION via the GUI requires a graphical X11 connection. NX works well, while XQuartz sometimes works for Mac users.

Start an interactive session on the Biowulf cluster. For example, this allocates 16 CPUs, 32GB of memory, 200GB of local scratch space, and 16 hours of time (--no-gres-shell is required to allow job steps to access general resources like /lscratch and GPUs):

sinteractive --cpus-per-task=16 --mem-per-cpu=2g --gres=lscratch:200 --time=16:00:00 --no-gres-shell

Load the RELION module and start the GUI:

[user@cn1234 ~]$ cd /path/to/your/RELION/project/
[user@cn1234 project]$ module load RELION
[user@cn1234 project]$ relion

This should start the main GUI window:

main

Jobs that are suitable for running on the interactive host can be run directly from the GUI. For example, running CTF estimation:

direct_run

Once the job parameters are defined, just click 'Run now!'.

If the RELION process being run on the local host of an interactive session is MPI-enabled, the number of MPI procs set in the GUI must match the number of tasks allocated to the job.

By default, an interactive session allocates a single task. This means that by default, only a single MPI proc can be run from the GUI. To start an interactive session with the capability of handling multiple MPI procs, add --ntasks and --nodes=1 to the sinteractive command, and adjust --cpus-per-task accordingly:

sinteractive --cpus-per-task=1 --nodes=1 --ntasks=16 --mem-per-cpu=2g --gres=lscratch:200 --time=16:00:00 --no-gres-shell
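
From such a session, MPI-enabled RELION commands launched by the GUI (or by hand with srun) can then use all 16 allocated tasks. A minimal hand-run sketch, reusing the motion-correction example from later on this page:

[user@cn1234 project]$ srun --mpi=pmix relion_run_motioncorr_mpi --i import/movies.star --o output/ ...
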
GUI Batch jobs

Jobs that should be run on different host(s) can be launched from a generic interactive session and run on the batch system by choosing the appropriate parameters. The interactive session does not need elaborate resources, although it must be submitted from within a graphical X11 session on the login node.

sinteractive

Once the interactive session has started, the GUI can be launched from the project directory like so:

[user@cn1234 ~]$ cd /path/to/your/RELION/project/
[user@cn1234 project]$ module load RELION
[user@cn1234 project]$ relion

Here is a job that will allocate 512 MPI tasks, each with 8 CPUs per task, for a total of 4096 CPUs. The CPUs will have the x2695 property, meaning they will be Intel E5-2695v3 processors. Each CPU will have access to 4 GB of RAM. Each node will have 400 GB of local scratch space available to the job, and the total time allotted for the job to complete is 5 days. See the Recommended parameters section below for each job type.

batch_submit
Recommended parameters

Choosing the appropriate parameters for GUI batch jobs on HPC Biowulf can be very complicated. Below is a relatively straightforward guide to those parameters based on job type.

In all cases below the amount of walltime allocated is an estimate. Your time may vary depending mainly on the number of particles and the number of MPI procs. More particles and fewer MPI procs means more time required.

MotionCor2 with GPUs:

Motion tab:

Use RELION's own implementation? no
MOTIONCOR2 executable: /usr/local/apps/MotionCor2/1.3.0/MotionCor2
Use GPU acceleration? yes
Which GPUs to use: -- leave blank --

Running tab:

Number of MPI procs: 4, 8, 12, 16, OR 20 (multiple of 4)
Number of threads: 1
Submit to queue? yes
Queue name: gpu
Walltime: 8:00:00
Memory Per Thread: 20g
Gres: gpu:p100:4 OR gpu:k80:4 OR gpu:v100:4
SBATCH Directives: --ntasks-per-node=4

NOTE 1: Gres: can be substituted with gpu:k20x:2, but if this is done then set SBATCH Directives: to --ntasks-per-node=2.

NOTE 2: Consider using RELION's own motion correction implementation (below). On average, it takes ~10x longer to allocate 8 GPUs than it does 512 CPUs.

NOTE 3: There are other versions of MotionCor2. To use these, load the different MotionCor2 module after loading the RELION module, but before running the relion command.
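
For example (the version shown below is hypothetical; run module avail MotionCor2 to list the versions actually installed):

[user@cn1234 project]$ module load RELION
[user@cn1234 project]$ module load MotionCor2/1.4.0   # hypothetical version
[user@cn1234 project]$ relion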

MotCorRel on CPUs only:

Motion tab:

Use RELION's own implementation? yes
Use GPU acceleration? no

Running tab:

Number of MPI procs: 128-2048
Number of threads: 1
Submit to queue? yes
Queue name: multinode
Walltime: 8:00:00
Memory Per Thread: 16g

NOTE 1: RELION's own implementation can require a large amount of memory per thread. If the job fails with memory errors, you will likely need to increase the amount beyond 16g, perhaps up to 64g. Memory usage can be monitored in the dashboard.

NOTE 2: The job will likely complete in a few hours. If you want it to complete sooner, you can increase the number of MPI procs. However, the more you request, the longer the job will sit waiting for resources to become available.

CTF estimation using CTFFIND-4.1:

CTFFIND-4.1 tab:

Use CTFFIND-4.1? yes
CTFFIND-4.1 executable: /usr/local/apps/ctffind/4.1.14/ctffind

Gctf tab:

Use Gctf instead? no

Running tab:

Number of MPI procs: 128
Submit to queue? yes
Queue name: multinode
Walltime: 2:00:00
Memory Per Thread: 1g
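
For reference, a batch script equivalent to these GUI settings might look like the following sketch (RELION command elided; tasks assumed single-threaded):

#!/bin/bash
#SBATCH --ntasks=128
#SBATCH --partition=multinode
#SBATCH --cpus-per-task=1
#SBATCH --time=2:00:00
#SBATCH --mem-per-cpu=1g

srun --mpi=pmix ... RELION command here ...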

GCTF with GPU:

CTFFIND-4.1 tab:

Use CTFFIND-4.1? no

Gctf tab:

Use Gctf instead? yes
Gctf executable: /usr/local/apps/Gctf/1.06/bin/Gctf
Which GPUs to use: -- leave blank --

Running tab:

Number of MPI procs: 4, 8, 12, 16, OR 20 (multiple of 4)
Number of threads: 1
Submit to queue? yes
Queue name: gpu
Walltime: 8:00:00
Memory Per Thread: 20g
Gres: gpu:p100:4 OR gpu:k80:4 OR gpu:v100:4
SBATCH Directives: --ntasks-per-node=4

NOTE 1: Gres: can be substituted with gpu:k20x:2, but if this is done then set SBATCH Directives: to --ntasks-per-node=2.

NOTE 2: Consider using CTFFIND-4.1 on CPUs (above) instead. On average, it takes ~10x longer to allocate 8 GPUs than it does 512 CPUs.

Class2D & Class3D on GPUs:

Compute tab:

Copy particles to scratch directory: /lscratch/$SLURM_JOB_ID
Use GPU acceleration? yes
Which GPUs to use: -- leave blank --

Running tab:

Number of MPI procs: 20
Number of threads: 1
Submit to queue? yes
Queue name: gpu
Walltime: 1-00:00:00
Memory Per Thread: 20g
Gres: lscratch:400,gpu:p100:4
SBATCH Directives: --nodes=4 --ntasks-per-node=5
SBATCH Directives: --distribution=arbitrary

NOTE 1: The number of MPI procs = (ntasks-per-node X nodes). See the Understanding MPI Task Distribution section below for more details.

NOTE 2: It is critical that enough local scratch is allocated to accommodate the particle data. 400 GB is the minimum; it can be larger.

NOTE 3: While this example shows p100, other gpu nodetypes can be substituted. See here for possible substitutes.

NOTE 4: Gres: can be substituted with gpu:k20x:2, but if this is done then set SBATCH Directives: to --ntasks-per-node=3.

NOTE 5: The sbatch option --distribution=arbitrary enables Slurm to place one extra task on the first node, see here for more information.

NOTE 6: Increasing the number of threads above 1 might lower the amount of time required, but at the risk of overloading the GPUs and causing the job to stall. See here for more information.
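
Assembled into a batch script, the settings above correspond roughly to this sketch (compare the homogeneous+1 example later on this page):

#!/bin/bash
#SBATCH --ntasks=20
#SBATCH --partition=gpu
#SBATCH --cpus-per-task=1
#SBATCH --error=run.err
#SBATCH --output=run.out
#SBATCH --time=1-00:00:00
#SBATCH --mem-per-cpu=20g
#SBATCH --gres=lscratch:400,gpu:p100:4
#SBATCH --nodes=4 --ntasks-per-node=5
#SBATCH --distribution=arbitrary

source add_extra_MPI_task.sh
srun --mpi=pmix ... RELION command here ...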

Class2D & Class3D on CPUs:

Compute tab:

Copy particles to scratch directory: /lscratch/$SLURM_JOB_ID
Use GPU acceleration? no

Running tab:

Number of MPI procs: 128-2048
Number of threads: 8
Submit to queue? yes
Queue name: multinode
Walltime: 2-00:00:00
Memory Per Thread: 4g
Gres: lscratch:400

NOTE 1: The amount of memory per thread may need to be larger, depending on the particle size. Memory usage can be monitored in the dashboard.

NOTE 2: The larger the number of MPI procs, the sooner the job will complete. However, the more you request, the longer the job will sit waiting for resources to become available.

NOTE 3: In tests, increasing the number of threads per MPI proc above 8 has not been shown to significantly decrease running time.
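
The equivalent batch sketch for the CPU-only case, here with 512 MPI procs (any value in the recommended range works):

#!/bin/bash
#SBATCH --ntasks=512
#SBATCH --partition=multinode
#SBATCH --cpus-per-task=8
#SBATCH --error=run.err
#SBATCH --output=run.out
#SBATCH --time=2-00:00:00
#SBATCH --mem-per-cpu=4g
#SBATCH --gres=lscratch:400

srun --mpi=pmix ... RELION command here ...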

3D auto-refine and Bayesian polishing:

Compute tab:

Copy particles to scratch directory: /lscratch/$SLURM_JOB_ID
Use GPU acceleration? no

Running tab:

Number of MPI procs: 65
Number of threads: 16
Submit to queue? yes
Queue name: multinode
Walltime: 2-00:00:00
Memory Per Thread: 4g
Gres: lscratch:400

NOTE 1: The amount of time and memory needed are greatly reduced by increasing the number of threads per MPI proc. 16 is likely the highest workable number before complications occur.

NOTE 2: The number of MPI procs can be increased, but it must be an odd number: 3D auto-refine splits the particles into two independent half-sets, each processed by an equal number of worker tasks, plus one master task.

Sbatch template files

There is one pre-made sbatch template file, /usr/local/apps/RELION/templates/common.sh, as set by the environment variable $RELION_QSUB_TEMPLATE.

#!/bin/bash
#SBATCH --ntasks=XXXmpinodesXXX
#SBATCH --partition=XXXqueueXXX
#SBATCH --cpus-per-task=XXXthreadsXXX
#SBATCH --error=XXXerrfileXXX
#SBATCH --output=XXXoutfileXXX
#SBATCH --open-mode=append
#SBATCH --time=XXXextra1XXX
#SBATCH --mem-per-cpu=XXXextra2XXX
#SBATCH --gres=XXXextra3XXX
#SBATCH XXXextra4XXX
#SBATCH XXXextra5XXX
#SBATCH XXXextra6XXX
source add_extra_MPI_task.sh
env | sort
srun --mem-per-cpu=XXXextra2XXX --mpi=pmix XXXcommandXXX

By including SBATCH directives in the GUI, all combinations of resources are possible with the single script.

User-created template scripts can be substituted into the 'Standard submission script' box under the Running tab.

script

Alternatively, other templates can be browsed by clicking the 'Browse' button:

browse
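
A user-created template can also be made the default by pointing $RELION_QSUB_TEMPLATE at it before launching the GUI (the path below is illustrative):

[user@cn1234 project]$ export RELION_QSUB_TEMPLATE=/data/user/my_template.sh
[user@cn1234 project]$ relion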

If the option --distribution=arbitrary is set as an additional SBATCH directive, then the add_extra_MPI_task.sh script will generate a file ($SLURM_HOSTFILE) that will manually override the distribution of MPI tasks across the allocated CPUs:

#!/bin/bash
# Create SLURM_HOSTFILE, with one extra task on the head node

# Don't bother unless --distribution=arbitrary
if [[ -z $SLURM_DISTRIBUTION ]]; then
  [[ -n $SLURM_HOSTFILE ]] && export SLURM_HOSTFILE=""
  return
# If it ain't arbitrary, make sure it is block
elif [[ ! $SLURM_DISTRIBUTION =~ arbitrary ]]; then
  [[ $SLURM_DISTRIBUTION =~ cyclic ]] && export SLURM_DISTRIBUTION=block
  [[ -n $SLURM_HOSTFILE ]] && export SLURM_HOSTFILE=""
  return
fi

# Don't bother unless nodes have been allocated
if [[ -z $SLURM_JOB_NODELIST ]]; then
  [[ -n $SLURM_HOSTFILE ]] && export SLURM_HOSTFILE=""
  return
fi

# Don't bother unless multiple tasks have been allocated
if [[ -z $SLURM_NTASKS_PER_NODE ]]; then
  [[ -n $SLURM_HOSTFILE ]] && export SLURM_HOSTFILE=""
  return
elif [[ ${SLURM_NTASKS_PER_NODE} -lt 2 ]]; then
  [[ -n $SLURM_HOSTFILE ]] && export SLURM_HOSTFILE=""
  return
fi

# Build the hostfile; on multiple nodes, place one extra task on the first node
array=( $( scontrol show hostname $SLURM_JOB_NODELIST) )
file=$(mktemp --suffix .SLURM_JOB_NODELIST)

if [[ ${#array[@]} -eq 1 ]]; then
  for ((j=0;j<$((SLURM_NTASKS_PER_NODE));j++)); do
#    echo $j
    echo ${array[0]} >> $file
  done
else
  echo ${array[0]} > $file
  for ((i=0;i<${SLURM_JOB_NUM_NODES};i++)); do
    for ((j=0;j<$((SLURM_NTASKS_PER_NODE-1));j++)); do
      echo ${array[${i}]} >> $file
    done
  done
fi
export SLURM_HOSTFILE=$file
unset SLURM_NTASKS_PER_NODE
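
To check what the script produced, the generated hostfile can be inspected from within the job allocation; a quick sketch:

source add_extra_MPI_task.sh
echo $SLURM_HOSTFILE       # path to the generated file; empty unless --distribution=arbitrary
cat $SLURM_HOSTFILE        # one line per MPI task, naming the node it will run on
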
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input in bold):

[user@biowulf]$ sinteractive --gres=gpu:p100:4 --ntasks=4 --nodes=1 --ntasks-per-node=4 --mem-per-cpu=4g --cpus-per-task=2 --no-gres-shell
salloc.exe: Pending job allocation 1234567890
salloc.exe: job 1234567890 queued and waiting for resources
salloc.exe: job 1234567890 has been allocated resources
salloc.exe: Granted job allocation 1234567890
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn1234 are ready for job

[user@cn1234 ~]$ module load RELION
[user@cn1234 ~]$ ln -s /fdb/app_testdata/cryoEM/RELION/tutorials/relion31_tutorial_precalculated_results/Movies .
[user@cn1234 ~]$ mkdir Import
[user@cn1234 ~]$ relion_import  --do_movies  --optics_group_name "opticsGroup1" --angpix 0.885 --kV 200 \
  --Cs 1.4 --Q0 0.1 --beamtilt_x 0 --beamtilt_y 0 --i "Movies/*.tif" --odir Import --ofile movies.star
[user@cn1234 ~]$ mkdir output
[user@cn1234 ~]$ srun --oversubscribe --mpi=pmix relion_run_motioncorr_mpi --i Import/job001/movies.star --first_frame_sum 1 \
  --last_frame_sum 0 --use_motioncor2 --motioncor2_exe ${RELION_MOTIONCOR2_EXECUTABLE} --bin_factor 1 \
  --bfactor 150 --dose_per_frame 1.277 --preexposure 0 --patch_x 5 --patch_y 5 --gainref Movies/gain.mrc \
  --gain_rot 0 --gain_flip 0 --dose_weighting --gpu '' --o output/
...
...
[user@cn1234 ~]$ exit
salloc.exe: Relinquishing job allocation 1234567890
[user@biowulf ~]$

Please note:

A RELION process driven interactively from the command line will NOT run across more than one node. Multi-node jobs must be submitted to the batch system.

Batch job

Typically RELION batch jobs are submitted from the GUI. However, for those who insist on doing it themselves, here is an example of a batch input file (e.g. RELION.sh).

#!/bin/bash

#SBATCH --ntasks=20
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=5
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=6g
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:4,lscratch:200
#SBATCH --error=run.err
#SBATCH --output=run.out
#SBATCH --time=1-00:00:00
#SBATCH --distribution=arbitrary

module load RELION
source add_extra_MPI_task.sh

mkdir output
ln -s /fdb/app_testdata/cryoEM/plasmodium_ribosome/Particles .
ln -s /fdb/app_testdata/cryoEM/plasmodium_ribosome/emd_2660.map .

srun --mpi=pmix relion_refine_mpi \
  --i Particles/shiny_2sets.star \
  --o output/run \
  --ref emd_2660.map:mrc \
  --ini_high 60 \
  --pool 100 \
  --pad 2  \
  --ctf \
  --ctf_corrected_ref \
  --iter 25 \
  --tau2_fudge 4 \
  --particle_diameter 360 \
  --K 4 \
  --flatten_solvent \
  --zero_mask \
  --oversampling 1 \
  --healpix_order 2 \
  --offset_range 5 \
  --offset_step 2 \
  --sym C1 \
  --norm \
  --scale \
  --j 1 \
  --gpu "" \
  --dont_combine_weights_via_disc \
  --scratch_dir /lscratch/${SLURM_JOB_ID}

Submit this job using the Slurm sbatch command.

sbatch RELION.sh


Running Modes

In order to understand how RELION accelerates its computation, we must understand two different concepts.

Multi-threading is when an executing process spawns multiple threads, which share a common memory space but occupy independent CPUs.

Distributed task execution is the coordination of multiple independent processes by a single "master" process via a communication protocol. MPI (Message Passing Interface) is the protocol by which these independent tasks are coordinated within RELION.

An MPI task can multi-thread, since each task is itself an independent process.

While all RELION job types can run with a single task in single-threaded mode, some can distribute their tasks via MPI. And a subset of those job types can further accelerate their computation by running those MPI tasks in multi-threaded mode.

For example, the Import job type can only run as a single, single-threaded task:

import

The CTF job type can run with multiple distributed tasks, but each is single-threaded:

CTF

The MotionCor2 job type can run with multiple distributed tasks, each of which can run multi-threaded:

MotionCor2

There are separate, distinct executables for running single-task and multi-task mode! If the "Number of MPI procs" value is left as one, then the single-task executable will be used:

 relion_run_motioncorr --i import/movies.star --o output/ ...  

If the value of "Number of MPI procs" is set to a value greater than one, then the MPI-enabled, distributed task executable will be used:

 relion_run_motioncorr_mpi --i import/movies.star --o output/ ...  

MPI-enabled executables must be launched properly to ensure proper distribution! When running in batch on the HPC cluster, the MPI-enabled executable should be launched with srun --mpi=pmix. This allows the MPI-enabled executable to discover what CPUs and nodes are available for tasks based on the Slurm environment:

 srun --mpi=pmix relion_run_motioncorr_mpi --i import/movies.star --o output/ ...  
Understanding MPI Task Distribution

RELION jobs using MPI-enabled executables can distribute their MPI tasks in one of three modes: heterogeneous, homogeneous, or homogeneous+1, each described below.

Certain job types benefit from particular distributions. Classification jobs run on GPU nodes should use homogeneous+1 distribution, while motion correction using MotionCor2 or GCTF should use homogeneous distribution. Jobs run on CPU-only nodes can use heterogeneous distribution.

The distribution mode is dictated by additional SBATCH directives set in the 'Running' tab.

Heterogeneous distribution has no special requirements and is the default.

Because the number of nodes and the distribution of MPI tasks on those nodes are not known prior to submission, it is best to set the amount of memory allocated as Memory Per Thread, or --mem-per-cpu in the batch script.

heterogeneous distribution
#!/bin/bash
#SBATCH --ntasks=257
#SBATCH --partition=multinode
#SBATCH --cpus-per-task=4
#SBATCH --error=run.err
#SBATCH --output=run.out
#SBATCH --time=1-00:00:00
#SBATCH --mem-per-cpu=8g
#SBATCH --gres=lscratch:200

srun --mpi=pmix ... RELION command here ...

Visually, this distribution would look something like this:

heterogeneous distribution model

The white boxes represent MPI tasks, the yellow dots represent CPUs allocated to the MPI tasks, and the black dots are CPUs not allocated to the job. Because no constraints are placed on where tasks can be allocated via the --ntasks-per-node option, the MPI tasks distribute themselves wherever the Slurm batch system finds room.

Homogeneous distribution requires setting both --nodes and --ntasks-per-node as additional SBATCH directives.

The number of MPI procs MUST equal --nodes times --ntasks-per-node. In this case 8 nodes, each with 4 MPI tasks per node, gives 32 MPI tasks total.

GPU-only:

In this case, because we are allocating all 4 GPUs on the gpu node with gpu:p100:4, it is probably best to allocate all the memory on the node as well, using --mem-per-cpu.

homogeneous distribution
#!/bin/bash
#SBATCH --ntasks=32
#SBATCH --partition=gpu
#SBATCH --cpus-per-task=2
#SBATCH --error=run.err
#SBATCH --output=run.out
#SBATCH --time=1-00:00:00
#SBATCH --mem-per-cpu=15g
#SBATCH --gres=lscratch:200,gpu:p100:4
#SBATCH --nodes=8 --ntasks-per-node=4

srun --mpi=pmix ... RELION command here ...

A visual representation of this distribution would be:

homogeneous distribution model

The white boxes represent MPI tasks, the yellow dots represent CPUs allocated to the MPI tasks, and the black dots are CPUs not allocated to the job. In this case, the GPU devices are represented by blue boxes, and each MPI task is explicitly mapped to a given GPU device. When running MotionCor2, only a single CPU of each MPI task actually generates load, so only one of the CPUs allocated to each MPI task is active.

Homogeneous+1 distribution requires --nodes, --ntasks-per-node, and --distribution=arbitrary as additional SBATCH directives.

GPU-only:

For homogeneous+1 distribution, the total number of MPI tasks allocated is more than will actually be used, and must equal the number of nodes times the number of tasks per node. --ntasks-per-node is set to 5 and --nodes is set to 8, so the total number of tasks is set to 40.

homogeneous+1 distribution

The batch script now contains a special source file, add_extra_MPI_task.sh, which creates the $SLURM_HOSTFILE and distributes the MPI tasks in an arbitrary fashion.

#!/bin/bash
#SBATCH --ntasks=40
#SBATCH --partition=gpu
#SBATCH --cpus-per-task=2
#SBATCH --error=run.err
#SBATCH --output=run.out
#SBATCH --time=1-00:00:00
#SBATCH --mem-per-cpu=12g 
#SBATCH --gres=lscratch:200,gpu:p100:4
#SBATCH --nodes=8 --ntasks-per-node=5
#SBATCH --distribution=arbitrary

source add_extra_MPI_task.sh
srun --mpi=pmix ... RELION command here ...

Visually, this distribution would look something like this:

homogeneous distribution model with master

The white boxes represent MPI tasks, the yellow dots represent CPUs allocated to the MPI tasks, and the black dots are CPUs not allocated to the job. Again, the GPU devices are represented by blue boxes, and each MPI task is explicitly mapped to a given GPU device. However, in this case, the master MPI process of the RELION job occupies one task slot but does little work and does not utilize a GPU device, so its CPUs are colored red.

Using GPUs

Certain job types (2D classification, 3D classification, and refinement) can benefit tremendously from using GPUs. Under the Compute tab, set 'Use GPU acceleration?' to 'Yes', and leave 'Which GPUs to use' blank:

using GPUs

The job must be configured to allocate GPUs. This can be done by setting the inputs in the 'Running' tab.

gpu_even

For homogeneous distribution, the total number of MPI tasks should equal:

<number of GPUs per node> X <number of nodes>

and the value of --ntasks-per-node should be equal to <number of GPUs per node>

For homogeneous+1 distribution (requires --distribution=arbitrary), the total number of MPI tasks should equal:

(<number of GPUs per node> + 1) X <number of nodes>

and the value of --ntasks-per-node should be equal to <number of GPUs per node> + 1

For more information about the GPUs available on the HPC/Biowulf cluster, see https://hpc.nih.gov/systems/.

Motion correction

RELION's own CPU-only implementation

By default, RELION comes supplied with a built-in motion correction tool:

MotCorRelion

This tool does not require GPUs, but can run multi-threaded.

CPU-only motion correction can require a lot of memory, minimally 8g per CPU.

Each MPI process minimally needs:

width * height * (number of frames + 2 + X) * 4 bytes per MPI task,

where X is at least the number of threads per MPI process. For example, using relion_image_handler --stats --i to display the statistics of an input .tif file,

049@EMPIAR-10204/Movies/20170630_3_00029_frameImage.tif : (x,y,z,n)= 3710 x 3838 x 1 x 1 ; avg= 1.0305 stddev= 1.01565 minval= 0 maxval= 75; angpix = 1

the MPI process would minimally require

3710 * 3838 * (49 + 2 + 1) * 4, or ~ 2.7 GB.

However, due to other factors, this can be multiplied by 4-10 times.
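
As a sanity check, the minimal estimate above can be reproduced in the shell (numbers taken from the relion_image_handler output; integer division truncates the result):

echo $(( 3710 * 3838 * (49 + 2 + 1) * 4 ))            # 2961707840 bytes
echo $(( 3710 * 3838 * (49 + 2 + 1) * 4 / 1024**3 ))  # prints 2; ~2.7 GB before truncation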

MotionCor2

There is also an external application that can be used: MotionCor2, from Shawn Zheng at UCSF. It requires GPUs to run. If running MotionCor2 within an interactive session, at least one GPU must be allocated; otherwise, GPUs must be allocated within a batch job from the GUI.

MotionCor2
CTF estimation

There are multiple applications and versions available for doing CTF estimation.

CTFFIND-4

Under the CTFFIND-4.1 tab, change the answer to 'Use CTFFIND-4.1?' to 'Yes'.

CTFFIND4.1

Gctf

Under the CTFFIND-4.1 tab, change the answer to 'Use CTFFIND-4.1?' to 'No'.

GCTF

Under the Gctf tab, change the answer to 'Use Gctf instead?' to 'Yes'. Keep in mind that GCTF requires GPUs.

GCTF
Local scratch space

Long-running multi-node jobs can benefit from copying input data into local scratch space. The benefits stem from both increased I/O performance and the prevention of disruptions due to unforeseen traffic on shared filesystems. Under the Compute tab, insert /lscratch/$SLURM_JOB_ID into the 'Copy particles to scratch directory' input:

compute tab with lscratch

Make sure that the total size of your particles can fit within the allocated local scratch space, as set in the 'Gres' input under the Running tab.

running tab with lscratch

The batch script should contain the option --scratch_dir /lscratch/$SLURM_JOB_ID.

Multinode use

When running RELION across multiple nodes, keep in mind both the partition (queue) and the nodes within that partition. Several of the partitions contain subsets of nodetypes, and a large RELION job running across different nodetypes may be detrimental to performance. To select a specific nodetype, include --constraint in the "Additional SBATCH Directives" input. For example, --constraint x2680 would be a good choice for the multinode partition.

running tab with constraint

Please read https://hpc.nih.gov/policies/multinode.html for a discussion on making efficient use of the multinode partition.

MPI tasks versus threads

In benchmarking tests, RELION classification (2D & 3D) MPI jobs scale about the same as the number of CPUs increases, regardless of the combination of MPI procs and threads per MPI proc. That is, a 3D classification job with 512 MPI procs and 2 threads per MPI proc runs about the same as one with 128 MPI procs and 8 threads per MPI proc; both utilize 1024 CPUs. However, refinement MPI jobs run dramatically faster with 16 threads per MPI proc than with 1.

X11 display

The RELION GUI requires an X11 server to display, as well as X11 Forwarding. We recommend using either NX (Windows) or XQuartz (Mac) as X11 servers.

Running on the login node

Running RELION on the login node is not allowed. Please allocate an interactive node instead.

Extra sbatch options

Additional sbatch options can be placed in the Additional SBATCH Directives: text boxes.

addl sbatch options

RELION allows additional options to be added to the command line as well:

addl RELION options
Running with --exclusive

While batch jobs running on the same node do not share CPUs, memory or local scratch space, they do share network interfaces to filesystems and other nodes (and can share GPUs by mistake). If one job on a node generates a heavy demand on these interfaces (e.g. performing lots of reads/writes to shared disk space, communicating a huge amount of packets between other nodes), then the other jobs on that node may suffer. To alleviate this, a job can be run with the --exclusive flag. It has been found that in general RELION jobs do best if run exclusively.

This can be enabled by including --exclusive in the additional SBATCH directives boxes:

addl RELION options

NOTE: --exclusive does not automatically allocate all the cpus/memory/lscratch on the node! Make sure that you designate the node resources needed, e.g.

addl RELION options

In this case, 4 nodes with 56 CPUs each are allocated. 4g of memory per CPU = 224 GB of RAM per node. To see the resources available per node, type freen at the command line.
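
For example, exclusive directives matching the allocation described above might look like this sketch (56-CPU nodes assumed; verify counts with freen):

#SBATCH --exclusive
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=56
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4g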

Pre-reading particles into memory

Under certain circumstances, for example when the total size of input particles is small, pre-reading the particles into memory can improve performance. The amount of memory required depends on the number of particles (N) and the box_size:

N * box_size * box_size * 4 / (1024 * 1024 * 1024), in GB per MPI task.

Thus, 100,000 particles with a box size of 350 pixels would need ~46 GB of RAM per MPI task. This would reasonably fit on GPU nodes (240 GB) when running 17 tasks across 4 nodes, as the first node would have 5 MPI tasks for a total of 5 * 46, or ~230 GB.
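
A quick shell check of this arithmetic for the example above:

echo $(( 100000 * 350 * 350 * 4 / 1024**3 ))   # prints 45; ~45.6 GB per MPI task before truncation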

Under the Compute tab, change 'Pre-read all particles into RAM?' to 'Yes':

pre-read into memory
Sample files

A few sample sets have been downloaded from https://www.ebi.ac.uk/pdbe/emdb/empiar/ for testing purposes. They are located here:

/fdb/app_testdata/cryoEM/
Known problems

There are several known problems with RELION.