Canu on NIH HPC Systems
Canu is a fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION). Canu will correct the reads, then trim suspicious regions (such as remaining SMRTbell adapter), then assemble the corrected and cleaned reads into unitigs.

Canu was developed by Adam Phillippy, Sergey Koren, and Brian Walenz. (Canu website)

Test data for Canu is available in
/usr/local/apps/canu/p6.25x.fastq (223 MB)

On Helix

Sample session on Helix. In the example below, maxMemory is set to 8 GB and maxThreads to 4. Since Helix is a shared interactive system, you should never use more than 4 threads or more than 50 GB of memory; Helix is therefore only suitable for very small test jobs.
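A minimal sketch of such an interactive session, using the test data above (the working directory and output prefix are illustrative):

[susanc@helix canu]$ cd /data/susanc/canu
[susanc@helix canu]$ module load canu
[susanc@helix canu]$ canu \
 -p ecoli -d ecoli-helix \
 -genomeSize=4.8m \
 -pacbio-raw /usr/local/apps/canu/p6.25x.fastq \
    useGrid=0 \
    -maxMemory=8 \
    -maxThreads=4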

Batch job on Biowulf

The following example runs the same job as described on Helix above. It does not use the grid options of Canu.

#!/bin/bash
#  this file is called canu_nogrid.sh

cd /data/susanc/canu
module load canu/1.1

# run everything within this job's own allocation (no grid submission);
# SLURM_MEM_PER_NODE is reported in MB, so convert to GB for canu's maxMemory
canu \
 -p ecoli -d ecoli-auto \
 -genomeSize=4.8m \
 -pacbio-raw p6.25x.fastq \
    useGrid=0 \
    -maxMemory=$(( SLURM_MEM_PER_NODE / 1024 - 1 )) \
    -maxThreads=$SLURM_CPUS_PER_TASK

Submit this job using the Slurm sbatch command. Note that the variable $SLURM_CPUS_PER_TASK is used within the batch script to specify the number of threads the program should spawn. This variable is set by Slurm when the job runs, and matches the value given with --cpus-per-task in the sbatch command below.

sbatch --cpus-per-task=8 --mem=10g canu_nogrid.sh
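The resource requests can also be embedded in the script itself as #SBATCH directives, so that a plain 'sbatch canu_nogrid.sh' suffices. A sketch, mirroring the command line above:

#!/bin/bash
#SBATCH --cpus-per-task=8      # exported to the job as $SLURM_CPUS_PER_TASK
#SBATCH --mem=10g              # exported as $SLURM_MEM_PER_NODE, in MB

cd /data/susanc/canu
module load canu/1.1
canu -p ecoli -d ecoli-auto -genomeSize=4.8m -pacbio-raw p6.25x.fastq \
     useGrid=0 \
     -maxMemory=$(( SLURM_MEM_PER_NODE / 1024 - 1 )) \
     -maxThreads=$SLURM_CPUS_PER_TASK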

Batch jobs using grid options of Canu

In most cases users will want to use the grid options of Canu to distribute the work. Run the Canu command on the Biowulf login node or in an interactive session, and Canu will submit the jobs appropriately. For example:

[susanc@biowulf canu]$ module load canu

[susanc@biowulf canu]$ canu -p asm -d lambda -genomeSize=50k -pacbio-raw \
>        p6.25x.fastq \
>        minReadLength=500 minOverlapLength=500 stopOnReadQuality=false \
>        usegrid=1 \
>        gridOptions="--time=30:00 --partition quick" \
>        gridOptionsJobName=lam
-- Detected Java(TM) Runtime Environment '1.8.0_11' (from 'java').
-- Detected 30 CPUs and 118 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/local/slurm/bin/sinfo.
--
-- Found  16 hosts with  12 cores and   45 GB memory under Slurm control.
-- Found 594 hosts with  32 cores and  124 GB memory under Slurm control.
-- Found  64 hosts with  32 cores and  124 GB memory under Slurm control.
-- Found  24 hosts with  32 cores and  251 GB memory under Slurm control.
-- Found 384 hosts with  24 cores and   22 GB memory under Slurm control.
-- Found 250 hosts with   8 cores and    6 GB memory under Slurm control.
-- Found 103 hosts with  32 cores and   30 GB memory under Slurm control.
-- Found  16 hosts with  16 cores and   69 GB memory under Slurm control.
-- Found  16 hosts with  16 cores and   69 GB memory under Slurm control.
-- Found 295 hosts with  16 cores and   22 GB memory under Slurm control.
-- Found   4 hosts with  64 cores and 1008 GB memory under Slurm control.
-- Found  64 hosts with  32 cores and   61 GB memory under Slurm control.
-- Found 295 hosts with  32 cores and   61 GB memory under Slurm control.
--
-- Allowed to run under grid control, and use up to   4 compute threads and    3 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    3 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   4 compute threads and    8 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    4 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    6 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and    6 GB memory for stage 'minimap (overlapper)'.
----------------------------------------
-- Starting command on Fri Mar 25 14:37:09 2016 with 114.6 GB free disk space

    sbatch \
      --mem=4g \
      --cpus-per-task=1 \
      --time=30:00 \
      --partition quick  \
      -D `pwd` \
      -J "canu_asm_lam" \
      -o /data/susanc/canu/lambda/canu-scripts/canu.01.out /data/susanc/canu/lambda/canu-scripts/canu.01.sh
16390160

-- Finished on Fri Mar 25 14:37:09 2016 (lickety-split) with 114.6 GB free disk space
----------------------------------------

At various times, the 'sjobs' command will show different Canu jobs running or pending. For example:

[susanc@biowulf canu]$  sjobs
User    JobId     JobName       Part   St  Reason      Runtime  Walltime  Nodes  CPUs  Memory    Dependency          Nodelist
================================================================
susanc  16390181  canu_asm_lam  quick  PD  Dependency     0:00     30:00      1     1  4GB/node  afterany:16390180_*
================================================================

[susanc@biowulf canu]$ sjobs
User    JobId         JobName       Part   St  Reason      Runtime  Walltime  Nodes  CPUs  Memory    Dependency          Nodelist
================================================================
susanc  16390277_[1]  meryl_asm_la  quick  PD  ---            0:00     30:00      1     4  4GB/node
susanc  16390278      canu_asm_lam  quick  PD  Dependency     0:00     30:00      1     1  4GB/node  afterany:16390277_*
================================================================
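Because the top-level canu job resubmits itself as each stage finishes, progress can be followed in the canu-scripts logs under the output directory (the path matches the sbatch command shown above); the finished assembly is ultimately written to the directory given with -d. For example:

[susanc@biowulf canu]$ ls /data/susanc/canu/lambda/canu-scripts/
[susanc@biowulf canu]$ tail -f /data/susanc/canu/lambda/canu-scripts/canu.01.out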

Documentation