abruijn on Biowulf

ABruijn is an assembler for long reads from, for example, PacBio and Oxford Nanopore Technologies sequencers. It uses an A-Bruijn graph to find overlaps between reads without error correction. It then produces a draft assembly from a subset of the raw reads, which is polished into a high-quality assembly using all reads. A 5Mb bacterial genome with ~80x coverage was assembled on one of our compute nodes (10GB memory; 16 CPUs) in about 30 minutes. A 150Mb D. melanogaster genome was assembled in 2 days (520GB memory; 32 CPUs).

Note that ABruijn 2.x has been renamed to Flye.
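The general form of the command line, as used in the examples below (see abruijn --help for the authoritative option list), is

abruijn [-t THREADS] [-p PLATFORM] [-k KMER_SIZE] reads out_dir coverage

where reads is a file of raw long reads, out_dir is the directory the assembly is written to, and coverage is the approximate genome coverage of the read set.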


Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:


[user@biowulf]$ sinteractive --mem=6g --cpus-per-task=16 --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load abruijn
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ zcat $ABRUIJN_TEST_DATA/SRR1284073_gt10k.fasta.gz > SRR1284073_gt10k.fasta
[user@cn3144 ~]$ abruijn -t $SLURM_CPUS_PER_TASK -p pacbio \
    SRR1284073_gt10k.fasta assembly_ecoli 80
[21:03:38] INFO: Running ABruijn
[21:03:38] INFO: Assembling reads
[21:03:49] INFO: Counting kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:04:40] INFO: Counting kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:04:54] INFO: Building kmer index
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:05:23] INFO: Finding overlaps:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:07:07] INFO: Extending reads
[21:07:10] INFO: Assembled 1 contigs
[21:07:10] INFO: Generating contig sequences
[21:07:39] INFO: Polishing genome (1/2)
[21:07:39] INFO: Running BLASR
[21:10:53] INFO: Separating draft genome into bubbles
[21:14:51] INFO: Correcting bubbles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:27:29] INFO: Running BLASR
[21:29:46] INFO: Separating draft genome into bubbles
[21:33:58] INFO: Correcting bubbles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:39:15] INFO: Done! Your assembly is in file: /lscratch/46116226/assembly_ecoli

[user@cn3144 ~]$ ll assembly_ecoli
total 2.3G                                                      
-rw-rw-r-- 1 user group 1.1M Jan 10 21:39 abruijn.log
-rw-rw-r-- 1 user group   42 Jan 10 21:39 abruijn.save
-rw-rw-r-- 1 user group 798M Jan 10 21:10 blasr_1.m5
-rw-rw-r-- 1 user group 839M Jan 10 21:29 blasr_2.m5
-rw-rw-r-- 1 user group 268M Jan 10 21:14 bubbles_1.fasta
-rw-rw-r-- 1 user group 316M Jan 10 21:34 bubbles_2.fasta
-rw-rw-r-- 1 user group 7.2M Jan 10 21:27 consensus_1.fasta
-rw-rw-r-- 1 user group  13M Jan 10 21:39 consensus_2.fasta
-rw-rw-r-- 1 user group 4.8M Jan 10 21:07 draft_assembly.fasta
-rw-rw-r-- 1 user group 4.6M Jan 10 21:27 polished_1.fasta
-rw-rw-r-- 1 user group 4.6M Jan 10 21:39 polished_2.fasta
-rw-rw-r-- 1 user group 4.9M Jan 10 21:07 reads_order.fasta

[user@cn3144 ~]$ cp -r assembly_ecoli /data/$USER/badbadproject
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
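As a quick sanity check on the results copied back to /data, you can count contigs and total bases in the final polished fasta. A minimal sketch, assuming polished_2.fasta is the final polishing round as in the listing above (the log reported a single contig, so the count should be 1):

grep -c '>' /data/$USER/badbadproject/assembly_ecoli/polished_2.fasta    # contig count
awk '!/^>/ {n += length($0)} END {print n " bases"}' \
    /data/$USER/badbadproject/assembly_ecoli/polished_2.fasta            # assembly size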

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. abruijn.sh) similar to the following:

#!/bin/bash
module load abruijn || exit 1

cd /lscratch/$SLURM_JOB_ID || exit 1
# stage the input reads on node-local scratch
cp /data/$USER/some/where/reads.fa .
abruijn -t $SLURM_CPUS_PER_TASK -p pacbio -k 17 \
    reads.fa assembly_dmelanogaster 90
# copy results back to shared storage before the allocation ends
cp -r assembly_dmelanogaster /data/$USER/badbadproject

This particular example used data set SRX499318 filtered to reads >14kb in length, resulting in ~90x coverage of the ~150Mb D. melanogaster genome.
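The length filtering and the coverage argument are not produced by abruijn itself. A minimal sketch of how they could be derived, assuming the seqkit module is available and SRX499318.fasta (a hypothetical file name) holds the raw reads:

module load seqkit
seqkit seq -m 14000 SRX499318.fasta > reads.fa    # keep only reads >= 14kb
# coverage argument ~= total filtered bases / genome size (~150Mb)
awk '!/^>/ {n += length($0)} END {printf "%.0fx\n", n / 150e6}' reads.fa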

Submit this job using the Slurm sbatch command. Note that sbatch options must precede the script name; anything after the script name is passed as an argument to the script itself.

sbatch --mem=600g --cpus-per-task=32 --partition=largemem --gres=lscratch:300 --time=3-00:00:00 abruijn.sh

This job ran for ~2 days and used up to 520GB of memory. Here is the profile of memory and running threads for this assembly:

[Figure: resource usage profile - memory and running threads over the course of the assembly]
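Summary numbers for a finished job can also be retrieved after the fact with standard Slurm accounting; sacct and the field names below are standard Slurm, and the jobid is whatever sbatch returned:

sacct -j <jobid> --format=JobID,Elapsed,AllocCPUS,MaxRSS,State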

The final result was a 146.7Mb assembly in 170 contigs with an N50 of 5.33Mb.
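An N50 like the one quoted above can be computed from the final fasta with nothing more than awk and sort. A minimal sketch, assuming polished_2.fasta is the final output as in the interactive example (contig_lengths.txt is a temporary file introduced here for illustration):

# one length per contig, largest first
awk '/^>/ {if (len) print len; len = 0; next} {len += length($0)} END {if (len) print len}' \
    assembly_dmelanogaster/polished_2.fasta | sort -rn > contig_lengths.txt
# N50 = length of the contig at which the running sum first reaches half the total
total=$(awk '{s += $1} END {print s}' contig_lengths.txt)
awk -v half="$((total / 2))" '{s += $1; if (s >= half) {print "N50 =", $1; exit}}' contig_lengths.txt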