High-Performance Computing at the NIH
abruijn on Biowulf & Helix

Description

ABruijn is an assembler for long reads, such as those produced by PacBio and Oxford Nanopore Technologies sequencers. It uses an A-Bruijn graph to find overlaps between reads without error correction. It then produces a draft assembly from a subset of raw reads, which is polished into a high-quality assembly using all reads. A 5 Mb bacterial genome with ~80x coverage was assembled on one of our compute nodes (10 GB memory; 16 CPUs) in about 30 minutes. A 150 Mb D. melanogaster genome was assembled in 2 days (520 GB memory; 32 CPUs).

There may be multiple versions of abruijn available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail abruijn 

To select a module, use

module load abruijn/[version]

where [version] is the version of choice.

abruijn is a multithreaded application. Make sure to match the number of CPUs requested with the number of threads (-t).
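A minimal sketch of one way to keep the two in step inside a script (the fallback value of 4 is illustrative, for running outside a Slurm allocation):

```shell
# Use the Slurm CPU allocation as the thread count; fall back to 4
# (illustrative) when $SLURM_CPUS_PER_TASK is unset, e.g. outside a job.
threads=${SLURM_CPUS_PER_TASK:-4}
echo "abruijn -t ${threads} -p pacbio reads.fa outdir 80"
```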

Environment variables set

ABRUIJN_TEST_DATA — location of the test data used in the example below

Dependencies

Loaded automatically.

References

The test data was obtained from Kim et al.

Documentation

Interactive job on Biowulf

Allocate an interactive session with sinteractive and use the program as described below:

biowulf$ sinteractive --mem=5g --cpus-per-task=16 --gres=lscratch:10
node$ module load abruijn
node$ cd /lscratch/$SLURM_JOB_ID
node$ zcat $ABRUIJN_TEST_DATA/SRR1284073_gt10k.fasta.gz > SRR1284073_gt10k.fasta
node$ abruijn -t $SLURM_CPUS_PER_TASK -p pacbio \
    SRR1284073_gt10k.fasta assembly_ecoli 80
[21:03:38] INFO: Running ABruijn
[21:03:38] INFO: Assembling reads
[21:03:49] INFO: Counting kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:04:40] INFO: Counting kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:04:54] INFO: Building kmer index
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:05:23] INFO: Finding overlaps:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:07:07] INFO: Extending reads
[21:07:10] INFO: Assembled 1 contigs
[21:07:10] INFO: Generating contig sequences
[21:07:39] INFO: Polishing genome (1/2)
[21:07:39] INFO: Running BLASR
[21:10:53] INFO: Separating draft genome into bubbles
[21:14:51] INFO: Correcting bubbles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:27:29] INFO: Polishing genome (2/2)
[21:27:29] INFO: Running BLASR
[21:29:46] INFO: Separating draft genome into bubbles
[21:33:58] INFO: Correcting bubbles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:39:15] INFO: Done! Your assembly is in file: /lscratch/12345/assembly_ecoli
node$ ll assembly_ecoli
total 2.3G                                                      
-rw-rw-r-- 1 user group 1.1M Jan 10 21:39 abruijn.log
-rw-rw-r-- 1 user group   42 Jan 10 21:39 abruijn.save
-rw-rw-r-- 1 user group 798M Jan 10 21:10 blasr_1.m5
-rw-rw-r-- 1 user group 839M Jan 10 21:29 blasr_2.m5
-rw-rw-r-- 1 user group 268M Jan 10 21:14 bubbles_1.fasta
-rw-rw-r-- 1 user group 316M Jan 10 21:34 bubbles_2.fasta
-rw-rw-r-- 1 user group 7.2M Jan 10 21:27 consensus_1.fasta
-rw-rw-r-- 1 user group  13M Jan 10 21:39 consensus_2.fasta
-rw-rw-r-- 1 user group 4.8M Jan 10 21:07 draft_assembly.fasta
-rw-rw-r-- 1 user group 4.6M Jan 10 21:27 polished_1.fasta
-rw-rw-r-- 1 user group 4.6M Jan 10 21:39 polished_2.fasta
-rw-rw-r-- 1 user group 4.9M Jan 10 21:07 reads_order.fasta
node$ cp -r assembly_ecoli /data/user/badbadproject
node$ exit
biowulf$
Batch job on Biowulf

Create a batch script similar to the following example:

#! /bin/bash
# this file is abruijn.batch

ml abruijn || exit 1

cd /lscratch/$SLURM_JOB_ID || exit 1
cp /data/users/some/where/reads.fa .
abruijn -t $SLURM_CPUS_PER_TASK -p pacbio -k17 \
    reads.fa assembly_dmelanogaster 90
cp -r assembly_dmelanogaster /data/user/badbadproject

This particular example used data set SRX499318 filtered to reads >14 kb in length, resulting in ~90x coverage of the ~150 Mb D. melanogaster genome.
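The coverage value passed as the last argument to abruijn can be estimated as total read bases divided by the expected genome size. A sketch with toy numbers (substitute your real reads file and genome size):

```shell
# Estimate coverage = total read bases / expected genome size.
# Toy values for illustration only.
genome_size=10
r=AAAAAAAAAA                                   # one 10-base toy read chunk
printf '>r1\n%s%s%s\n>r2\n%s%s%s\n' $r $r $r $r $r $r > reads.fa
total=$(awk '!/^>/ {n += length($0)} END {print n}' reads.fa)
coverage=$(( total / genome_size ))
echo "estimated coverage: ${coverage}x"
```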

Submit to the queue with sbatch requesting sufficient memory and local scratch space:

biowulf$ sbatch --mem=600g --cpus-per-task=32 --partition=largemem --gres=lscratch:300 abruijn.batch

This job ran for ~2 days and used up to 520GB of memory. Here is the profile of memory and running threads for this assembly:

[figure: memory and running-thread usage profile for this assembly]

The final result was an assembly of 146.7Mb with 170 contigs and N50 = 5.33Mb.
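Assembly statistics such as N50 can be recomputed from the final FASTA (e.g. polished_2.fasta). A sketch using awk, with a toy three-contig FASTA generated inline for illustration:

```shell
# N50: the length of the contig at which the cumulative length of
# contigs, sorted longest first, first reaches half the total.
printf '>c1\n%s\n>c2\n%s\n>c3\n%s\n' AAAAAAAAAA AAAAA AA > contigs.fasta
n50=$(awk '/^>/ {if (len) print len; len = 0; next}
           {len += length($0)}
           END {if (len) print len}' contigs.fasta \
      | sort -rn \
      | awk '{tot += $1; l[NR] = $1}
             END {half = tot / 2
                  for (i = 1; i <= NR; i++) {
                      c += l[i]
                      if (c >= half) {print l[i]; exit}
                  }}')
echo "N50: $n50"
```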