ABruijn is an assembler for long reads from, for example, PacBio and Oxford Nanopore Technologies sequencers. It uses an A-Bruijn graph to find overlaps between reads without error correction. It then produces a draft assembly from a subset of the raw reads, which is subsequently polished into a high-quality assembly using all reads. A 5 Mb bacterial genome with ~80x coverage was assembled on one of our compute nodes (10 GB memory; 16 CPUs) in about 30 min. A 150 Mb D. melanogaster genome was assembled in 2 days (520 GB memory; 32 CPUs).
ABruijn 2.x has been renamed to Flye.
References:
- Yu Lin, Jeffrey Yuan, Mikhail Kolmogorov, Max W. Shen, Mark Chaisson, and Pavel A. Pevzner. Assembly of long error-prone reads using de Bruijn graphs. PNAS 2016, 113(52):E8396-E8405. PubMed | PMC | Journal
- Module Name: abruijn (see the modules page for more information)
- abruijn is a multithreaded application
- Example files can be found in
$ABRUIJN_TEST_DATA
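For example, after loading the module, the test data directory can be listed to see what is available (the exact contents may change between module versions):

[user@biowulf]$ module load abruijn
[user@biowulf]$ ls -lh $ABRUIJN_TEST_DATA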
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=6g --cpus-per-task=16 --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load abruijn
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ zcat $ABRUIJN_TEST_DATA/SRR1284073_gt10k.fasta.gz > SRR1284073_gt10k.fasta
[user@cn3144 ~]$ abruijn -t $SLURM_CPUS_PER_TASK -p pacbio \
                     SRR1284073_gt10k.fasta assembly_ecoli 80
[21:03:38] INFO: Running ABruijn
[21:03:38] INFO: Assembling reads
[21:03:49] INFO: Counting kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:04:40] INFO: Counting kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:04:54] INFO: Building kmer index
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:05:23] INFO: Finding overlaps:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:07:07] INFO: Extending reads
[21:07:10] INFO: Assembled 1 contigs
[21:07:10] INFO: Generating contig sequences
[21:07:39] INFO: Polishing genome (1/2)
[21:07:39] INFO: Running BLASR
[21:10:53] INFO: Separating draft genome into bubbles
[21:14:51] INFO: Correcting bubbles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:27:29] INFO: Running BLASR
[21:29:46] INFO: Separating draft genome into bubbles
[21:33:58] INFO: Correcting bubbles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[21:39:15] INFO: Done! Your assembly is in file: /lscratch/12345/assembly_ecoli

[user@cn3144 ~]$ ll assembly_ecoli
total 2.3G
-rw-rw-r-- 1 user group 1.1M Jan 10 21:39 abruijn.log
-rw-rw-r-- 1 user group   42 Jan 10 21:39 abruijn.save
-rw-rw-r-- 1 user group 798M Jan 10 21:10 blasr_1.m5
-rw-rw-r-- 1 user group 839M Jan 10 21:29 blasr_2.m5
-rw-rw-r-- 1 user group 268M Jan 10 21:14 bubbles_1.fasta
-rw-rw-r-- 1 user group 316M Jan 10 21:34 bubbles_2.fasta
-rw-rw-r-- 1 user group 7.2M Jan 10 21:27 consensus_1.fasta
-rw-rw-r-- 1 user group  13M Jan 10 21:39 consensus_2.fasta
-rw-rw-r-- 1 user group 4.8M Jan 10 21:07 draft_assembly.fasta
-rw-rw-r-- 1 user group 4.6M Jan 10 21:27 polished_1.fasta
-rw-rw-r-- 1 user group 4.6M Jan 10 21:39 polished_2.fasta
-rw-rw-r-- 1 user group 4.9M Jan 10 21:07 reads_order.fasta

[user@cn3144 ~]$ cp -r assembly_ecoli /data/$USER/badbadproject
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
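Before copying the results out (or afterwards, on the copy in /data), a quick sanity check can be run on the assembly with standard tools. The two commands below are a sketch; they assume polished_2.fasta, the output of the second and final polishing iteration, is the sequence of interest:

[user@cn3144 ~]$ grep -c '^>' assembly_ecoli/polished_2.fasta
[user@cn3144 ~]$ awk '!/^>/ {len += length($0)} END {print len}' assembly_ecoli/polished_2.fasta

The first command counts contigs (FASTA headers); the second sums the sequence line lengths to report total assembled bases.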
Create a batch input file (e.g. abruijn.sh) similar to the following:
#!/bin/bash
ml abruijn || exit 1
cd /lscratch/$SLURM_JOB_ID
cp /data/users/some/where/reads.fa .
abruijn -t $SLURM_CPUS_PER_TASK -p pacbio -k17 \
    reads.fa assembly_dmelanogaster 90
cp -r assembly_dmelanogaster /data/user/badbadproject
This particular example made use of data set SRX499318 filtered to reads >14 kb in length, resulting in ~90x coverage of the ~150 Mb D. melanogaster genome.
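The length filtering itself is not part of abruijn. A minimal sketch of one way to keep only reads above a length cutoff from a FASTA file is shown below; the 14000 bp cutoff, input name SRX499318.fasta, and output name reads.fa are illustrative only:

awk -v min=14000 '
    /^>/ { if (seq != "" && length(seq) >= min) print hdr "\n" seq
           hdr = $0; seq = ""; next }
         { seq = seq $0 }
    END  { if (seq != "" && length(seq) >= min) print hdr "\n" seq }
' SRX499318.fasta > reads.fa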
Submit this job using the Slurm sbatch command.
sbatch --mem=600g --cpus-per-task=32 --partition=largemem --gres=lscratch:300 --time=3-00:00:00 abruijn.sh
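Memory use of the running job can be spot-checked with Slurm's accounting tools, for example (where <jobid> is a placeholder for the job ID reported by sbatch):

[user@biowulf]$ sstat --format=JobID,MaxRSS,AveCPU -j <jobid>.batch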
This job ran for ~2 days and used up to 520 GB of memory. Here is the profile of memory and running threads for this assembly:

The final result was an assembly of 146.7 Mb with 170 contigs and an N50 of 5.33 Mb.