Biowulf High Performance Computing at the NIH
flye on Biowulf
Flye is a de novo assembler for long and noisy reads, such as those produced by PacBio and Oxford Nanopore Technologies. The algorithm uses an A-Bruijn graph to find the overlaps between reads and does not require them to be error-corrected. After the initial assembly, Flye performs an extra repeat classification and analysis step to improve the structural accuracy of the resulting sequence. The package also includes a polisher module, which produces the final assembly of high nucleotide-level quality.

Flye provides an abruijn script for backwards compatibility.

A 5Mb bacterial genome with ~80x coverage was assembled on one of our compute nodes (6GB memory; 16 CPUs) in about 30min. A ~150 Mb D. melanogaster genome was assembled in 13h (100GB memory; 32 CPUs).

Flye is the successor of ABruijn.


Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:


[user@biowulf]$ sinteractive --mem=6g --cpus-per-task=16 --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load flye
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ zcat $ABRUIJN_TEST_DATA/SRR1284073_gt10k.fasta.gz > SRR1284073_gt10k.fasta
[user@cn3144 ~]$ flye -t $SLURM_CPUS_PER_TASK --pacbio-raw SRR1284073_gt10k.fasta \
                     -o assembly_ecoli --genome-size 5m
[2018-04-03 16:08:46] INFO: Running Flye 2.3.3-g0fc9012
[2018-04-03 16:08:46] INFO: Assembling reads
[2018-04-03 16:08:46] INFO: Running with k-mer size: 15
[2018-04-03 16:08:46] INFO: Reading sequences
[2018-04-03 16:08:53] INFO: Reads N50/90: 17480 / 11579
[2018-04-03 16:08:53] INFO: Selected minimum overlap 5000
[2018-04-03 16:08:53] INFO: Expected read coverage: 80
[...snip...]
[2018-04-03 16:31:44] INFO: Correcting bubbles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2018-04-03 16:36:05] INFO: Assembly statistics:

        Total length:   4636855
        Contigs:        1
        Scaffolds:      1
        Scaffolds N50:  4636855
        Largest scf:    4636855
        Mean coverage:  52

[2018-04-03 16:36:05] INFO: Final assembly: /lscratch/46116226/assembly_ecoli/scaffolds.fasta

[user@cn3144 ~]$ ll assembly_ecoli
total 9.1M
drwxrwxr-x 2 user group 4.0K Apr  3 12:13 0-assembly
drwxrwxr-x 2 user group 4.0K Apr  3 12:21 1-consensus
drwxrwxr-x 2 user group 4.0K Apr  3 12:23 2-repeat
drwxrwxr-x 2 user group 4.0K Apr  3 12:36 3-polishing
-rw-rw-r-- 1 user group  193 Apr  3 12:23 assembly_graph.dot
-rw-rw-r-- 1 user group   79 Apr  3 12:36 assembly_info.txt
-rw-rw-r-- 1 user group 4.5M Apr  3 12:36 contigs.fasta
-rw-rw-r-- 1 user group  21K Apr  3 12:36 flye.log
-rw-rw-r-- 1 user group   26 Apr  3 12:36 flye.save
-rw-rw-r-- 1 user group 4.5M Apr  3 12:36 scaffolds.fasta


[user@cn3144 ~]$ # copy back to data
[user@cn3144 ~]$ cp -r assembly_ecoli /data/$USER/badbadproject
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
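
As a quick sanity check on the result, the sequence count and total length reported in the assembly statistics above can be recomputed from the FASTA file itself. The following is a minimal sketch; fasta_stats is a hypothetical helper written for this page, not part of Flye, and it assumes a plain (uncompressed) FASTA file.

```shell
# Hypothetical helper (not part of Flye): print the number of sequences
# and the total sequence length of an uncompressed FASTA file.
fasta_stats() {
    awk '
        /^>/ { n++; next }                             # header line: count one sequence
        { gsub(/[ \t\r]/, ""); total += length($0) }   # sequence line: add its length
        END { printf "sequences: %d\ntotal length: %d\n", n, total }
    ' "$1"
}

# Only run if the assembly directory from the session above is present.
if [ -f assembly_ecoli/scaffolds.fasta ]; then
    fasta_stats assembly_ecoli/scaffolds.fasta
fi
```

For the E. coli example the two numbers should agree with the "Scaffolds" and "Total length" lines in the Flye log.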

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. flye.sh) similar to the following:

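#!/bin/bash
# Load the flye module; abort if it is unavailable
ml flye || exit 1

# Work in the node-local scratch space allocated with --gres=lscratch
cd /lscratch/$SLURM_JOB_ID || exit 1
cp /data/users/some/where/reads.fa.gz .
flye -t $SLURM_CPUS_PER_TASK --pacbio-raw reads.fa.gz -o assembly_dmelanogaster --genome-size 150m
# Move the results back to permanent storage
mv assembly_dmelanogaster /data/$USER/badbadproject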

This particular example used data set SRX499318, filtered to reads longer than 14 kb, which gives roughly 90x coverage of the ~150Mb D. melanogaster genome.
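
Coverage here is simply the total number of read bases divided by the genome size; 90x of a 150 Mb genome corresponds to about 13.5 Gb of reads. A rough way to check this for your own data is sketched below. estimate_coverage is a hypothetical helper, not a Flye or Biowulf command. It assumes FASTA input (FASTQ would be miscounted) and GNU gzip, whose -dcf options pass uncompressed files through unchanged.

```shell
# Hypothetical helper: estimate read coverage as
#   (total bases in reads) / (genome size in bases)
# Usage: estimate_coverage READS.fa[.gz] GENOME_SIZE_IN_BASES
estimate_coverage() {
    # gzip -dcf decompresses .gz input and copies plain text as-is (GNU gzip)
    gzip -dcf "$1" | awk -v g="$2" '
        !/^>/ { bases += length($0) }      # sum the length of every sequence line
        END   { printf "%.1f\n", bases / g }
    '
}
```

For the batch example above this would be called as estimate_coverage reads.fa.gz 150000000.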

Submit this job using the Slurm sbatch command.

sbatch --mem=120g --cpus-per-task=32 --gres=lscratch:300 --time=1-00:00:00 flye.sh

This job ran for ~13h and used up to 100GB of memory. Here is the profile of memory and running threads for this assembly:

[Figure: resource usage profile]

The final result was an assembly of 137Mb with 357 contigs and a scaffold N50 of 6.34Mb.
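
N50 is the length of the shortest sequence in the smallest set of longest sequences that together contain at least half of the assembled bases. The sketch below shows one way to compute it from a FASTA file with standard tools. fasta_n50 is a hypothetical helper, not part of Flye, and it assumes an uncompressed FASTA file.

```shell
# Hypothetical helper: N50 of the sequences in an uncompressed FASTA file.
fasta_n50() {
    # Step 1: print one length per sequence (sequences may span several lines)
    awk '
        /^>/ { if (len) print len; len = 0; next }
        { len += length($0) }
        END { if (len) print len }
    ' "$1" |
    # Step 2: sort lengths from longest to shortest
    sort -rn |
    # Step 3: walk down the list until half of the total length is covered
    awk '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run * 2 >= total) { print lens[i]; exit }
            }
        }
    '
}
```

It would be called as, for example, fasta_n50 assembly_dmelanogaster/scaffolds.fasta.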