flye on Biowulf

Flye is a de novo assembler for long and noisy reads, such as those produced by PacBio and Oxford Nanopore Technologies. The algorithm uses an A-Bruijn graph to find the overlaps between reads and does not require them to be error-corrected. After the initial assembly, Flye performs an extra repeat classification and analysis step to improve the structural accuracy of the resulting sequence. The package also includes a polisher module, which produces the final assembly of high nucleotide-level quality.

Flye replaces abruijn and provides an abruijn script for backwards compatibility.

A 5Mb bacterial genome with ~80x coverage was assembled on one of our compute nodes (6GB memory; 16 CPUs) in about 30min. A ~150 Mb D. melanogaster genome was assembled in 13h (100GB memory; 32 CPUs).

References:

Yu Lin, Jeffrey Yuan, Mikhail Kolmogorov, Max W. Shen, Mark Chaisson, and Pavel A. Pevzner. Assembly of long error-prone reads using de Bruijn graphs. PNAS 2016, 113(52):E8396-E8405

Documentation

GitHub

Important Notes

Module Name: flye (see the modules page for more information)
flye is a multithreaded application
Example files can be found in $FLYE_TEST_DATA

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

Note that --genome-size has been optional since version 2.8.

[user@biowulf]$ sinteractive --mem=14g --cpus-per-task=16 --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load flye/2.9.5
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ mkdir assembly_ecoli
[user@cn3144 ~]$ cp $FLYE_TEST_DATA/SRR1284073_gt10k.fasta.gz .
[user@cn3144 ~]$ flye -t $SLURM_CPUS_PER_TASK --pacbio-raw SRR1284073_gt10k.fasta.gz \
                     -o assembly_ecoli --genome-size 5m
[2024-09-25 20:17:21] INFO: Starting Flye 2.9.5-b1801
[2024-09-25 20:17:21] INFO: >>>STAGE: configure
[2024-09-25 20:17:21] INFO: Configuring run
[2024-09-25 20:17:23] INFO: Total read length: 424263329
[2024-09-25 20:17:23] INFO: Input genome size: 5000000
[2024-09-25 20:17:23] INFO: Estimated coverage: 84
[2024-09-25 20:17:23] INFO: Reads N50/N90: 17480 / 11579
[...snip...]
[2024-09-25 20:27:41] INFO: >>>STAGE: finalize
[2024-09-25 20:27:41] INFO: Assembly statistics:

        Total length:   4775724
        Fragments:      5
        Fragments N50:  2914006
        Largest frg:    2914006
        Scaffolds:      0
        Mean coverage:  65


[user@cn3144 ~]$ ll assembly_ecoli
total 9.4M
drwxr-xr-x 2 user group 4.0K Sep 25 20:20 00-assembly
drwxr-xr-x 2 user group 4.0K Sep 25 20:21 10-consensus
drwxr-xr-x 2 user group 4.0K Sep 25 20:23 20-repeat
drwxr-xr-x 2 user group 4.0K Sep 25 20:23 30-contigger
drwxr-xr-x 2 user group 4.0K Sep 25 20:27 40-polishing
-rw-r--r-- 1 user group 4.7M Sep 25 20:27 assembly.fasta
-rw-r--r-- 1 user group 4.6M Sep 25 20:27 assembly_graph.gfa
-rw-r--r-- 1 user group 2.6K Sep 25 20:27 assembly_graph.gv
-rw-r--r-- 1 user group  229 Sep 25 20:27 assembly_info.txt
-rw-r--r-- 1 user group  69K Sep 25 20:27 flye.log
-rw-r--r-- 1 user group   93 Sep 25 20:27 params.json


[user@cn3144 ~]$ # copy back to data
[user@cn3144 ~]$ cp -r assembly_ecoli /data/$USER/badbadproject
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
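
The "Estimated coverage" line in the log above can be reproduced from the two lines before it: total read length divided by the genome size. The following is a minimal sketch of that arithmetic; the function name estimate_coverage is hypothetical, and the use of integer division is an assumption made because it matches the logged value of 84 (the exact quotient is about 84.85).

```shell
# Hypothetical helper: coverage = total read bases / genome size.
# Shell arithmetic truncates toward zero, which is assumed (not confirmed)
# to match how the logged value was rounded.
estimate_coverage() {
    local total_bases=$1
    local genome_size=$2
    echo $(( total_bases / genome_size ))
}

# Values taken from the sample session log above
estimate_coverage 424263329 5000000   # prints 84
```

The same calculation is a quick way to check, before submitting a job, whether a read set is likely to give enough coverage for a genome of a given size.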

Batch job

Most jobs should be run as batch jobs.

Create a batch input file (e.g. flye.sh) similar to the following:

#!/bin/bash
ml flye/2.9.5 || exit 1

# work in node-local scratch
cd /lscratch/$SLURM_JOB_ID || exit 1
cp /data/users/some/where/reads.fa.gz .
flye -t $SLURM_CPUS_PER_TASK --pacbio-raw reads.fa.gz -o assembly_dmelanogaster --genome-size 150m
# move the results back to data
mv assembly_dmelanogaster /data/$USER/badbadproject

This particular example used data set SRX499318, filtered to reads longer than 14 kb, which gives 90x coverage of the ~150Mb D. melanogaster genome.
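
Because local scratch is removed when the job ends, it can be worth guarding the final step of the script so that results are only moved once an assembly was actually written. The following is a sketch of one way to do that, not part of the original script: the function name stage_out is hypothetical, and it only relies on Flye writing assembly.fasta into the output directory, as seen in the interactive example.

```shell
# Hypothetical helper: move an assembly directory into a destination
# directory only if it contains a non-empty assembly.fasta.
stage_out() {
    local asm_dir=$1
    local dest=$2
    if [ ! -s "$asm_dir/assembly.fasta" ]; then
        echo "no assembly.fasta in $asm_dir; nothing moved" >&2
        return 1
    fi
    # create the destination if needed, then move the whole directory into it
    mkdir -p "$dest" && mv "$asm_dir" "$dest"/
}
```

In the batch script this would stand in for the last line, for example: stage_out assembly_dmelanogaster /data/$USER/badbadproject || exit 1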

Submit this job using the Slurm sbatch command.

sbatch --mem=120g --cpus-per-task=32 --gres=lscratch:300 --time=1-00:00:00 flye.sh

This job ran for ~13h and used up to 100GB of memory. Here is the profile of memory and running threads for this assembly:

The final result was an assembly of 137Mb with 357 contigs and a scaffold N50 of 6.34Mb.
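
For readers who want to recompute N50 values like the ones reported here, the following is a minimal sketch using the usual definition: sort sequence lengths from longest to shortest and report the length at which the running total first reaches half of all assembled bases. The function names fasta_lengths and n50 are hypothetical, and Flye's own statistics may be computed slightly differently (for example, on scaffolds rather than contigs).

```shell
# Hypothetical helper: print the length of each sequence in a FASTA file,
# one per line, summing over wrapped sequence lines.
fasta_lengths() {
    awk '/^>/ { if (seen) print n; n = 0; seen = 1; next }
         { n += length($0) }
         END { if (seen) print n }' "$1"
}

# Hypothetical helper: read lengths on stdin (one per line) and print N50.
n50() {
    sort -rn | awk '{ len[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                # first length at which the running sum covers half the total
                if (cum >= half) { print len[i]; exit }
            }
        }'
}
```

Usage would be along the lines of: fasta_lengths assembly.fasta | n50. As a sanity check against the E. coli session above, the largest fragment (2914006) alone is more than half of the total length (4775724), so by this definition the N50 equals the largest fragment, as reported there.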