flye on Biowulf
Flye is a de novo assembler for long and noisy reads, such as those produced by PacBio and Oxford Nanopore Technologies. The algorithm uses an A-Bruijn graph to find the overlaps between reads and does not require them to be error-corrected. After the initial assembly, Flye performs an extra repeat classification and analysis step to improve the structural accuracy of the resulting sequence. The package also includes a polisher module, which produces the final assembly of high nucleotide-level quality.

Flye replaces abruijn and does provide a abruijn script for backwards compatibility.

A 5Mb bacterial genome with ~80x coverage was assembled on one of our compute nodes (6GB memory; 16 CPUs) in about 30min. A ~150 Mb D. melanogaster genome was assembled in 13h (100GB memory; 32 CPUs).

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

Note that --genome-size is optional since version 2.8


[user@biowulf]$ sinteractive --mem=14g --cpus-per-task=16 --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load flye/2.9.5
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ mkdir assembly_ecoli
[user@cn3144 ~]$ cp $FLYE_TEST_DATA/SRR1284073_gt10k.fasta.gz .
[user@cn3144 ~]$ flye -t $SLURM_CPUS_PER_TASK --pacbio-raw SRR1284073_gt10k.fasta.gz \
                     -o assembly_ecoli --genome-size 5m
[2024-09-25 20:17:21] INFO: Starting Flye 2.9.5-b1801
[2024-09-25 20:17:21] INFO: >>>STAGE: configure
[2024-09-25 20:17:21] INFO: Configuring run
[2024-09-25 20:17:23] INFO: Total read length: 424263329
[2024-09-25 20:17:23] INFO: Input genome size: 5000000
[2024-09-25 20:17:23] INFO: Estimated coverage: 84
[2024-09-25 20:17:23] INFO: Reads N50/N90: 17480 / 11579
[...snip...]
[2024-09-25 20:27:41] INFO: >>>STAGE: finalize
[2024-09-25 20:27:41] INFO: Assembly statistics:

        Total length:   4775724
        Fragments:      5
        Fragments N50:  2914006
        Largest frg:    2914006
        Scaffolds:      0
        Mean coverage:  65


[user@cn3144 ~]$ ll assembly_ecoli
total 9.4M
drwxr-xr-x 2 user group 4.0K Sep 25 20:20 00-assembly
drwxr-xr-x 2 user group 4.0K Sep 25 20:21 10-consensus
drwxr-xr-x 2 user group 4.0K Sep 25 20:23 20-repeat
drwxr-xr-x 2 user group 4.0K Sep 25 20:23 30-contigger
drwxr-xr-x 2 user group 4.0K Sep 25 20:27 40-polishing
-rw-r--r-- 1 user group 4.7M Sep 25 20:27 assembly.fasta
-rw-r--r-- 1 user group 4.6M Sep 25 20:27 assembly_graph.gfa
-rw-r--r-- 1 user group 2.6K Sep 25 20:27 assembly_graph.gv
-rw-r--r-- 1 user group  229 Sep 25 20:27 assembly_info.txt
-rw-r--r-- 1 user group  69K Sep 25 20:27 flye.log
-rw-r--r-- 1 user group   93 Sep 25 20:27 params.json


[user@cn3144 ~]$ # copy back to data
[user@cn3144 ~]$ cp -r assembly_ecoli /data/$USER/badbadproject
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. flye.sh) similar to the following:

#! /bin/bash
ml flye/2.9.5 || exit 1

cd /lscratch/$SLURM_JOB_ID
cp /data/users/some/where/reads.fa.gz .
flye -t $SLURM_CPUS_PER_TASK --pacbio-raw reads.fasta.gz -o assembly_dmelanogaster --genome-size 150m
mv assembly_dmelanogaster /data/$USER/badbadproject

This particular example made use of data set SRX499318 filtered to reads >14k length resulting in a 90x coverage of the ~150Mb D. melanogaster genome.

Submit this job using the Slurm sbatch command.

sbatch --mem=120g --cpus-per-task=32 --gres=lscratch:300 flye.batch --time=1-00:00:00

This job ran for ~13h and used up to 100GB of memory. Here is the profile of memory and running threads for this assembly:

resource usage profile

The final result was an assembly of 137Mb with 357 contigs and a scaffold N50 of 6.34Mb.