Biowulf High Performance Computing at the NIH
medusa on Biowulf

A draft genome scaffolder that uses multiple reference genomes in a graph-based approach.



Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load medusa
[user@cn3144 ~]$ cp -r $MEDUSA_TEST_DATA/* .
[user@cn3144 ~]$ medusa --help
medusa --help
Medusa version 1.6
usage: java -jar medusa.jar -i inputfile -v
available options:
 -d                                    OPTIONAL PARAMETER;The option *-d*
                                       allows for the estimation of the
                                       distance between pairs of contigs
                                       based on the reference genome(s):
                                       in this case the scaffolded contigs
                                       will be separated by a number of N
                                       characters equal to this estimate.
                                       The estimated distances are also
                                       saved in the
                                       _distanceTable file.
                                       By default the scaffolded contigs
                                       are separated by 100 Ns
 -f <>                   OPTIONAL PARAMETER; The option *-f*
                                       is optional and indicates the path
                                       to the comparison drafts folder
 -gexf                                 OPTIONAL PARAMETER;Conting network
                                       and path cover are given in gexf
 -h                                    Print this help and exist.
 -i <>                   REQUIRED PARAMETER;The option *-i*
                                       indicates the name of the target
                                       genome file.
 -n50 <>                    OPTIONAL PARAMETER; The option
                                       *-n50* allows the calculation of
                                       the N50 statistic on a FASTA file.
                                       In this case the usage is the
                                       following: java -jar medusa.jar
                                       -n50 . All the
                                       other options will be ignored.
 -o <>                     OPTIONAL PARAMETER; The option *-o*
                                       indicates the name of output fasta
 -random <>            OPTIONAL PARAMETER;The option
                                       *-random* is available (not
                                       required). This option allows the
                                       user to run a given number of
                                       cleaning rounds and keep the best
                                       solution. Since the variability is
                                       small 5 rounds are usually
                                       sufficient to find the best score.
 -scriptPath <>   OPTIONAL PARAMETER; The folder
                                       containing the medusa scripts.
                                       Default value: medusa_scripts
 -v                                    RECOMMENDED PARAMETER; The option
                                       *-v* (recommended) print on console
                                       the information given by the
                                       package MUMmer. This option is
                                       strongly suggested to understand if
                                       MUMmer is not running properly.
 -w2                                   OPTIONAL PARAMETER;The option *-w2*
                                       is optional and allows for a
                                       sequence similarity based weighting
                                       scheme. Using a different weighting
                                       scheme may lead to better results.

[user@cn3144 ~]$ medusa -f reference_genomes/ -i Rhodobacter_target.fna -v 
INPUT FILE:Rhodobacter_target.fna
Running MUMmer...done.
Building the network...done.
Cleaning the network...done.
Scaffolds File saved: Rhodobacter_target.fnaScaffold.fasta
Number of scaffolds: 78 (singletons = 32, multi-contig scaffold = 46)
from 564 initial fragments.
Total length of the jointed fragments: 4224838
Computing N50 on 78 sequences.
N50: 143991.0
Summary File saved: Rhodobacter_target.fna_SUMMARY

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
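
The N50 reported at the end of the run can be cross-checked independently of medusa. The following is a minimal sketch under stated assumptions: the function name fasta_n50 is hypothetical (it is not part of medusa), the input is assumed to be a plain FASTA file whose sequences may be wrapped over several lines, and only standard awk and sort are used. N50 is taken here as the length of the sequence at which the running total, counted from the longest sequence down, first reaches half of all bases.

```shell
#!/bin/bash
# fasta_n50: hypothetical helper, not part of medusa.
# Prints the N50 of the sequences in the FASTA file given as $1.
fasta_n50() {
    local fasta="$1"
    # Stage 1 (awk):  emit one length per sequence, summing wrapped lines.
    # Stage 2 (sort): order the lengths from longest to shortest.
    # Stage 3 (awk):  accumulate lengths until half the total is covered.
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
         END { if (seen) print len }' "$fasta" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { print lens[i]; exit }
             }
         }'
}
```

After the session above, it could be called as fasta_n50 Rhodobacter_target.fnaScaffold.fasta. The function prints an integer, whereas medusa prints the value with a decimal point, and the two should only be expected to agree if medusa uses the same N50 definition.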

Batch job
Most jobs should be run as batch jobs.

Create a batch input file. For example:

#!/bin/bash
set -e
module load medusa
medusa -f reference_genomes/ -i Rhodobacter_target.fna -v
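
Because a batch job runs unattended, it can help to fail early with a clear message when an input is missing. The following is a minimal sketch, not part of medusa or the Biowulf tooling: the function name check_medusa_inputs is hypothetical, and it only assumes that the target genome is a non-empty file and that the comparison drafts folder contains at least one entry.

```shell
#!/bin/bash
# check_medusa_inputs: hypothetical helper, not part of medusa or Biowulf.
# $1 = target genome file, $2 = comparison drafts folder.
# Returns 0 when both look usable; otherwise prints a message to stderr
# and returns 1.
check_medusa_inputs() {
    local target="$1" refdir="$2"
    if [ ! -s "$target" ]; then
        echo "target genome missing or empty: $target" >&2
        return 1
    fi
    if [ ! -d "$refdir" ] || [ -z "$(ls -A "$refdir")" ]; then
        echo "reference folder missing or empty: $refdir" >&2
        return 1
    fi
    return 0
}
```

In a batch script, the function could be defined near the top and called as check_medusa_inputs Rhodobacter_target.fna reference_genomes/ before the medusa line. With set -e in effect, a non-zero return stops the script at that point.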

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=2 --mem=2g

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. medusa.swarm). For example:

cd dir1;medusa -f reference_genomes/ -i 1_target.fna -v
cd dir2;medusa -f reference_genomes/ -i 2_target.fna -v 
cd dir3;medusa -f reference_genomes/ -i 3_target.fna -v
cd dir4;medusa -f reference_genomes/ -i 4_target.fna -v
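
For more than a handful of targets, the swarmfile can be generated rather than typed. The following is a minimal sketch, not part of swarm or medusa: the function name write_medusa_swarm is hypothetical, and it assumes the layout used in the example above, where each directory holds its own reference_genomes/ folder and one or more files ending in _target.fna.

```shell
#!/bin/bash
# write_medusa_swarm: hypothetical helper, not part of swarm or medusa.
# $1 = swarmfile to write; remaining arguments = job directories.
# Writes one swarm line for every *_target.fna file found in each directory.
write_medusa_swarm() {
    local swarmfile="$1"; shift
    local dir target
    : > "$swarmfile"                      # start from an empty swarmfile
    for dir in "$@"; do
        for target in "$dir"/*_target.fna; do
            [ -e "$target" ] || continue  # skip directories without a target
            printf 'cd %s;medusa -f reference_genomes/ -i %s -v\n' \
                "$dir" "$(basename "$target")" >> "$swarmfile"
        done
    done
}
```

Called as write_medusa_swarm medusa.swarm dir1 dir2 dir3 dir4 from the directory that contains dir1 to dir4, it writes lines of the same form as the example above. The cd paths are relative, so the swarm would need to be submitted from that same directory.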

Submit this job using the swarm command.

swarm -f medusa.swarm [-g #] [-t #] --module medusa

where
-g #             Number of gigabytes of memory required for each process (1 line in the swarm command file).
-t #             Number of threads/CPUs required for each process (1 line in the swarm command file).
--module medusa  Loads the medusa module for each subjob in the swarm.