A draft genome scaffolder that uses multiple reference genomes in a graph-based approach.
Features
medusa --help
Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):
[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load medusa
[user@cn3144 ~]$ cp -r $MEDUSA_TEST_DATA/* .
[user@cn3144 ~]$ medusa --help
medusa --help
Medusa version 1.6

usage: java -jar medusa.jar -i inputfile -v
available options:
 -d                 OPTIONAL PARAMETER;The option *-d* allows for the estimation of the distance
                    between pairs of contigs based on the reference genome(s): in this case the
                    scaffolded contigs will be separated by a number of N characters equal to this
                    estimate. The estimated distances are also saved in the_distanceTable file.
                    By default the scaffolded contigs are separated by 100 Ns
 -f < >             OPTIONAL PARAMETER; The option *-f* is optional and indicates the path to the
                    comparison drafts folder
 -gexf              OPTIONAL PARAMETER;Conting network and path cover are given in gexf format.
 -h                 Print this help and exist.
 -i < >             REQUIRED PARAMETER;The option *-i* indicates the name of the target genome file.
 -n50 < >           OPTIONAL PARAMETER; The option *-n50* allows the calculation of the N50
                    statistic on a FASTA file. In this case the usage is the following:
                    java -jar medusa.jar -n50 . All the other options will be ignored.
 -o < >             OPTIONAL PARAMETER; The option *-o* indicates the name of output fasta file.
 -random < >        OPTIONAL PARAMETER;The option *-random* is available (not required). This
                    option allows the user to run a given number of cleaning rounds and keep the
                    best solution. Since the variability is small 5 rounds are usually sufficient
                    to find the best score.
 -scriptPath < >    OPTIONAL PARAMETER; The folder containing the medusa scripts.
                    Default value: medusa_scripts
 -v                 RECOMMENDED PARAMETER; The option *-v* (recommended) print on console the
                    information given by the package MUMmer.
                    This option is strongly suggested to understand if MUMmer is not running
                    properly.
 -w2                OPTIONAL PARAMETER;The option *-w2* is optional and allows for a sequence
                    similarity based weighting scheme. Using a different weighting scheme may lead
                    to better results.

[user@cn3144 ~]$ medusa -f reference_genomes/ -i Rhodobacter_target.fna -v
INPUT FILE:Rhodobacter_target.fna
------------------------------------------------------------------------------------------------------------------------
Running MUMmer...done.
------------------------------------------------------------------------------------------------------------------------
Building the network...done.
------------------------------------------------------------------------------------------------------------------------
Cleaning the network...done.
------------------------------------------------------------------------------------------------------------------------
Scaffolds File saved: Rhodobacter_target.fnaScaffold.fasta
------------------------------------------------------------------------------------------------------------------------
Number of scaffolds: 78 (singletons = 32, multi-contig scaffold = 46)
from 564 initial fragments.
Total length of the jointed fragments: 4224838
Computing N50 on 78 sequences.
N50: 143991.0
----------------------
Summary File saved: Rhodobacter_target.fna_SUMMARY

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
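When comparing several runs, it can help to pull the scaffold count and N50 out of the console output. The following is a minimal sketch, assuming the console output was redirected to a file (medusa.log is a hypothetical name) and that the "Number of scaffolds:" and "N50:" labels appear as in the session above. summarize_log is an illustrative helper, not part of medusa.

```shell
#!/bin/bash
# Sketch: extract scaffold count and N50 from captured medusa console output.
# Assumes the labels "Number of scaffolds:" and "N50:" appear as in the
# sample session; the value is taken as the field right after each label.
summarize_log() {
    awk '
        /Number of scaffolds:/ {
            for (i = 1; i < NF; i++) if ($i == "scaffolds:") n = $(i + 1)
        }
        /N50:/ {
            for (i = 1; i < NF; i++) if ($i == "N50:") n50 = $(i + 1)
        }
        END { printf "scaffolds=%s N50=%s\n", n, n50 }
    ' "$1"
}

# Example (hypothetical log file name):
#   medusa -f reference_genomes/ -i Rhodobacter_target.fna -v > medusa.log
#   summarize_log medusa.log
```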
Create a batch input file (e.g. medusa.sh). For example:
#!/bin/bash
set -e
module load medusa
medusa -f reference_genomes/ -i Rhodobacter_target.fna -v
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=2 --mem=2g medusa.sh
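A batch job that fails on a missing input only reports it after it has waited in the queue. One possible safeguard is a pre-flight check in the batch script. The sketch below assumes the same file names as the example above; check_inputs is a hypothetical helper, not something the medusa module provides.

```shell
#!/bin/bash
# Sketch: verify inputs before launching medusa.
# check_inputs TARGET REFDIR returns non-zero if the target genome file is
# missing or empty, or if the reference folder is missing or has no entries.
check_inputs() {
    local target=$1 refdir=$2
    if [ ! -s "$target" ]; then
        echo "target genome missing or empty: $target" >&2
        return 1
    fi
    if [ ! -d "$refdir" ] || [ -z "$(ls -A "$refdir")" ]; then
        echo "reference folder missing or empty: $refdir" >&2
        return 1
    fi
}

# In medusa.sh, call it before the medusa line, for example:
#   check_inputs Rhodobacter_target.fna reference_genomes/ || exit 1
```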
Create a swarmfile (e.g. medusa.swarm). For example:
cd dir1;medusa -f reference_genomes/ -i 1_target.fna -v
cd dir2;medusa -f reference_genomes/ -i 2_target.fna -v
cd dir3;medusa -f reference_genomes/ -i 3_target.fna -v
cd dir4;medusa -f reference_genomes/ -i 4_target.fna -v
Submit this job using the swarm command.
swarm -f medusa.swarm [-g #] [-t #] --module medusa
where
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module medusa | Loads the medusa module for each subjob in the swarm |
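For more than a handful of samples, the swarmfile can be generated rather than typed by hand. This is a rough sketch, assuming the layout used in the swarmfile example: one directory per sample, each holding a *_target.fna file and its own reference_genomes/ folder. make_swarm is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: print one swarm command line per *_target.fna file found in the
# given sample directories. Directories without a matching file are skipped.
make_swarm() {
    local d target
    for d in "$@"; do
        d=${d%/}    # drop a trailing slash so the cd path stays tidy
        for target in "$d"/*_target.fna; do
            [ -e "$target" ] || continue
            printf 'cd %s;medusa -f reference_genomes/ -i %s -v\n' \
                "$d" "$(basename "$target")"
        done
    done
}

# Example:
#   make_swarm dir*/ > medusa.swarm
```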