drep on Biowulf
dRep is a Python program for rapidly comparing large numbers of genomes. dRep can also "de-replicate" a genome set by identifying groups of highly similar genomes and choosing the best representative genome for each group.
References:
- Matthew R Olm et al. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. The ISME Journal, volume 11, pages 2864–2868 (2017).
Documentation
- dRep GitHub repository
Important Notes
- drep can be run as:
dRep --help
- Figures are written to the figures folder inside the work directory:
ls out/figures/
Clustering_scatterplots.pdf Cluster_scoring.pdf Primary_clustering_dendrogram.pdf Secondary_clustering_dendrograms.pdf Winning_genomes.pdf
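A quick way to confirm that a run produced the figures listed above is to loop over the expected file names. The following is a minimal sketch, not part of dRep: check_figures is a hypothetical helper name, and the list of file names is taken only from the listing above, so other dRep versions or inputs may produce a different set.

```shell
#!/bin/bash
# check_figures is a hypothetical helper (not part of dRep).
# It reports which of the figure files listed above exist and are non-empty.
check_figures() {
    local figdir="$1"
    local missing=0
    local name
    for name in Clustering_scatterplots.pdf Cluster_scoring.pdf \
                Primary_clustering_dendrogram.pdf \
                Secondary_clustering_dendrograms.pdf Winning_genomes.pdf; do
        if [ -s "$figdir/$name" ]; then
            echo "ok       $name"
        else
            echo "missing  $name"
            missing=$((missing + 1))
        fi
    done
    # Return non-zero if any expected figure is absent or empty.
    [ "$missing" -eq 0 ]
}
```

It would be called as check_figures out/figures. A "missing" line is best treated as a prompt to read the run log rather than as proof that the run failed.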
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):
[user@biowulf]$ sinteractive --cpus-per-task=10 --mem=20G
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load drep
[user@cn3144]$ mkdir /data/$USER/drep_test/
[user@cn3144]$ cd /data/$USER/drep_test/
[user@cn3144]$ cp -r ${DREP_TEST_DATA:-none}/* .
[user@cn3144]$ dRep --help
...::: dRep v3.2.2 :::...

Matt Olm. MIT License. Banfield Lab, UC Berkeley. 2017 (last updated 2020)

See https://drep.readthedocs.io/en/latest/index.html for documentation
Choose one of the operations below for more detailed help.

Example: dRep dereplicate -h

Commands:
compare -> Compare and cluster a set of genomes
dereplicate -> De-replicate a set of genomes
check_dependencies -> Check which dependencies are properly installed

[user@cn3144]$ dRep dereplicate out3 -g ./test/*fasta
***************************************************
..:: dRep dereplicate Step 1. Filter ::..
***************************************************
Will filter the genome list
2 genomes were input to dRep
Calculating genome info of genomes
100.00% of genomes passed length filtering
Running prodigal
Running checkM
GenomeInfo has no values over 1 for contamination- these should be 0-100, not 0-1!
100.00% of genomes passed checkM filtering
***************************************************
..:: dRep dereplicate Step 2. Cluster ::..
***************************************************
Running primary clustering
Running pair-wise MASH clustering
2 primary clusters made
Running secondary clustering
Running 2 ANImf comparisons- should take ~ 0.2 min
Step 4. Return output
***************************************************
..:: dRep dereplicate Step 3. Choose ::..
***************************************************
Loading work directory
GenomeInfo has no values over 1 for contamination- these should be 0-100, not 0-1!
***************************************************
..:: dRep dereplicate Step 4. Evaluate ::..
***************************************************
will provide warnings about clusters
0 warnings generated: saved to /spin1/home/linux/apptest1/out3/log/warnings.txt
will produce Widb (winner information db)
Winner database saved to /spin1/home/linux/apptest1/out3data_tables/Widb.csv
***************************************************
..:: dRep dereplicate Step 5. Analyze ::..
***************************************************
making plots 1, 2, 3, 4, 5, 6
Plotting primary dendrogram
Plotting secondary dendrograms
Plotting MDS plot
Plotting scatterplots
GenomeInfo has no values over 1 for contamination- these should be 0-100, not 0-1!
Plotting bin scorring plot
GenomeInfo has no values over 1 for contamination- these should be 0-100, not 0-1!
Plotting winning genomes plot...

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
..:: dRep dereplicate finished ::..

Dereplicated genomes................. /spin1/home/linux/apptest1/out3/dereplicated_genomes/
Dereplicated genomes information..... /spin1/home/linux/apptest1/out3/data_tables/Widb.csv
Figures.............................. /spin1/home/linux/apptest1/out3/figures/
Warnings............................. /spin1/home/linux/apptest1/out3/log/warnings.txt
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
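The final summary in the session above names four output locations inside the work directory. The sketch below shows one way to take a quick look at three of them after a run. summarize_drep is a hypothetical helper name, not a dRep command, and the script assumes that Widb.csv begins with a single header line and that each regular file in dereplicated_genomes is one genome.

```shell
#!/bin/bash
# summarize_drep is a hypothetical helper (not part of dRep).
# It prints simple counts for the outputs named in the dRep summary.
summarize_drep() {
    local workdir="$1"
    local genomes_dir="$workdir/dereplicated_genomes"
    local widb="$workdir/data_tables/Widb.csv"
    local warnings="$workdir/log/warnings.txt"
    local n

    if [ ! -d "$genomes_dir" ]; then
        echo "no dereplicated_genomes directory under $workdir" >&2
        return 1
    fi

    # Count regular files in the dereplicated genomes folder.
    n=$(find "$genomes_dir" -maxdepth 1 -type f | wc -l)
    echo "dereplicated genomes: $((n))"

    # Assumption: the first line of Widb.csv is a header; count the rest.
    if [ -f "$widb" ]; then
        n=$(awk 'NR > 1 && NF { c++ } END { print c + 0 }' "$widb")
        echo "Widb.csv data rows: $n"
    fi

    # Report how many lines the warnings file contains.
    if [ -f "$warnings" ]; then
        n=$(wc -l < "$warnings")
        echo "warnings.txt lines: $((n))"
    fi
}
```

For the session above it would be called as summarize_drep out3 from the directory in which dRep was run.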
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. drep.sh). For example:
#!/bin/bash
#SBATCH --job-name=drep_run
#SBATCH --time=2:00:00
#SBATCH --partition=norm
#SBATCH --nodes=1
#SBATCH --mem=20g
#SBATCH --cpus-per-task=4
cd /data/$USER/drep_test
module load drep
dRep dereplicate out2 -p $SLURM_CPUS_PER_TASK -g ./test/*fasta
Submit this job using the Slurm sbatch command.
sbatch drep.sh
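When the submission is scripted, it can help to capture the job ID for later status checks. The following is a minimal sketch under stated assumptions: submit_drep is a hypothetical wrapper name, and it relies on the standard Slurm sbatch option --parsable, which prints the job ID, optionally followed by a semicolon and the cluster name.

```shell
#!/bin/bash
# submit_drep is a hypothetical wrapper (not part of dRep or Slurm).
# It submits a batch script and prints only the numeric job ID.
submit_drep() {
    local script="$1"
    local jobid

    if [ ! -f "$script" ]; then
        echo "batch script not found: $script" >&2
        return 1
    fi

    # --parsable prints "jobid" or "jobid;cluster"; keep only the ID part.
    jobid=$(sbatch --parsable "$script") || return 1
    echo "${jobid%%;*}"
}
```

A typical use would be jobid=$(submit_drep drep.sh), after which squeue -j "$jobid" shows the state of that job.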