dRep is a Python program for rapidly comparing large numbers of genomes. dRep can also "de-replicate" a genome set by identifying groups of highly similar genomes and choosing the best representative genome for each group.
dRep --help
ls out/figures/
Clustering_scatterplots.pdf
Cluster_scoring.pdf
Primary_clustering_dendrogram.pdf
Secondary_clustering_dendrograms.pdf
Winning_genomes.pdf
Allocate an interactive session and run the program.
Sample session (commands typed by the user follow the $ prompt):
[user@biowulf]$ sinteractive --cpus-per-task=10 --mem=20G
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144]$ module load drep
[user@cn3144]$ mkdir /data/$USER/drep_test/
[user@cn3144]$ cd /data/$USER/drep_test/
[user@cn3144]$ cp -r ${DREP_TEST_DATA:-none}/* .
[user@cn3144]$ dRep --help
...::: dRep v3.2.2 :::...
Matt Olm. MIT License. Banfield Lab, UC Berkeley. 2017 (last updated 2020)
See https://drep.readthedocs.io/en/latest/index.html for documentation
Choose one of the operations below for more detailed help.
Example: dRep dereplicate -h
Commands:
compare -> Compare and cluster a set of genomes
dereplicate -> De-replicate a set of genomes
check_dependencies -> Check which dependencies are properly installed
[user@cn3144]$ dRep dereplicate out3 -g ./test/*fasta
***************************************************
..:: dRep dereplicate Step 1. Filter ::..
***************************************************
Will filter the genome list
2 genomes were input to dRep
Calculating genome info of genomes
100.00% of genomes passed length filtering
Running prodigal
Running checkM
GenomeInfo has no values over 1 for contamination- these should be 0-100, not 0-1!
100.00% of genomes passed checkM filtering
***************************************************
..:: dRep dereplicate Step 2. Cluster ::..
***************************************************
Running primary clustering
Running pair-wise MASH clustering
2 primary clusters made
Running secondary clustering
Running 2 ANImf comparisons- should take ~ 0.2 min
Step 4. Return output
***************************************************
..:: dRep dereplicate Step 3. Choose ::..
***************************************************
Loading work directory
GenomeInfo has no values over 1 for contamination- these should be 0-100, not 0-1!
***************************************************
..:: dRep dereplicate Step 4. Evaluate ::..
***************************************************
will provide warnings about clusters
0 warnings generated: saved to /spin1/home/linux/apptest1/out3/log/warnings.txt
will produce Widb (winner information db)
Winner database saved to /spin1/home/linux/apptest1/out3data_tables/Widb.csv
***************************************************
..:: dRep dereplicate Step 5. Analyze ::..
***************************************************
making plots 1, 2, 3, 4, 5, 6
Plotting primary dendrogram
Plotting secondary dendrograms
Plotting MDS plot
Plotting scatterplots
GenomeInfo has no values over 1 for contamination- these should be 0-100, not 0-1!
Plotting bin scorring plot
GenomeInfo has no values over 1 for contamination- these should be 0-100, not 0-1!
Plotting winning genomes plot...
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
..:: dRep dereplicate finished ::..
Dereplicated genomes................. /spin1/home/linux/apptest1/out3/dereplicated_genomes/
Dereplicated genomes information..... /spin1/home/linux/apptest1/out3/data_tables/Widb.csv
Figures.............................. /spin1/home/linux/apptest1/out3/figures/
Warnings............................. /spin1/home/linux/apptest1/out3/log/warnings.txt
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
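After a run like the one above, a quick sanity check is to compare how many genomes went in with how many representatives came out. The following is a minimal sketch, not part of dRep: the helper name count_fasta is hypothetical, and it assumes that the input files in test/ and the files dRep places in out3/dereplicated_genomes/ all have names ending in "fasta", as in the sample session.

```shell
#!/bin/bash
# count_fasta DIR: print the number of regular files directly inside DIR
# whose names end in "fasta" (hypothetical helper, not a dRep command).
count_fasta() {
    find "$1" -maxdepth 1 -type f -name '*fasta' | wc -l | tr -d ' '
}

# Report only when both directories from the sample session exist.
if [ -d test ] && [ -d out3/dereplicated_genomes ]; then
    echo "input genomes:        $(count_fasta test)"
    echo "dereplicated genomes: $(count_fasta out3/dereplicated_genomes)"
fi
```

Run it from the directory that contains both test/ and the out3 work directory. If the two counts are equal, no genomes were merged into a shared cluster.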
Create a batch input file (e.g. drep.sh). For example:
#!/bin/bash
#SBATCH --job-name=drep_run
#SBATCH --time=2:00:00
#SBATCH --partition=norm
#SBATCH --nodes=1
#SBATCH --mem=20g
#SBATCH --cpus-per-task=4
cd /data/$USER/drep_test
module load drep
dRep dereplicate out2 -g ./test/*fasta
Submit this job using the Slurm sbatch command.
sbatch drep.sh
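The batch script above requests 4 CPUs but does not tell dRep how many threads to use. dRep has a -p / --processors option; to the best of our knowledge its default is 6 in this version, which you can confirm with dRep dereplicate -h. The sketch below shows one way to tie the thread count to the Slurm allocation. The helper name drep_threads is hypothetical, and the guard around the dRep call is there only so the snippet does nothing when the module is not loaded.

```shell
#!/bin/bash
# drep_threads: print the CPU count Slurm allocated to this task.
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is requested;
# fall back to 1 when it is unset (for example, outside a job).
drep_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Run dRep only if it is on PATH (i.e. after "module load drep").
if command -v dRep >/dev/null 2>&1; then
    dRep dereplicate out2 -p "$(drep_threads)" -g ./test/*fasta
fi
```

Matching -p to --cpus-per-task keeps the job from starting more threads than the CPUs it was granted.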