Biowulf High Performance Computing at the NIH
dRep on Biowulf

dRep is a Python program for rapidly comparing large numbers of genomes. dRep can also "de-replicate" a genome set by identifying groups of highly similar genomes and choosing the best representative genome for each group.


Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows each shell prompt):

[user@biowulf]$ sinteractive --cpus-per-task=10 --mem=10G
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load drep
[user@cn3144]$ mkdir /data/$USER/drep_test/
[user@cn3144]$ cd /data/$USER/drep_test/
[user@cn3144]$ cp -r ${DREP_TEST_DATA:-none}/* .
[user@cn3144]$ dRep --help
...::: dRep v3.2.2 :::...

  Matt Olm. MIT License. Banfield Lab, UC Berkeley. 2017 (last updated 2020)

  See https://drep.readthedocs.io/en/latest/index.html for documentation
  Choose one of the operations below for more detailed help.

  Example: dRep dereplicate -h

  Commands:
    compare            -> Compare and cluster a set of genomes
    dereplicate        -> De-replicate a set of genomes
    check_dependencies -> Check which dependencies are properly installed


[user@cn3144]$ dRep dereplicate out3 -g ./test/*fasta
***************************************************
    ..:: dRep dereplicate Step 1. Filter ::..
***************************************************

Will filter the genome list
2 genomes were input to dRep
Calculating genome info of genomes
100.00% of genomes passed length filtering
Running prodigal
Running checkM
GenomeInfo has no values over 1 for contamination- these should be 0-100, not 0-1!
100.00% of genomes passed checkM filtering
***************************************************
    ..:: dRep dereplicate Step 2. Cluster ::..
***************************************************

Running primary clustering
Running pair-wise MASH clustering
2 primary clusters made
Running secondary clustering
Running 2 ANImf comparisons- should take ~ 0.2 min
Step 4. Return output
***************************************************
    ..:: dRep dereplicate Step 3. Choose ::..
***************************************************

Loading work directory
GenomeInfo has no values over 1 for contamination- these should be 0-100, not 0-1!
***************************************************
    ..:: dRep dereplicate Step 4. Evaluate ::..
***************************************************

will provide warnings about clusters
0 warnings generated: saved to /spin1/home/linux/apptest1/out3/log/warnings.txt
will produce Widb (winner information db)
Winner database saved to /spin1/home/linux/apptest1/out3data_tables/Widb.csv
***************************************************
    ..:: dRep dereplicate Step 5. Analyze ::..
***************************************************

making plots 1, 2, 3, 4, 5, 6
Plotting primary dendrogram
Plotting secondary dendrograms
Plotting MDS plot
Plotting scatterplots
GenomeInfo has no values over 1 for contamination- these should be 0-100, not 0-1!
Plotting bin scorring plot
GenomeInfo has no values over 1 for contamination- these should be 0-100, not 0-1!
Plotting winning genomes plot...

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

    ..:: dRep dereplicate finished ::..

Dereplicated genomes................. /spin1/home/linux/apptest1/out3/dereplicated_genomes/
Dereplicated genomes information..... /spin1/home/linux/apptest1/out3/data_tables/Widb.csv
Figures.............................. /spin1/home/linux/apptest1/out3/figures/
Warnings............................. /spin1/home/linux/apptest1/out3/log/warnings.txt

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
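
The end of the run above lists where the dereplicated genomes and the winner table (Widb.csv) were written. A minimal sketch for checking those outputs from the shell is given below. The function name summarize_drep_out is a hypothetical helper, not part of dRep, and it assumes only the directory layout shown in the log (dereplicated_genomes/ and data_tables/Widb.csv under the work directory). It prints the CSV header rather than assuming any column names.

```shell
#!/bin/bash
# Hypothetical helper (not part of dRep): summarize a dRep work directory.
# Assumes the layout reported at the end of "dRep dereplicate".
summarize_drep_out() {
    local workdir="${1:?usage: summarize_drep_out WORKDIR}"
    local genomes="$workdir/dereplicated_genomes"
    local widb="$workdir/data_tables/Widb.csv"

    if [[ ! -d "$genomes" ]]; then
        echo "no dereplicated_genomes directory under $workdir" >&2
        return 1
    fi

    # Count the representative genome files that were kept.
    local n
    n=$(find "$genomes" -maxdepth 1 -type f | wc -l)
    echo "representative genomes: $((n))"

    # Show the header and the number of data rows in the winner table.
    if [[ -f "$widb" ]]; then
        echo "Widb.csv columns: $(head -n 1 "$widb")"
        echo "Widb.csv data rows: $(awk 'END { print NR - 1 }' "$widb")"
    fi
}

# Example call; skipped when the work directory from the session is absent.
if [[ -d out3 ]]; then
    summarize_drep_out out3
fi
```

Run it from the directory that contains the work directory, for example /data/$USER/drep_test in the session above.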

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. drep.sh). For example:


#!/bin/bash

#SBATCH --job-name=drep_run
#SBATCH --time=2:00:00

#SBATCH --partition=norm
#SBATCH --nodes=1
#SBATCH --mem=20g
#SBATCH --cpus-per-task=4

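# Work in the directory that holds the test genomes
cd /data/$USER/drep_test
module load drep
# Match dRep's thread count (-p) to the CPUs allocated by Slurm
dRep dereplicate out2 -g ./test/*fasta -p "$SLURM_CPUS_PER_TASK"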



Submit this job using the Slurm sbatch command.

sbatch drep.sh
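
To keep track of the submitted job, the job ID can be captured at submission time. The sketch below assumes the standard Slurm tools sbatch (with its --parsable option) and squeue (with -j). submit_drep is a hypothetical wrapper name, and drep.sh is the batch file from the example above.

```shell
#!/bin/bash
# Hypothetical wrapper (not a Biowulf or dRep tool): submit a batch script
# and show its queue entry.
submit_drep() {
    local script="${1:-drep.sh}"
    local jobid

    # --parsable makes sbatch print only the job ID
    # (optionally followed by ";cluster" on multi-cluster setups).
    jobid=$(sbatch --parsable "$script") || return 1
    jobid="${jobid%%;*}"

    echo "Submitted $script as job $jobid"

    # Show the current state of the job in the queue.
    squeue -j "$jobid"
}

# Example call; skipped when Slurm or the batch file is not available.
if command -v sbatch >/dev/null 2>&1 && [[ -f drep.sh ]]; then
    submit_drep drep.sh
fi
```

The job ID printed here is the same number that sbatch would otherwise report on its own, so it can be reused with other Slurm commands while the job runs.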