Biowulf High Performance Computing at the NIH
PROVEAN on Biowulf

PROVEAN (PROtein Variation Effect ANalyzer) is a software tool that predicts whether an amino acid substitution or indel affects the biological function of a protein. It is useful for filtering sequence variants to identify nonsynonymous or indel variants that are predicted to be functionally important. Its performance is comparable to that of popular tools such as SIFT and PolyPhen-2.

Important Notes

NOTE 1: PROVEAN uses the NCBI nr BLAST database. When many PROVEAN jobs are run simultaneously, it is best to make a local copy of the nr database by adding the option --local_nr to the provean.sh command line. For only one or two jobs there is no reason to use --local_nr, because the time taken to copy the database may outweigh the time needed to complete the analysis.

NOTE 2: PROVEAN is a multithreaded program. The number of threads can be changed by setting the option --num_threads. By default the program only uses 1 thread.
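
Inside a Slurm allocation, the thread count can follow the number of allocated CPUs. The following is a minimal sketch, assuming the job may have been submitted without --cpus-per-task, in which case Slurm may leave SLURM_CPUS_PER_TASK unset. The helper name provean_threads is hypothetical and not part of PROVEAN.

```shell
#!/bin/bash
# Hypothetical helper: print the thread count to pass to --num_threads.
# Falls back to 1 (PROVEAN's default) when SLURM_CPUS_PER_TASK is unset or empty.
provean_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Example use inside a job script (file names are placeholders):
# provean.sh -q myfasta.fasta -v myfasta.var --num_threads "$(provean_threads)"
```

With the fallback, --num_threads always receives a value, even when the job was started without an explicit CPU request.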

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --cpus-per-task=8 --mem=20g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load PROVEAN
[user@cn3144 ~]$ ln -s $PROVEAN_EXAMPLES/* .
[user@cn3144 ~]$ provean.sh -q P04637.fasta -v P04637.var --save_supporting_set P04637.sss --num_threads $SLURM_CPUS_PER_TASK

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. PROVEAN.sh). For example:

#!/bin/bash
module load PROVEAN
provean.sh -q myfasta.fasta -v myfasta.var --save_supporting_set myfasta.sss --num_threads $SLURM_CPUS_PER_TASK > myfasta.out 2>&1

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] PROVEAN.sh

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. PROVEAN.swarm). For example:

provean.sh -q seq1.fasta -v seq1.var --local_nr --save_supporting_set seq1.sss --num_threads $SLURM_CPUS_PER_TASK
provean.sh -q seq2.fasta -v seq2.var --local_nr --save_supporting_set seq2.sss --num_threads $SLURM_CPUS_PER_TASK
provean.sh -q seq3.fasta -v seq3.var --local_nr --save_supporting_set seq3.sss --num_threads $SLURM_CPUS_PER_TASK
provean.sh -q seq4.fasta -v seq4.var --local_nr --save_supporting_set seq4.sss --num_threads $SLURM_CPUS_PER_TASK
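
For many sequences, a swarmfile like this can be generated rather than typed by hand. The following is a minimal sketch, assuming each NAME.fasta has a matching NAME.var in the same directory and that paths contain no spaces. The function name make_provean_swarm is hypothetical. The --num_threads argument is added, following NOTE 2, so that each process uses the CPUs requested from swarm.

```shell
#!/bin/bash
# Hypothetical helper: print one provean.sh command per FASTA file in a directory.
# Assumes each NAME.fasta has a matching NAME.var in the same directory;
# FASTA files without a .var file are skipped with a warning on stderr.
make_provean_swarm() {
    local dir="${1:-.}"
    local fasta base
    for fasta in "$dir"/*.fasta; do
        [ -e "$fasta" ] || continue    # the glob matched nothing
        base="${fasta%.fasta}"
        if [ ! -f "$base.var" ]; then
            echo "skipping $fasta: no $base.var" >&2
            continue
        fi
        # Single quotes keep $SLURM_CPUS_PER_TASK literal so that it is
        # expanded later, inside each swarm subjob.
        printf 'provean.sh -q %s -v %s --local_nr --save_supporting_set %s --num_threads $SLURM_CPUS_PER_TASK\n' \
            "$base.fasta" "$base.var" "$base.sss"
    done
}

# Usage: make_provean_swarm /path/to/inputs > PROVEAN.swarm
```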

Submit this job using the swarm command.

swarm -f PROVEAN.swarm -g 20 -t 16 --module PROVEAN --gres lscratch:200 --bundle 1000 --time 20

where

-g 20                  Number of gigabytes of memory required for each process (1 line in the swarm command file)
-t 16                  Number of threads/CPUs required for each process (1 line in the swarm command file)
--module PROVEAN       Loads the PROVEAN module for each subjob in the swarm
--gres lscratch:200    Allocates 200 GB of local scratch space for the nr database
--bundle 1000          Makes sure that a small number of nodes are allocated for maximum benefit of --local_nr
--time 20              Average number of minutes per process (1 line in the swarm command file)

Keep in mind that copying the nr database to the local scratch drive takes about 15-20 minutes. If the swarm contains fewer than 50 sequences, this copy time may make up a significant share of the total run time.
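
The rule of thumb above can be applied when building a swarmfile. The following is a minimal sketch, assuming the figure of 50 sequences from the note is used as a threshold (a guideline, not a hard limit). The helper name local_nr_flag is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print "--local_nr" when the number of sequences is at
# least the threshold, and nothing otherwise. The default threshold of 50
# comes from the note above and is a guideline, not a hard limit.
local_nr_flag() {
    local n_sequences="$1"
    local threshold="${2:-50}"
    if [ "$n_sequences" -ge "$threshold" ]; then
        echo "--local_nr"
    fi
}

# Example: count the FASTA files in the current directory and choose the flag.
# n=$(find . -maxdepth 1 -name '*.fasta' | wc -l)
# flag=$(local_nr_flag "$n")
```

The second argument allows a different threshold if observed run times suggest one.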