iphop on Biowulf
iPHoP stands for Integrated Phage Host Prediction. It is an automated command-line pipeline designed to predict the host genus of novel bacteriophages and archaeoviruses based on their genome sequences.
The pipeline can be broken down into 6 main steps:
Step 1: Running Individual Host Prediction Tools
- Phage-based Tool: RaFAH (DOI: 10.1016/j.patter.2021.100274)
- Host-based Tools:
- blastn to Host Genomes: All hits with ≥ 80% identity and ≥ 500 bp are considered. Hits covering ≥ 50% of the "host" contig length are ignored, as these often derive from contigs that are (nearly) entirely viral. Such contigs can easily be contaminants in genomes or MAGs and are thus not reliable for host prediction.
- blastn to CRISPR Spacer Database: All hits with up to 4 mismatches are considered.
- WIsH (DOI: 10.1093/bioinformatics/btx383): Host association based on k-mer composition similarity between virus and host genome.
- VHM - s2* Similarity (DOI: 10.1093/nar/gkw1002, DOI: 10.1093/nargab/lqaa044): Host association based on k-mer composition similarity between virus and host genome.
- PHP (DOI: 10.1186/s12915-020-00938-6): Host association based on k-mer composition similarity between virus and host genome.
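The blastn-to-host-genome thresholds in Step 1 can be illustrated with a small filter. This is only a sketch, not iPHoP's own code: it assumes a hypothetical tab-separated hit table with the columns qseqid, sseqid, pident, length, slen, and it approximates "coverage of the host contig" as the alignment length of a single hit divided by the contig length. The function name filter_host_hits is made up for this example.

```bash
#!/bin/bash
# Sketch only: apply the Step 1 blastn thresholds to a hit table.
# Assumed (hypothetical) tab-separated columns:
#   1 qseqid   2 sseqid   3 pident   4 length   5 slen
# Keeps hits with >= 80% identity and >= 500 bp alignment length, and
# drops hits whose alignment covers >= 50% of the host contig.
filter_host_hits() {
    awk -F'\t' '$5 > 0 && $3 >= 80 && $4 >= 500 && ($4 / $5) < 0.5' "$@"
}

# Usage: filter_host_hits hits.tsv > hits.filtered.tsv
```

The function reads from a file argument or from standard input, so it can also sit at the end of a pipe.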
Step 2: Collect All Scores and All Distances Between Hits for Host-based Tools
- Distance between two potential hosts, i.e., two hits for a given tool and a given query virus, is based on the GTDB trees (DOI: 10.1093/nar/gkab776).
Steps 3 and 4: Compile an Organized List of Hits for Each Virus - Tool - Candidate Host Combination
- For each hit, the top other hits from the same virus with the same tool are compiled and organized according to the distance between the base hit host and the other hit host (see step 2).
- These series of hits are used as input for automated classifiers to derive a score for the given virus - candidate host pair.
- This enables the evaluation of every potential host (every hit) when considering the context of the top hits obtained for this virus.
Step 5: Derive 3 Scores for Host-based Tools for Each Virus - Candidate Host Combination
- The top scores based only on blastn or CRISPR matches are retained, as these methods can be reliable enough by themselves for host prediction.
- A third score is obtained by considering all scores from all individual classifiers (see step 4), i.e., taking into account all 5 host-based methods.
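The idea of retaining the top blastn-only and CRISPR-only scores can be illustrated as follows. This is a sketch of one possible reading of Step 5, not the pipeline's actual implementation: the input table (virus, candidate host, method, score), the method labels "blast" and "crispr", and the function name top_scores are all assumptions made for the example.

```bash
#!/bin/bash
# Sketch only: keep the highest score per virus / candidate host / method
# for the two methods that are retained on their own in Step 5.
# Assumed (hypothetical) tab-separated columns:
#   1 virus   2 candidate host   3 method   4 score
top_scores() {
    awk -F'\t' -v OFS='\t' '
        $3 == "blast" || $3 == "crispr" {
            key = $1 OFS $2 OFS $3
            # Compare numerically, but store the score exactly as written
            if (!(key in best) || $4 + 0 > best[key] + 0) best[key] = $4
        }
        END { for (k in best) print k, best[k] }
    ' "$@" | LC_ALL=C sort
}

# Usage: top_scores classifier_scores.tsv
```

Rows from other methods are ignored here; in the pipeline they contribute to the third, combined score instead.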
Step 6: Calculate a Composite Score for Each Virus - Candidate Host Genus Combination Integrating Host-based and Phage-based Signals
- The 3 host-based scores (see step 5) are then considered alongside the phage-based score (RaFAH - DOI: 10.1016/j.patter.2021.100274) to obtain a single score for all pairs of virus - candidate host genus.
Important Notes
- Important environment variables: $IPHOP_TEST_DATA
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program.
Sample session:
[user@biowulf]$ sinteractive --gres=lscratch:10 -c 8 --mem=32g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load iphop
[user@cn3144 ~]$ cd /data/$USER/
[user@cn3144 ~]$ cp ${IPHOP_TEST_DATA:-none}/* .
[user@cn3144 ~]$ iphop predict --fa_file test_input_phages.fna --db_dir /fdb/iphop_db/Aug_2023_pub_rw/ --out_dir iphop_out
Batch job
Create a batch input file (e.g. iphop.sh). For example:
#!/bin/bash
set -e
module load iphop
cd /data/$USER
iphop predict --fa_file test_input_phages.fna --db_dir /fdb/iphop_db/Aug_2023_pub_rw/ --out_dir iphop_out
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] iphop.sh
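To make use of the CPUs requested with --cpus-per-task, the batch script can pass the Slurm allocation on to iphop. The helper below is a sketch under assumptions: the function name iphop_args is made up for this example, and the thread option is assumed to be --num_threads (confirm with iphop predict -h for the installed version). Outside a Slurm job it falls back to a single thread.

```bash
#!/bin/bash
# Sketch only: build the argument list for "iphop predict", taking the
# thread count from Slurm. --num_threads is an assumed option name;
# check "iphop predict -h" before relying on it.
iphop_args() {
    local fasta="$1" outdir="$2"
    # SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given
    local threads="${SLURM_CPUS_PER_TASK:-1}"
    printf '%s\n' predict \
        --fa_file "$fasta" \
        --db_dir /fdb/iphop_db/Aug_2023_pub_rw/ \
        --out_dir "$outdir" \
        --num_threads "$threads"
}

# Usage inside iphop.sh, after "module load iphop":
#   mapfile -t args < <(iphop_args test_input_phages.fna iphop_out)
#   iphop "${args[@]}"
```

A submission that mirrors the resources of the interactive session above would then be, for example, sbatch --cpus-per-task=8 --mem=32g iphop.sh.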