iphop on Biowulf
iPHoP stands for Integrated Phage Host Prediction. It is an automated command-line pipeline designed to predict the host genus of novel bacteriophages and archaeoviruses based on their genome sequences.
- Phage-based Tool:
- RaFAH (DOI: 10.1016/j.patter.2021.100274): Yields a prediction of host genus with an associated score, stored for later use (see Step 5).
- Host-based Tools:
- blastn to Host Genomes: All hits with ≥ 80% identity and ≥ 500 bp are considered. Hits covering ≥ 50% of the "host" contig length are ignored, as these often derive from contigs that are (nearly) entirely viral; such contigs can easily be contaminants in genomes or MAGs and are thus not reliable for host prediction.
- blastn to CRISPR Spacer Database: All hits with up to 4 mismatches are considered.
- WIsH (DOI: 10.1093/bioinformatics/btx383): Host association based on k-mer composition similarity between virus and host genome.
- VHM - s2* Similarity (DOI: 10.1093/nar/gkw1002, DOI: 10.1093/nargab/lqaa044): Host association based on k-mer composition similarity between virus and host genome.
- PHP (DOI: 10.1186/s12915-020-00938-6): Host association based on k-mer composition similarity between virus and host genome.
- Distance between two potential hosts, i.e., two hits for a given tool and a given query virus, is based on the GTDB trees (DOI: 10.1093/nar/gkab776).
- For each hit, the top other hits from the same virus with the same tool are compiled and organized according to the distance between the base hit host and the other hit host (see step 2).
- These series of hits are used as input for automated classifiers to derive a score for the given virus - candidate host pair.
- This enables the evaluation of every potential host (every hit) when considering the context of the top hits obtained for this virus.
- The top scores based only on blastn or CRISPR matches are retained, as these methods can be reliable enough by themselves for host prediction.
- A third score is obtained by considering all scores from all individual classifiers (see step 4), i.e., taking into account all 5 host-based methods.
- The 3 host-based scores (see step 5) are then considered alongside the phage-based score (RaFAH - DOI: 10.1016/j.patter.2021.100274) to obtain a single score for all pairs of virus - candidate host genus.
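The blastn-to-host-genomes criteria listed above (≥ 80% identity, ≥ 500 bp, and less than 50% of the host contig covered) can be illustrated with a small filter. This is only a sketch under stated assumptions, not iPHoP's own code: the five-column tab-separated layout, the function name filter_blastn_hits, and the use of a single alignment length as the measure of host contig coverage are all hypothetical simplifications.

```shell
# Hypothetical input columns (tab-separated), NOT iPHoP's internal format:
#   1 virus id, 2 host contig id, 3 percent identity,
#   4 alignment length (bp), 5 host contig length (bp)
filter_blastn_hits() {
    # Keep a hit only if identity >= 80, length >= 500 bp, and the
    # alignment covers less than 50% of the host contig (assumption:
    # coverage is approximated as alignment length / contig length).
    awk -F'\t' '$3 >= 80 && $4 >= 500 && ($4 * 2) < $5 { print }' "$@"
}

# Made-up example rows: only hostA passes all three criteria.
#   hostB: identity too low; hostC: alignment too short;
#   hostD: alignment covers 80% of the host contig.
filtered=$(printf '%s\t%s\t%s\t%s\t%s\n' \
    v1 hostA 95.0 1200 50000 \
    v1 hostB 75.0 1200 50000 \
    v1 hostC 99.0 300 50000 \
    v1 hostD 99.0 4000 5000 | filter_blastn_hits)

echo "$filtered"
```

The last row shows why the coverage rule exists: a near-perfect match that spans most of a short "host" contig is more likely a viral contaminant than evidence of a host.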
References:
- iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria
Documentation
Important Notes
- Important environment variables: $IPHOP_TEST_DATA
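Because $IPHOP_TEST_DATA is only expected to be defined once the module is loaded, it can help to check it before copying the test files. The following is a minimal sketch; the function name check_test_data and the messages are hypothetical and are not part of the iphop module.

```shell
# Report where the test data lives, or warn if the variable is unusable.
check_test_data() {
    if [ -n "${IPHOP_TEST_DATA:-}" ] && [ -d "$IPHOP_TEST_DATA" ]; then
        echo "test data: $IPHOP_TEST_DATA"
    else
        echo "IPHOP_TEST_DATA is not set or is not a directory; try 'module load iphop' first" >&2
        return 1
    fi
}

# Example call: fall back to a message instead of aborting the shell.
check_test_data || echo "skipping copy of test data"
```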
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):
[user@biowulf]$ sinteractive --gres=lscratch:10 -c 8 --mem=32g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load iphop
[user@cn3144 ~]$ cd /data/$USER/
[user@cn3144 ~]$ cp ${IPHOP_TEST_DATA:-none}/* .
[user@cn3144 ~]$ iphop predict --fa_file test_input_phages.fna --db_dir /fdb/iphop_db/Aug_2023_pub_rw/ --out_dir iphop_out
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. iphop.sh). For example:
#!/bin/bash
set -e
module load iphop
cd /data/$USER
iphop predict --fa_file test_input_phages.fna --db_dir /fdb/iphop_db/Aug_2023_pub_rw/ --out_dir iphop_out
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] iphop.sh
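As a concrete instance of the template above, the placeholders can be filled with the same resources requested in the interactive example (8 CPUs, 32 GB of memory). These values are only an illustrative assumption, not a recommendation for real datasets. The sketch prints the command rather than submitting it, so it can be inspected first.

```shell
# Illustrative values mirroring the interactive session; adjust as needed.
cpus=8
mem=32g

# Compose the submission command without running it.
submit_cmd="sbatch --cpus-per-task=${cpus} --mem=${mem} iphop.sh"
echo "$submit_cmd"
```

Paste the printed line at the Biowulf prompt to submit the job.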