Biowulf High Performance Computing at the NIH
HYPHY on Biowulf

HyPhy (Hypothesis Testing using Phylogenies) is an open-source software package for the analysis of genetic sequences (in particular the inference of natural selection) using techniques in phylogenetics, molecular evolution, and machine learning. It features a rich scripting language for limitless customization of analyses.

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):

[user@biowulf]$ sinteractive --cpus-per-task=8 --mem=20g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load hyphy

[user@cn3144 ~]$ mkdir /data/$USER/hyphy

[user@cn3144 ~]$ cp $HYPHY_TESTDATA/CD2.*   /data/$USER/hyphy/

[user@cn3144 ~]$ cd /data/$USER/hyphy

[user@cn3144 ~]$ hyphy CPU=$SLURM_CPUS_PER_TASK SLAC --alignment CD2.nex --tree CD2.newick

Analysis Description
--------------------
SLAC (Single Likelihood Ancestor Counting) uses a maximum likelihood
ancestral state reconstruction and minimum path substitution counting to
estimate site - level dS and dN, and applies a simple binomial - based
test to test if dS differs drom dN. The estimates aggregate information
over all branches, so the signal is derived from pervasive
diversification or conservation. A subset of branches can be selected
for testing as well. Multiple partitions within a NEXUS file are also
supported for recombination - aware analysis.

- __Requirements__: in-frame codon alignment and a phylogenetic tree

- __Citation__: Not So Different After All: A Comparison of Methods for Detecting Amino
Acid Sites Under Selection (2005). _Mol Biol Evol_ 22 (5): 1208-1222

- __Written by__: Sergei L Kosakovsky Pond and Simon DW Frost

- __Contact Information__: spond@temple.edu

- __Analysis Version__: 2.00


>code -> Universal
>Loaded a multiple sequence alignment with **10** sequences, **187** codons, and **1** partitions from `/spin1/users/$USER/hyphy/CD2.nex`

>branches -> All

>Select the number of samples used to assess ancestral reconstruction uncertainty [select 0 to skip] (permissible range = [0,100000], default value = 100, integer):
>samples -> 100

>Select the p-value threshold to use when testing for selection (permissible range = [0,1], default value = 0.1):
>pvalue -> 0.1


### Branches to include in the SLAC analysis
Selected 16 branches to include in SLAC calculations: `PIG, COW, Node3, HORSE, CAT, Node2, RHMONKEY, BABOON, Node9, HUMAN, CHIMP, Node12, Node8, Node1, RAT, MOUSE`


### Obtaining branch lengths and nucleotide substitution biases under the nucleotide GTR model
* Log(L) = -3532.32, AIC-c =  7112.86 (24 estimated parameters)

### Obtaining the global omega estimate based on relative GTR branch lengths and nucleotide substitution biases
* Log(L) = -3467.32, AIC-c =  6997.72 (31 estimated parameters)
* non-synonymous/synonymous rate ratio for *test* =   1.0079

### Performing joint maximum likelihood ancestral state reconstruction

### For partition 1 these sites are significant at p <=0.1

|     Codon      |   Partition    |       S        |       N        |       dS       |       dN       |Selection detected?|
|:--------------:|:--------------:|:--------------:|:--------------:|:--------------:|:--------------:|:-----------------:|
|       47       |       1        |     2.500      |     1.500      |     5.332      |     0.595      |  Neg. p = 0.066   |
|       65       |       1        |     2.000      |     0.000      |     3.834      |     0.000      |  Neg. p = 0.030   |
|       78       |       1        |     2.000      |     1.000      |     4.024      |     0.400      |  Neg. p = 0.073   |
|       82       |       1        |     2.000      |     0.000      |     4.490      |     0.000      |  Neg. p = 0.033   |
|       87       |       1        |     3.000      |     0.000      |     5.829      |     0.000      |  Neg. p = 0.009   |
|      110       |       1        |     3.000      |     0.000      |     3.000      |     0.000      |  Neg. p = 0.037   |
|      116       |       1        |     2.000      |     0.000      |     4.009      |     0.000      |  Neg. p = 0.034   |
|      123       |       1        |     2.000      |     0.000      |     3.907      |     0.000      |  Neg. p = 0.035   |
|      130       |       1        |     2.000      |     1.000      |     4.490      |     0.391      |  Neg. p = 0.060   |
|      164       |       1        |     2.000      |     0.000      |     4.692      |     0.000      |  Neg. p = 0.026   |
|      166       |       1        |     4.000      |     0.000      |     4.000      |     0.000      |  Neg. p = 0.012   |

### Ancestor sampling analysis

>Generating 100 ancestral sequence samples to obtain confidence intervals

Done with ancestral sampling
Resampling results for partition 1

|     Codon      |   Partition    |    S [median, IQR]    |    N [median, IQR]    |    dS [median, IQR]    |    dN [median, IQR]    |  p-value [median, IQR]  |
|:--------------:|:--------------:|:---------------------:|:---------------------:|:----------------------:|:----------------------:|:-----------------------:|
|       47       |       1        |   2.50 [1.00-3.50]    |   1.50 [1.50-3.00]    |    5.33 [1.80-7.31]    |    0.60 [0.59-1.27]    |    0.07 [0.02-0.57]     |
|       65       |       1        |   2.00 [2.00-3.00]    |   0.00 [0.00-0.00]    |    3.83 [3.83-5.92]    |    0.00 [0.00-0.00]    |    0.03 [0.00-0.03]     |
|       78       |       1        |   2.00 [2.00-3.00]    |   1.00 [1.00-4.00]    |    4.02 [3.94-5.96]    |    0.40 [0.40-1.60]    |    0.07 [0.02-0.27]     |
|       82       |       1        |   2.00 [2.00-3.00]    |   0.00 [0.00-0.00]    |    4.49 [4.49-6.97]    |    0.00 [0.00-0.00]    |    0.03 [0.01-0.03]     |
|       87       |       1        |   3.00 [3.00-5.00]    |   0.00 [0.00-0.00]    |   5.83 [5.83-11.32]    |    0.00 [0.00-0.00]    |    0.01 [0.00-0.01]     |
|      110       |       1        |   3.00 [3.00-4.00]    |   0.00 [0.00-0.00]    |    3.00 [3.00-4.00]    |    0.00 [0.00-0.00]    |    0.04 [0.01-0.04]     |
|      116       |       1        |   2.00 [2.00-3.16]    |   0.00 [0.00-0.00]    |    4.01 [4.01-6.87]    |    0.00 [0.00-0.00]    |    0.03 [0.00-0.03]     |
|      123       |       1        |   2.00 [2.00-3.00]    |   0.00 [0.00-0.00]    |    3.91 [3.91-6.34]    |    0.00 [0.00-0.00]    |    0.04 [0.01-0.04]     |
|      130       |       1        |   2.00 [2.00-4.00]    |   1.00 [1.00-2.00]    |    4.49 [4.49-8.17]    |    0.39 [0.39-0.78]    |    0.06 [0.01-0.11]     |
|      164       |       1        |   2.00 [2.00-3.00]    |   0.00 [0.00-0.47]    |    4.69 [4.69-6.80]    |    0.00 [0.00-0.22]    |    0.03 [0.00-0.05]     |
|      166       |       1        |   4.00 [4.00-5.16]    |   0.00 [0.00-0.00]    |    4.00 [4.00-5.16]    |    0.00 [0.00-0.00]    |    0.01 [0.00-0.01]     |


Check messages.log for diagnostic messages.

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
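
The table of significant sites printed in the session above is plain pipe-delimited text, so it can be filtered with standard tools. The following is a minimal sketch, assuming the session output was saved to a file; the file name slac.out and the function name slac_sites_below are hypothetical and not part of HyPhy.

```shell
# Hypothetical helper: read SLAC screen output on stdin and print
# "codon p-value" for rows whose p-value is below a cutoff (default 0.05).
slac_sites_below() {
    awk -F'|' -v cutoff="${1:-0.05}" '
        # Rows of the first results table split into 9 fields on "|",
        # and their last column looks like "Neg. p = 0.066".
        NF == 9 && $8 ~ /p = / {
            codon = $2
            gsub(/ /, "", codon)        # strip the column padding
            split($8, parts, "=")
            p = parts[2] + 0            # force numeric conversion
            if (p < cutoff) print codon, p
        }'
}

# Example (assumes the output was captured earlier, e.g. with "> slac.out"):
# slac_sites_below 0.05 < slac.out
```

The header, separator, and resampling-table rows are skipped because their last column does not contain "p = ".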

Interactive job - MPI version
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):

Note: the MPI program must be run from the $HYPHY_LIB directory; otherwise HyPhy fails with errors about .bf files that cannot be found.

[user@biowulf]$ sinteractive --ntasks=8 --ntasks-per-core=1 --mem=20g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load hyphy

[user@cn3144 ~]$ mkdir /data/$USER/hyphy

[user@cn3144 ~]$ cp $HYPHY_TESTDATA/CD2.*   /data/$USER/hyphy/ 

[user@cn3144 ~]$ cd $HYPHY_LIB 

[user@cn3144 ~]$ mpirun -np $SLURM_NTASKS HYPHYMPI $HYPHY_TEMPLATES/GARD.bf


Analysis Description
--------------------
GARD : Genetic Algorithms for Recombination Detection. Implements a
heuristic approach to screening alignments of sequences for
recombination, by using the CHC genetic algorithm to search for
phylogenetic incongruence among different partitions of the data. The
number of partitions is determined using a step-up procedure, while the
placement of breakpoints is searched for with the GA. The best fitting
model (based on c-AIC) is returned; and additional post-hoc tests run to
distinguish topological incongruence from rate-variation.

- __Requirements__: A sequence alignment.

- __Citation__: **Automated Phylogenetic Detection of Recombination Using a Genetic
Algorithm**, _Mol Biol Evol 23(10), 1891-1901

- __Written by__: Sergei L Kosakovsky Pond

- __Contact Information__: spond@temple.edu

- __Analysis Version__: 0.1

type: Nucleotide

Select a sequence alignment file (`/spin1/scratch/$USER/hyphy/hyphy-2.5.1/`) /data/$USER/hyphy/CD2.nex
rv: None
>Loaded a Nucleotide multiple sequence alignment with **10** sequences, **561** sites (390 of which are variable) from `/data/$USER/hyphy/CD2.nex`
>Minimum size of a partition is set to be 17 sites


### Fitting the baseline (single-partition; no breakpoints) model
* Log(L) = -3529.89, AIC-c =  7112.21 (25 estimated parameters)

### Performing an exhaustive single breakpoint analysis
Done with single breakpoint analysis.
   Best sinlge break point location: 25
   c-AIC  = 7107.094096738777

### Performing multi breakpoint analysis using a genetic algorithm
[...]

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
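
Because the MPI build depends on being started from $HYPHY_LIB, it can help to verify the environment before calling mpirun. The sketch below assumes, as the session above suggests, that $HYPHY_LIB and $HYPHY_TEMPLATES are defined once the hyphy module is loaded; the function name check_hyphy_mpi_env is hypothetical.

```shell
# Hypothetical pre-flight check: confirm that HYPHY_LIB is a directory and
# that the GARD template exists under HYPHY_TEMPLATES before running mpirun.
check_hyphy_mpi_env() {
    if [ -z "${HYPHY_LIB:-}" ] || [ ! -d "$HYPHY_LIB" ]; then
        echo "HYPHY_LIB is unset or not a directory; load the hyphy module first" >&2
        return 1
    fi
    if [ ! -f "${HYPHY_TEMPLATES:-}/GARD.bf" ]; then
        echo "GARD.bf not found under HYPHY_TEMPLATES" >&2
        return 1
    fi
    return 0
}

# Example use before the mpirun line:
# check_hyphy_mpi_env || exit 1
# cd "$HYPHY_LIB"
```

Failing early with a clear message is easier to diagnose than the .bf path errors HyPhy reports when started from the wrong directory.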

Batch job - threaded version
Most jobs should be run as batch jobs.

Create a batch input file (e.g. hyphy.sh). For example, to run the threaded version of hyphy:

#!/bin/bash
set -e
module load hyphy
cp $HYPHY_TESTDATA/CD2* .

hyphy CPU=$SLURM_CPUS_PER_TASK  GARD --alignment CD2.fasta --tree CD2.newick

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=# [--mem=#] hyphy.sh
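
Slurm only sets SLURM_CPUS_PER_TASK when --cpus-per-task is given, so a script submitted without that option would pass an empty CPU= value to hyphy. A minimal sketch of a guard, assuming a single thread is an acceptable fallback (the variable name threads is illustrative):

```shell
#!/bin/bash
# Sketch only: default to 1 thread when SLURM_CPUS_PER_TASK is unset,
# so that CPU= never expands to an empty value.
threads="${SLURM_CPUS_PER_TASK:-1}"

if command -v hyphy >/dev/null 2>&1; then
    # Same command as in hyphy.sh above, with the guarded thread count.
    hyphy CPU="$threads" GARD --alignment CD2.fasta --tree CD2.newick
else
    # hyphy is not on PATH (module not loaded): show what would have run.
    echo "dry run: hyphy CPU=$threads GARD --alignment CD2.fasta --tree CD2.newick"
fi
```

For instance, sbatch --cpus-per-task=8 --mem=20g hyphy.sh would request the same resources as the interactive session shown earlier; these values are illustrative and should be sized to the alignment.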

Batch job - MPI version

To run the MPI version of HyPhy, create a batch script (e.g. hyphympi.bat) like the sample below. The program must be run from the $HYPHY_LIB directory (the "lib" directory of the HyPhy tree); otherwise you will get errors relating to the paths of the .bf files.

#!/bin/bash

module load hyphy
# copy the test data
mkdir -p /data/$USER/hyphy
cp $HYPHY_TESTDATA/CD2.*   /data/$USER/hyphy/ 

# run the program out of the $HYPHY_LIB directory
cd $HYPHY_LIB 
mpirun -np $SLURM_NTASKS HYPHYMPI $HYPHY_TEMPLATES/GARD.bf --alignment /data/$USER/hyphy/CD2.nex

Submit this job with:

sbatch --ntasks=8 --ntasks-per-core=1 --mem=20g hyphympi.bat