High-Performance Computing at the NIH
BAli-Phy on Biowulf2 & Helix

Description

BAli-Phy can co-estimate alignments (nucleotide, codon, or amino acid) and phylogenetic trees with complex substitution models. It uses Markov chain Monte Carlo (MCMC) based methods.

The package contains the main executable (bali-phy) as well as a number of small command line utilities.

Multiple versions of bali-phy are available. The easiest way to select a version is with environment modules. To list the available modules, type:

module avail bali-phy 

To select a module, use:

module load bali-phy/[version]

where [version] is the version of choice.

Environment variables set

References

Documentation

On Helix

Set up the environment and copy some example data:

helix$ module load bali-phy
helix$ mkdir -p /data/$USER/test_data/bali-phy
helix$ cd /data/$USER/test_data/bali-phy
helix$ cp /usr/local/apps/bali-phy/TEST_DATA/sequences/5S-rRNA/5d.fasta .
helix$ alignment-info 5d.fasta

Alignment: 126 columns of 5 sequences         Alphabet: DNA nucleotides
  sequence lengths: 120-126      mean = 122      median = 121

 ====== w/o indels ======
  const.: 5 (3.97%)      non-const.: 121 (96%)      inform.: 51 (40.5%)
  21.5% minimum sequence identity.

 ====== w/  indels ======
  const.: 3 (2.38%)      non-const.: 123 (97.6%)      inform.: 54 (42.9%)
  21% minimum sequence identity.

 ========   gaps ========
  6 (4.76%) sites contain a gap.
  3 indel groups seem to exist. (3 separate)
       unique/inform. = 3/0       ins./del. = 1/2
  gap lengths: 2-6      mean = 4.33      median = 5

Stop Codons:  8/11/2

Freqencies:   A=18.9%  G=32.6%  T=17.9%  C=30.6%
  Classes:  0 [0%]        Wildcards: 0 [0%]

Get some basic information about bali-phy and do a short MCMC run with only a few iterations:

helix$ bali-phy -v
VERSION: 2.3.6  [master commit c1109600]  (Mar 26 2015 15:25:35)
BUILD: Mar 26 2015 15:46:26
ARCH: x86_64-unknown-linux-gnu
COMPILER: GCC 4.8.1 20130424 (prerelease)
FLAGS: -isystem $(top_srcdir)/boost/include -ffast-math -DNDEBUG -DNDEBUG_DP 
  -funroll-loops  -Wall -Wextra -Wno-sign-compare -Woverloaded-virtual 
  -Wstrict-aliasing -pipe -O3 -pedantic -I/usr/include/cairo 
  -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng12   
  -finline-limit=1000 -std=gnu++11

helix$ bali-phy --help
Usage: bali-phy  [ [OPTIONS]]
[...]

# test without running any MCMC iterations
helix$ bali-phy 5d.fasta --test 
[...]

# run a small number of iterations
helix$ bali-phy 5d.fasta --iterations=50
[...]

# explicitly select alphabet and substitution/indel models
helix$ bali-phy 5d.fasta --iterations=50 --alphabet=DNA --smodel=TN --imodel=RS07
[...]

The user guide describes in much more detail the ways bali-phy can be used.

Batch job on Biowulf2

From the BAli-Phy usage guide:

Running bali-phy on a computing cluster is not necessary, but can speed up the analysis dramatically. This is because a cluster allows you to run several independent MCMC chains simultaneously and pool the resulting samples. You can run multiple chains simultaneously simply by starting several different instances of bali-phy. Each instance of bali-phy runs only one chain and does not require using MPI or special command-line options. This approach to parallel computation is sometimes more efficient than MCMCMC-based parallelism involving heated chains. It is equivalent to running MCMCMC with no temperature difference between chains, with the exception that it allows results from all chains to be used, instead of just results from the single "cold" chain. Thus, if you run 10 independent chains in parallel, then you may gather samples 10 times faster than a single chain.

So, for example, take the following batch script (still using small numbers of iterations):

#! /bin/bash
# filename: baliphy.sh
set -e

module load bali-phy/2.3.7 || exit 1

FA=/usr/local/apps/bali-phy/TEST_DATA/sequences/EF-Tu/5d.fasta
bali-phy $FA --iterations=2000

Start two runs:

biowulf2$ sbatch baliphy.sh
biowulf2$ sbatch baliphy.sh
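
To start more than two chains, the submissions could be wrapped in a small loop. The following is a minimal sketch and not part of the BAli-Phy or Biowulf documentation: the function name `submit_chains` and the `DRY_RUN` variable are hypothetical, and by default the function only prints the `sbatch` commands instead of running them.

```shell
#!/bin/bash
# Hypothetical helper: submit the same batch script N times,
# giving one independent MCMC chain per job.
# With DRY_RUN=1 (the default in this sketch) the sbatch commands are only printed.
submit_chains() {
    local n=$1
    local script=$2
    local i=1
    while [ "$i" -le "$n" ]; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sbatch $script"
        else
            sbatch "$script"
        fi
        i=$((i + 1))
    done
}

# Print the commands for 10 chains of the batch script shown above
submit_chains 10 baliphy.sh
```

Setting `DRY_RUN=0` before the call would actually submit the jobs; the dry-run default is only a precaution for trying the loop out.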

Note that bali-phy automatically sets up separate output directories. Once the MCMC runs are done, one can calculate a consensus tree:

helix$ trees-consensus 5d-1/C1.trees 5d-2/C1.trees > consensus.tree

Swarm of jobs on Biowulf2

As discussed above, running many bali-phy jobs in parallel may be beneficial. One easy way to set up such an analysis is by using swarm. Set up a swarm file with one command per line similar to the following:

bali-phy /usr/local/apps/bali-phy/TEST_DATA/sequences/EF-Tu/5d.fasta --iterations=2000
bali-phy /usr/local/apps/bali-phy/TEST_DATA/sequences/EF-Tu/5d.fasta --iterations=2000
bali-phy /usr/local/apps/bali-phy/TEST_DATA/sequences/EF-Tu/5d.fasta --iterations=2000
[...]
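
Rather than repeating the line by hand, the swarm file could be generated with a short loop. This is a minimal sketch in which the variable names (`FA`, `NCHAINS`, `SWARMFILE`) and the choice of 10 chains are illustrative assumptions:

```shell
#!/bin/bash
# Illustrative values; adjust the input file and the number of chains as needed.
FA=/usr/local/apps/bali-phy/TEST_DATA/sequences/EF-Tu/5d.fasta
NCHAINS=10
SWARMFILE=swarmfile

# Start from an empty file, then write one identical bali-phy command per chain
: > "$SWARMFILE"
i=1
while [ "$i" -le "$NCHAINS" ]; do
    echo "bali-phy $FA --iterations=2000" >> "$SWARMFILE"
    i=$((i + 1))
done
```

The resulting file contains one identical command line per chain.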

Then submit the swarm:

swarm -f swarmfile --module bali-phy/2.3.7

For more details see the swarm documentation.

Interactive job on Biowulf2

Because running resource-intensive analyses on the biowulf2 login node is not permitted, it may be useful to allocate an interactive node for initial experiments. For example:

biowulf2$ sinteractive
salloc.exe: Granted job allocation 214002
node$ bali-phy /usr/local/apps/bali-phy/TEST_DATA/sequences/HIV/chain-2005/env-clustal-codons.fasta \
    --iterations=200 --smodel='M0[GTR]' --alphabet=Codons
[...]
node$ exit
salloc.exe: Relinquishing job allocation 214002
salloc.exe: Job allocation 214002 has been revoked.