High-Performance Computing at the NIH
hhsuite on Biowulf & Helix

Description

The HH-suite is an open-source software package for sensitive protein sequence searching based on the pairwise alignment of hidden Markov models (HMMs).

Overview of programs

Main HMM search/align tools
hhblits - (Iteratively) search an HH-suite database with a query sequence or MSA
hhsearch - Search an HHsearch database of HMMs with a query MSA or HMM
hhmake - Build an HMM from an input MSA
hhfilter - Filter an MSA by max sequence identity, coverage, and other criteria
hhalign - Calculate pairwise alignments, dot plots, etc. for two HMMs/MSAs
hhconsensus - Calculate the consensus sequence for an A3M/FASTA input file
Utility scripts
reformat.pl - Reformat one or many MSAs
addss.pl - Add PSIPRED predicted secondary structure to an MSA or HHM file
hhmakemodel.pl - Generate MSAs or coarse 3D models from HHsearch or HHblits results
hhsuitedb.pl - Build an HH-suite database with prefiltering, packed MSA/HMM, and index files
multithread.pl - Run a command for many files in parallel using multiple threads
splitfasta.pl - Split a multiple-sequence FASTA file into multiple single-sequence files
renumberpdb.pl - Generate a PDB file with indices renumbered to match input sequence indices
mergeali.pl - Merge MSAs in A3M format according to an MSA of their seed sequences
pdb2fasta.pl - Generate a FASTA sequence file from SEQRES records of globbed PDB files
pdbfilter.pl - Generate a representative set of PDB/SCOP sequences from pdb2fasta.pl output

There may be multiple versions of hhsuite available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail hhsuite 

To select a module use

module load hhsuite/[version]

where [version] is the version of choice.

hhsuite is a multithreaded application. Make sure the number of CPUs requested from Slurm matches the number of threads passed to the -cpu option.
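
For example, a script can pick the thread count up from the Slurm allocation instead of hard-coding it. A minimal sketch (the fallback value of 2 for runs outside a job is an arbitrary choice):

```shell
# Derive the hhblits thread count from the Slurm allocation; fall back to
# 2 threads when SLURM_CPUS_PER_TASK is unset (e.g. outside a batch job).
THREADS=${SLURM_CPUS_PER_TASK:-2}
echo "running hhblits with $THREADS threads"
# hhblits -cpu "$THREADS" -i query.seq -d ./db/db -o query.hhr
```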

Data location

Databases for hhsuite can be found at /fdb/hhsuite.
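
As shown in the examples below, hhblits/hhsearch take the database prefix (the directory plus the shared basename of the database files) as the -d argument. A sketch of constructing that prefix, using uniprot20_2016_02 as an example database name (list /fdb/hhsuite to see what is currently installed):

```shell
# Build the -d argument for hhblits/hhsearch from the shared database
# location. uniprot20_2016_02 is an example name; check /fdb/hhsuite for
# the databases actually available.
DB_ROOT=/fdb/hhsuite
DB_NAME=uniprot20_2016_02
DB_PREFIX=$DB_ROOT/$DB_NAME/$DB_NAME
echo "$DB_PREFIX"
```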

Running hhblits/hhsearch efficiently on the cluster

hhblits and hhsearch perform many random file accesses and small reads. The central file systems (i.e. /data, /scratch, or /fdb) do not perform well under this type of load, so searching a database stored directly on /fdb will be slow. It may also tax the file system enough to slow down other users' computations. We therefore recommend copying the database to be searched to lscratch, preferably on nodes with SSD storage. The ideal usage pattern for large hhblits/hhsearch jobs is thus to allocate a node (or nodes) exclusively, copy the database to lscratch, and run all computations on that node.

On Helix

For anything other than the simplest processes please use an interactive session or a batch job.

Batch job on Biowulf

Create a batch script similar to the following example:

#!/bin/bash
# this file is hhblits.batch

module load hhsuite || exit 1
cd /lscratch/$SLURM_JOB_ID || exit 1
cp -r /fdb/hhsuite/uniprot20_2016_02 .

hhblits -i /usr/local/apps/hhsuite/TEST_DATA/query.seq \
  -d ./uniprot20_2016_02/uniprot20_2016_02 \
  -cpu $SLURM_CPUS_PER_TASK -o test.hhr \
  -oa3m test.a3m -n 6

hhmake -i test.a3m -o test.hhm
addss.pl test.hhm test_addss.hhm -hmm
cp test* /path/to/output/dir

Submit to the queue with sbatch:

biowulf$ sbatch --partition=norm --cpus-per-task=6 \
  --gres=lscratch:100 hhblits.batch

Interactive job on Biowulf

Allocate an interactive session with sinteractive and use it as described above:

biowulf$ sinteractive --constraint=ssd800 --gres=lscratch:100 \
    --cpus-per-task=10
node$ module load hhsuite
[+] Loading openmpi 1.10.3 for GCC 5.3.0
[+] Loading gcc 5.3.0 ...
[+] Loading hhsuite 3.0-beta.1
node$ cd /lscratch/$SLURM_JOB_ID
node$ cp -r /fdb/hhsuite/uniprot20_2016_02 .
node$ hhblits -cpu 10 \
    -i /usr/local/apps/hhsuite/DOWNLOADS/hh-suite/data/query.a3m \
    -d uniprot20_2016_02/uniprot20_2016_02 -o query.hhr
node$ exit
biowulf$