High-Performance Computing at the NIH
Hmmer on Biowulf & Helix

Profile hidden Markov models for biological sequence analysis

Profile hidden Markov models (profile HMMs) allow sensitive database searching using a statistical description of a sequence family's consensus. HMMER is an implementation of profile HMMs for biological sequence analysis.

HMMER (pronounced 'hammer', as in a more precise mining tool than BLAST) was developed by Sean Eddy at Washington University in St. Louis. The HMMER website is hmmer.janelia.org.

HMMER User Guide (PDF)

HMMER is a CPU-intensive program and is parallelized using threads, so each instance of hmmsearch or the other search programs can use all the CPUs available on a node.

The easiest way to add the HMMER executables to your path is by using the module command: module load hmmer. This will load the latest version.

Searching a sequence database with a profile HMM

One use of HMMER is to search a sequence database with a single profile HMM created from a set of aligned sequences. You would first align your set of sequences with hmmalign or a program such as ClustalW, then build a profile HMM from the alignment with hmmbuild, and finally search a database with this profile HMM using hmmsearch. (Read the HMMER User Guide for details on the format of the aligned sequence file.)

The input file globins4.sto used in this example is available in /usr/local/apps/hmmer/tutorial/ along with other sample files.

Create a batch script along the following lines:

#!/bin/bash
# this file is hmmer.bat

# load HMMER version 3.1b1 (use 'module load hmmer' for the latest version)
module load hmmer/3.1b1

echo "Running on $(( SLURM_CPUS_PER_TASK - 1 )) cpus"

# cd to the submitting directory
cd "$SLURM_SUBMIT_DIR"

# copy the sample Stockholm file 
cp /usr/local/apps/hmmer/tutorial/globins4.sto .

hmmbuild --cpu $(( SLURM_CPUS_PER_TASK - 1 )) globins4.hmm globins4.sto
hmmsearch --cpu $(( SLURM_CPUS_PER_TASK - 1 )) globins4.hmm /fdb/fastadb/nr.aa.fas > globins4.out

Submit this job with:

sbatch --cpus-per-task=4 hmmer.bat

The default mem-per-CPU allocation is 2 GB, which is sufficient for this example. If your own hmmer job requires more than 2 * number-of-cpus-requested GB of memory (2 * 4 = 8 GB in the example above), you should specify this with:

sbatch --cpus-per-task=# --mem=#g hmmer.bat

Note that HMMER is set to use one less than the allocated CPUs. This is because the program spawns one additional process beyond the number of threads.
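
The memory and thread arithmetic above can be sketched as a small helper script. This is only an illustration: the function names sbatch_command and hmmer_threads are hypothetical, the 2 GB per CPU default is taken from the text above, and keeping the thread count at a minimum of 1 is an assumption made here so that a single-CPU allocation does not produce a zero or negative value.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of Slurm or HMMER.
DEFAULT_GB_PER_CPU=2   # default mem-per-CPU allocation stated above

# Print the value to pass to HMMER's --cpu option: one less than the
# allocated CPUs, but never below 1 (the lower bound is an assumption).
hmmer_threads() {
    local cpus=$1
    local threads=$(( cpus - 1 ))
    if (( threads < 1 )); then
        threads=1
    fi
    echo "$threads"
}

# Print an sbatch command, adding --mem only when the job needs more
# than the default allocation (2 GB x number of CPUs requested).
sbatch_command() {
    local cpus=$1 needed_gb=$2
    local default_gb=$(( cpus * DEFAULT_GB_PER_CPU ))
    if (( needed_gb > default_gb )); then
        echo "sbatch --cpus-per-task=${cpus} --mem=${needed_gb}g hmmer.bat"
    else
        echo "sbatch --cpus-per-task=${cpus} hmmer.bat"
    fi
}

sbatch_command 4 8     # fits within the default 4 x 2 = 8 GB
sbatch_command 4 20    # exceeds the default, so --mem is added
echo "hmmsearch would run with --cpu $(hmmer_threads 4)"
```

The script only prints the commands, so the output can be checked before anything is submitted.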

Searching a profile HMM database with a query sequence

The hmmscan program is for annotating all the different known/detectable domains in a given sequence. If you have only a single sequence, you could run hmmscan interactively on Helix or on a Biowulf interactive session. If you have several query sequences, it is advantageous to run them simultaneously on Biowulf.

hmmscan runs against an HMM database such as Pfam. The Pfam database is indexed for hmmscan in /fdb/hmmer/pfam. You can also create your own HMM database; see the HMMER User Guide for details.

Create a swarm command file with one line for each of the query sequences. Sample swarm command file:

---------------- file swarm.cmd ----------------------------------------------------
hmmscan --cpu $(( SLURM_CPUS_PER_TASK - 1 )) -o seq1.out /fdb/hmmer/pfam/Pfam_fs   seq1 
hmmscan --cpu $(( SLURM_CPUS_PER_TASK - 1 )) -o seq2.out /fdb/hmmer/pfam/Pfam_fs   seq2 
hmmscan --cpu $(( SLURM_CPUS_PER_TASK - 1 )) -o seq3.out /fdb/hmmer/pfam/Pfam_fs   seq3
hmmscan --cpu $(( SLURM_CPUS_PER_TASK - 1 )) -o seq4.out /fdb/hmmer/pfam/Pfam_fs   seq4
[....]
------------------------------------------------------------------------------------

Submit this with, for example:

swarm -t 4 -f swarm.cmd --module hmmer

By default, Slurm will read and write in the directory from which you submitted the job. If your input sequences 'seq1', 'seq2', etc. are not in your current working directory, you should provide the full pathname in the swarm command file.

Note that the variable '$SLURM_CPUS_PER_TASK' is used to define the number of CPUs within the swarm command file. The number of threads you set with 'swarm -t #' is passed to this variable and into each command. Since HMMER spawns one additional process, the number of threads is set to one less than the allocated CPUs. Thus, you do not need to hardcode the number of CPUs for each hmmscan run in your swarm command file.
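
When there are many query sequences, the swarm command file can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: make_swarm_file is a hypothetical helper, the query file names are placeholders, and the Pfam path is the one used in the sample above. Each input is written with its full pathname, and the arithmetic expression is kept literal so that it is evaluated only when the job runs.

```shell
#!/bin/bash
# Sketch only: make_swarm_file is a hypothetical helper, not part of swarm.
# Usage: make_swarm_file OUTPUT_FILE QUERY [QUERY ...]
make_swarm_file() {
    local outfile=$1
    shift
    : > "$outfile"    # start with an empty command file
    local seq abs name
    for seq in "$@"; do
        # Resolve the containing directory to get a full pathname.
        abs="$(cd "$(dirname "$seq")" && pwd)/$(basename "$seq")"
        name="$(basename "$seq")"
        # Single quotes keep $(( ... )) literal in the command file.
        printf 'hmmscan --cpu $(( SLURM_CPUS_PER_TASK - 1 )) -o %s.out /fdb/hmmer/pfam/Pfam_fs %s\n' \
            "$name" "$abs" >> "$outfile"
    done
}

# Placeholder query names; replace with your own sequence files.
make_swarm_file swarm.cmd seq1 seq2 seq3
cat swarm.cmd
```

The resulting swarm.cmd can then be submitted with the swarm command shown above.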

Documentation

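The entire HMMER suite of programs is available in /usr/local/apps/hmmer. Note that only some of the programs are parallelized. In HMMER 3 these include hmmsearch and hmmscan, which accept the --cpu option used above (hmmcalibrate and hmmpfam are HMMER 2 programs and are not part of HMMER 3).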

A large collection of protein sequence databases is in /fdb/fastadb/.
Fasta-format databases and update status.

PFAM indexed for HMMER

User Guide for v3.1b1 (PDF)