High-Performance Computing at the NIH
MUSCLE on Biowulf & Helix

Description

MUSCLE is a popular multiple sequence alignment program with good speed and accuracy. It can align hundreds of sequences quickly and has a simple command-line interface with few options.

References

Web sites

Running muscle on Helix

First, load the muscle module, which sets up the environment for running muscle:

helix$ module load muscle
helix$ module list
Currently Loaded Modulefiles:
    1) muscle/3.8.31

Run a simple alignment that writes the aligned sequences in FASTA format:

helix$ cd /data/$USER/test_data
helix$ module load muscle
helix$ muscle -in fasta/pox_e9l.fasta -out fasta/pox_e9l_aln.fasta

MUSCLE v3.8.31 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

pox_e9l 100 seqs, max length 3318, avg  length 3024
00:00:00    12 MB(-1%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00    12 MB(-1%)  Iter   1  100.00%  K-mer dist pass 2
00:00:29  207 MB(-10%)  Iter   1  100.00%  Align node
00:00:29  207 MB(-10%)  Iter   1  100.00%  Root alignment
00:00:39  207 MB(-10%)  Iter   2  100.00%  Refine tree
00:00:39  207 MB(-10%)  Iter   2  100.00%  Root alignment
00:00:39  207 MB(-10%)  Iter   2  100.00%  Root alignment
00:02:02  207 MB(-10%)  Iter   3  100.00%  Refine biparts
00:03:27  207 MB(-10%)  Iter   4  100.00%  Refine biparts
00:03:28  207 MB(-10%)  Iter   5  100.00%  Refine biparts
00:03:28  207 MB(-10%)  Iter   5  100.00%  Refine biparts

On Helix, using a single core, MUSCLE took about 3.5 minutes to align 100 sequences of about 3000 nt each; the log above reports a peak of 207 MB of memory.
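A quick sanity check after an alignment is to confirm that the output contains as many sequences as the input. The following is a minimal sketch, not part of MUSCLE itself; the function names count_seqs and check_alignment are hypothetical, and it assumes every FASTA record starts with a '>' header line.

```shell
#!/bin/bash
# Count FASTA records (header lines starting with '>') in a file.
count_seqs() {
    grep -c '^>' "$1"
}

# Compare the record counts of an input FASTA file and its alignment.
# Returns 0 if the counts match, non-zero otherwise.
check_alignment() {
    local in_count out_count
    in_count=$(count_seqs "$1")
    out_count=$(count_seqs "$2")
    if [ "$in_count" -eq "$out_count" ]; then
        echo "OK: $out_count sequences in both files"
    else
        echo "MISMATCH: $in_count input vs $out_count aligned" >&2
        return 1
    fi
}

# Example call, using the file names from the run above:
#   check_alignment fasta/pox_e9l.fasta fasta/pox_e9l_aln.fasta
```

The check only compares counts; it does not verify that the aligned sequences all have the same length.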

Running a single muscle batch job on Biowulf

Create a batch script file similar to the one below:

#!/bin/bash
set -e

cd /data/$USER/test_data
module load muscle
muscle -in fasta/seqs.fasta -out fasta/seqs_aln.fasta

Submit the script to the batch queue. By default, the job is allocated 2 cores:

biowulf$ sbatch muscle.sh
biowulf$ jobload -u $USER
     JOBID      RUNTIME     NODES   CPUS    AVG CPU%            MEMORY
                                                              Used/Alloc
     17417     00:01:58      p999      2       50.00      57.1 MB/1.5 GB
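If a job needs more than the default memory allocation, the request can be placed inside the batch script as an #SBATCH directive. The sketch below writes such a script and checks it for syntax errors without running it. The file name muscle_mem.sh and the 2g value are illustrative assumptions, not site recommendations.

```shell
#!/bin/bash
# Write a batch script (hypothetical name: muscle_mem.sh) that requests
# memory through an #SBATCH directive. The 2g value is only an example.
cat > muscle_mem.sh <<'EOF'
#!/bin/bash
#SBATCH --mem=2g
set -e

cd /data/$USER/test_data
module load muscle
muscle -in fasta/seqs.fasta -out fasta/seqs_aln.fasta
EOF

# Check the generated script for shell syntax errors without executing it.
bash -n muscle_mem.sh && echo "muscle_mem.sh: syntax OK"
```

The script would then be submitted with 'sbatch muscle_mem.sh'. Passing the same option on the command line, as in 'sbatch --mem=2g muscle.sh', should have the same effect.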

Running a swarm of muscle batch jobs on Biowulf

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file with one command per line (line continuations are allowed). For example, here is a sample file:

muscle -in seqs1.fa -out seqs1_aln.fa
muscle -in seqs2.fa -out seqs2_aln.fa
muscle -in seqs3.fa -out seqs3_aln.fa

Submit this job with

swarm -f cmdfile --module muscle

By default, each line of the command file above will run on two cores using up to 1 GB of memory. If a command needs more memory, specify the amount per command in GB with the '-g #' flag to swarm. For example, if each command requires 8 GB:

swarm -g 8 -f cmdfile --module muscle
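When there are many input files, the swarm command file can be generated rather than written by hand. The following is a minimal sketch under some assumptions: the function name make_swarm_file is hypothetical, inputs are assumed to end in .fa, and file names are assumed to contain no spaces.

```shell
#!/bin/bash
# Write one muscle command per FASTA file in a directory to a swarm command file.
# Usage: make_swarm_file INPUT_DIR CMDFILE
make_swarm_file() {
    local dir=$1 cmdfile=$2 f
    : > "$cmdfile"                          # start with an empty command file
    for f in "$dir"/*.fa; do
        [ -e "$f" ] || continue             # no matching files: skip the literal pattern
        case $f in
            *_aln.fa) continue ;;           # skip alignments left over from earlier runs
        esac
        # seqs1.fa -> seqs1_aln.fa
        echo "muscle -in $f -out ${f%.fa}_aln.fa" >> "$cmdfile"
    done
}

# Example call:
#   make_swarm_file /data/$USER/test_data cmdfile
```

The resulting file can be submitted with the swarm commands shown above.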

Running an interactive job on Biowulf

Some work is best done interactively. Such jobs should not be run on the Biowulf login node. Instead, allocate an interactive node as shown below and do the interactive work there.

biowulf$ sinteractive
salloc.exe: Granted job allocation 17418
slurm stepprolog here!
srun: error: x11: no local DISPLAY defined, skipping
                                                    Begin slurm taskprolog!
                                                    End slurm taskprolog!
nodexx$ module load muscle
nodexx$ cd /data/$USER/test_data
nodexx$ muscle -in seq1.fasta -out seq1_aln.fasta
...
nodexx$ exit

If more memory is needed, it can be requested with --mem. For example:

biowulf$ sinteractive --mem=8g

Documentation