FAMSA

FAMSA on Biowulf


FAMSA (Fast and Accurate Multiple Sequence Alignment) is a progressive algorithm for large-scale multiple sequence alignment of protein families.

Latest features include:

Fast guide tree heuristic called Medoid Tree (-medoidtree switch) for ultra-scale alignments:
- the entire Pfam-A v33.1 in its largest NCBI variant (over 18 thousand families, 60 GB of raw FASTA files) was analyzed in 8 hours,
- the family PF00005 of 3 million ABC transporters was aligned in 5 minutes using 24 GB of RAM.
Remarkable time and memory optimizations - SLINK has been replaced with Prim’s minimum spanning tree algorithm for constructing the default (single linkage) guide trees. NOTE: This may change alignment quality slightly compared to FAMSA 1 because ties are resolved differently.
Neighbour joining guide trees (-gt nj option). NOTE: Neighbour joining trees are calculated using the original O(N^3) algorithm, so their applicability to large sets is limited (unless they are used as subtrees with the Medoid Tree heuristic).
Option for compressing the output alignment with gzip (-gz switch).
Compatibility with the ARM64 (ARMv8) architecture (including Apple M1).
Duplicate removal - redundant sequences are by default removed prior to the alignment and restored afterwards (feature introduced in revision 2.1.0). This can change output alignments when a family contains duplicates. The old behaviour can be obtained by using the -keep-duplicates switch.
Profile-profile alignments (available by specifying two input FASTA files; introduced in revision 2.2.0).
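
The switches named in the list above can be combined on the command line roughly as follows. This is a minimal sketch: the file names are placeholders, and the `run` helper and `DRY_RUN` variable are hypothetical conveniences (not part of FAMSA) that print each command instead of executing it. Only switches mentioned on this page are used.

```shell
#!/bin/bash
# Dry-run sketch of the feature switches listed above.
# File names are placeholders; "run" and DRY_RUN are illustrative helpers.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Medoid Tree heuristic for very large families
run famsa -medoidtree big_family.fasta big_family.aln

# Neighbour joining guide tree (cubic time, so best kept to smaller sets)
run famsa -gt nj small_family.fasta nj.aln

# Neighbour joining subtrees inside the Medoid Tree heuristic
run famsa -medoidtree -gt nj big_family.fasta nj.medoid.aln

# Gzip-compressed output alignment
run famsa -gz family.fasta family.aln.gz

# Old behaviour: do not remove duplicate sequences before aligning
run famsa -keep-duplicates family.fasta family.dups.aln

# Profile-profile alignment: two input FASTA files, one output file
run famsa profile1.fasta profile2.fasta merged.aln
```

Setting `DRY_RUN=0` after `module load famsa` would execute the commands instead of printing them.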

References:

Deorowicz, S., Debudaj-Grabysz, A. & Gudyś, A. FAMSA: Fast and accurate multiple sequence alignment of huge protein families. Sci Rep 6, 33964 (2016).

Documentation

FAMSA GitHub site

Important Notes

Module Name: famsa (see the modules page for more information)
Multithreaded
Environment variables set
- FAMSA_TESTDATA: Location of example files

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):

[user@biowulf]$ sinteractive -c4 --mem=8G
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load famsa

[user@cn3144 ~]$ cp -r $FAMSA_TESTDATA .

[user@cn3144 ~]$ cd TEST_DATA

[user@cn3144 ~]$ famsa -t 4 adeno_fiber/adeno_fiber sl.aln
FAMSA (Fast and Accurate Multiple Sequence Alignment)
  version 2.2.2- (2022-10-09)
  S. Deorowicz, A. Debudaj-Grabysz, A. Gudys

Done!

[user@cn3144 ~]$ famsa -t 4 -medoidtree -gt upgma hemopexin/hemopexin upgma.medoid.aln
FAMSA (Fast and Accurate Multiple Sequence Alignment)
  version 2.2.2- (2022-10-09)
  S. Deorowicz, A. Debudaj-Grabysz, A. Gudys

Done!

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226

[user@biowulf ~]$
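
After a run such as the one above, a quick sanity check is to confirm that the alignment contains as many sequences as the input. The sketch below assumes the alignment is written in FASTA format (one `>` header per sequence); `count_seqs` and `same_seq_count` are hypothetical helper names, not part of FAMSA.

```shell
#!/bin/bash
# Count FASTA records by counting header lines that start with ">".
count_seqs() {
    grep -c '^>' -- "$1"
}

# Succeed only if both files hold the same number of sequences.
same_seq_count() {
    [ "$(count_seqs "$1")" -eq "$(count_seqs "$2")" ]
}

# Example (paths as in the session above):
#   same_seq_count adeno_fiber/adeno_fiber sl.aln && echo "sequence counts match"
```

Because duplicates are restored after alignment by default, the two counts are expected to agree even when the input contains redundant sequences.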

Batch job

Most jobs should be run as batch jobs.

Create a batch input file (e.g. famsa.sh). For example:

#!/bin/bash
set -e
module load famsa
famsa -t $SLURM_CPUS_PER_TASK input.fasta output.aln

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] famsa.sh
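
The batch script above can be generalized to take the input file as an argument. The following is a sketch under stated assumptions: the script name `famsa_one.sh` and the helpers `aln_name` and `thread_count` are hypothetical, and falling back to one thread when `SLURM_CPUS_PER_TASK` is unset is a defensive choice rather than documented Biowulf behaviour.

```shell
#!/bin/bash
# famsa_one.sh (hypothetical name): align the FASTA file given as $1.
set -e

# Build the output path: same directory, extension replaced by .aln
aln_name() {
    local dir base
    dir="$(dirname -- "$1")"
    base="$(basename -- "$1")"
    printf '%s/%s.aln\n' "$dir" "${base%.*}"
}

# Threads allocated by Slurm, or 1 when run outside a job
thread_count() {
    printf '%s\n' "${SLURM_CPUS_PER_TASK:-1}"
}

# Only run the alignment when an input file was supplied
if [ -n "${1:-}" ]; then
    module load famsa
    famsa -t "$(thread_count)" "$1" "$(aln_name "$1")"
fi
```

It could be submitted with, for example, `sbatch --cpus-per-task=8 --mem=16g famsa_one.sh input.fasta`; the resource values are placeholders to be adjusted to the size of the family.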

Swarm of Jobs

A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. famsa.swarm). For example:

famsa -t $SLURM_CPUS_PER_TASK input1.fasta output1.aln
famsa -t $SLURM_CPUS_PER_TASK input2.fasta output2.aln
famsa -t $SLURM_CPUS_PER_TASK input3.fasta output3.aln
famsa -t $SLURM_CPUS_PER_TASK input4.fasta output4.aln

Submit this job using the swarm command.

swarm -f famsa.swarm [-g #] [-t #] --module famsa

where

`-g #`	Number of gigabytes of memory required for each process (1 line in the swarm command file).
`-t #`	Number of threads/CPUs required for each process (1 line in the swarm command file).
`--module famsa`	Loads the famsa module for each subjob in the swarm
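
For many input files, writing the swarmfile by hand becomes tedious. As a sketch, the hypothetical helper `make_swarm` below prints one `famsa` line per `*.fasta` file in a directory. The `.fasta` extension and the `.aln` output naming are assumptions for illustration, and file names are assumed to contain no spaces. `$SLURM_CPUS_PER_TASK` is deliberately left unexpanded so that it is evaluated inside each subjob.

```shell
#!/bin/bash
# Print one swarm command per FASTA file found in directory $1.
make_swarm() {
    local f
    for f in "$1"/*.fasta; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        # Single quotes keep $SLURM_CPUS_PER_TASK literal in the swarmfile
        printf 'famsa -t $SLURM_CPUS_PER_TASK %s %s.aln\n' "$f" "${f%.fasta}"
    done
}

# Usage: make_swarm /path/to/fasta_dir > famsa.swarm
```

The resulting `famsa.swarm` can then be submitted with the `swarm` command shown above.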