FAMSA on Biowulf
FAMSA: Fast and Accurate Multiple Sequence Alignments. It provides a progressive algorithm for large-scale multiple sequence alignments of proteins.
Latest features include:
- Fast guide tree heuristic called Medoid Tree (-medoidtree switch) for ultra-scale alignments:
  - the entire Pfam-A v33.1 in its largest NCBI variant (over 18 thousand families, 60 GB of raw FASTA files) was analyzed in 8 hours,
  - the family PF00005 of 3 million ABC transporters was aligned in 5 minutes using 24 GB of RAM.
- Remarkable time and memory optimizations - SLINK has been replaced with Prim’s minimum spanning tree algorithm when constructing default (single linkage) guide trees. NOTE: This may change quality results slightly compared to FAMSA 1 because ties are resolved differently.
- Neighbour joining guide trees (-gt nj option). NOTE: Neighbour joining trees are calculated using the original O(N^3) algorithm, so their applicability to large sets is limited (unless they are used as subtrees with the Medoid Tree heuristic).
- Option for compressing the output alignment to gzip (-gz switch).
- Compatibility with ARM64 8 architecture (including Apple M1).
- Duplicate removal - redundant sequences are by default removed prior to the alignment and restored afterwards (feature introduced in revision 2.1.0). This can change output alignments when a family contains duplicates. The old behaviour can be obtained with the -keep-duplicates switch.
- Profile-profile alignments (available by specifying two input FASTA files; introduced in revision 2.2.0).
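The switches named in the list above can be combined on the command line roughly as follows. This is a minimal sketch, not an excerpt from the FAMSA documentation: the file names (family.fasta, profile1.fasta, and so on) are hypothetical, and the `run` helper is an illustrative wrapper that prints each command and executes it only when famsa is on the PATH and the input file exists.

```shell
#!/bin/bash
# Hypothetical input file; replace with a real FASTA file.
in=family.fasta

# Illustrative helper (not part of FAMSA): print the command, and run it
# only when famsa is available and the input file exists.
run() {
    echo "+ $*"
    if command -v famsa >/dev/null 2>&1 && [ -f "$in" ]; then
        "$@"
    fi
}

# Medoid Tree heuristic for very large families
run famsa -medoidtree "$in" medoid.aln

# Neighbour joining guide tree (O(N^3), so best kept to smaller sets)
run famsa -gt nj "$in" nj.aln

# Neighbour joining used for subtrees under the Medoid Tree heuristic
run famsa -medoidtree -gt nj "$in" nj.medoid.aln

# gzip-compressed output
run famsa -gz "$in" out.aln.gz

# Keep duplicate sequences instead of removing and restoring them
run famsa -keep-duplicates "$in" dups.aln

# Profile-profile alignment: two input files (hypothetical names), one output
run famsa profile1.fasta profile2.fasta merged.aln
```

On a system without famsa the script only prints the commands, which makes it usable as a dry run before submitting real jobs.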
References:
- Deorowicz, S., Debudaj-Grabysz, A. & Gudyś, A. FAMSA: Fast and accurate multiple sequence alignment of huge protein families. Sci Rep 6, 33964 (2016).
Documentation
Important Notes
- Module Name: famsa (see the modules page for more information)
- Multithreaded
- Environment variables set
  - FAMSA_TESTDATA: Location of example files
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program.
Sample session (user input follows the $ prompt):
[user@biowulf]$ sinteractive -c4 --mem=8G
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144 ~]$ module load famsa
[user@cn3144 ~]$ cp -r $FAMSA_TESTDATA .
[user@cn3144 ~]$ cd TEST_DATA
[user@cn3144 ~]$ famsa -t 4 adeno_fiber/adeno_fiber sl.aln
FAMSA (Fast and Accurate Multiple Sequence Alignment)
version 2.2.2- (2022-10-09)
S. Deorowicz, A. Debudaj-Grabysz, A. Gudys
Done!
[user@cn3144 ~]$ famsa -t 4 -medoidtree -gt upgma hemopexin/hemopexin upgma.medoid.aln
FAMSA (Fast and Accurate Multiple Sequence Alignment)
version 2.2.2- (2022-10-09)
S. Deorowicz, A. Debudaj-Grabysz, A. Gudys
Done!
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
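After a run like the one in the sample session, a quick sanity check is to compare the number of sequences in the input and in the output. The sketch below assumes the output alignment is written in FASTA format, so every sequence starts with a `>` header line. The function names `count_seqs` and `check_alignment` are hypothetical helpers, not part of FAMSA.

```shell
#!/bin/bash
# Count FASTA records (header lines starting with '>') in a file.
count_seqs() {
    # grep -c exits non-zero when the count is 0, so ignore its status
    grep -c '^>' "$1" || true
}

# Compare the record counts of an input file and an output alignment.
check_alignment() {
    n_in=$(count_seqs "$1")
    n_out=$(count_seqs "$2")
    if [ "$n_in" -eq "$n_out" ]; then
        echo "OK: $n_out sequences"
    else
        echo "MISMATCH: input=$n_in output=$n_out"
        return 1
    fi
}

# Run the check only if the files from the sample session are present.
if [ -f adeno_fiber/adeno_fiber ] && [ -f sl.aln ]; then
    check_alignment adeno_fiber/adeno_fiber sl.aln
fi
```

Because duplicates are restored after alignment by default, the two counts are expected to match even for families that contain redundant sequences.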
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. famsa.sh). For example:
#!/bin/bash
set -e
module load famsa
famsa -t $SLURM_CPUS_PER_TASK input.fasta output.aln
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] famsa.sh
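As a variation, the resource requests can be written into the batch script itself as #SBATCH directives, so that the script can be submitted with a plain `sbatch famsa.sh`. The following is a sketch under stated assumptions: the CPU and memory values are illustrative rather than recommendations, the file names are placeholders, and the fallback to 2 threads is only there so the script also works when tested outside Slurm.

```shell
#!/bin/bash
#SBATCH --cpus-per-task=8     # illustrative value
#SBATCH --mem=16g             # illustrative value
set -e

# Use the Slurm allocation when present, otherwise fall back to 2 threads.
threads=${SLURM_CPUS_PER_TASK:-2}
input=input.fasta             # placeholder file names
output=output.aln

# The module command only exists on the cluster.
if type module >/dev/null 2>&1; then
    module load famsa || echo "warning: could not load the famsa module" >&2
fi

famsa_cmd="famsa -t $threads $input $output"
echo "Command: $famsa_cmd"

# Run only when famsa is available and the input file exists.
if command -v famsa >/dev/null 2>&1 && [ -f "$input" ]; then
    famsa -t "$threads" "$input" "$output"
else
    echo "famsa or $input not found; nothing was run"
fi
```

Options given on the sbatch command line take precedence over the #SBATCH lines, so the values in the script act as defaults.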
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.
Create a swarmfile (e.g. famsa.swarm). For example:
famsa -t $SLURM_CPUS_PER_TASK input1.fasta output1.aln
famsa -t $SLURM_CPUS_PER_TASK input2.fasta output2.aln
famsa -t $SLURM_CPUS_PER_TASK input3.fasta output3.aln
famsa -t $SLURM_CPUS_PER_TASK input4.fasta output4.aln
Submit this job using the swarm command.
swarm -f famsa.swarm [-g #] [-t #] --module famsa
where
-g # | Number of gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file) |
--module famsa | Loads the famsa module for each subjob in the swarm |
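For more than a handful of inputs, the swarmfile can be generated rather than typed by hand. The sketch below writes one famsa line per `*.fasta` file in a directory. The function name `make_swarm` and the directory name `fasta_dir` are hypothetical, and the `.aln` output naming is just one possible convention.

```shell
#!/bin/bash
# Write one famsa command per FASTA file in directory $1 to swarmfile $2.
make_swarm() {
    dir=$1
    swarmfile=$2
    : > "$swarmfile"                      # start with an empty swarmfile
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue           # no matching files: skip
        base=$(basename "$f" .fasta)
        # Single quotes keep $SLURM_CPUS_PER_TASK literal, so it is expanded
        # inside each subjob and not while the swarmfile is being written.
        printf 'famsa -t $SLURM_CPUS_PER_TASK %s %s.aln\n' \
            "$f" "$dir/$base" >> "$swarmfile"
    done
}

# Example: build famsa.swarm from a (hypothetical) directory of FASTA files.
if [ -d fasta_dir ]; then
    make_swarm fasta_dir famsa.swarm
    # Then submit, for instance with 8 GB and 4 threads per subjob:
    #   swarm -f famsa.swarm -g 8 -t 4 --module famsa
fi
```

Each generated line is an independent command, matching the swarm requirement that all subjobs use identical resources.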