trimAl: a tool for automated alignment trimming

trimAl is a tool for the automated removal of spurious sequences or poorly aligned regions from a multiple sequence alignment. It can consider several parameters, alone or in combination, to select the most reliable positions in the alignment. These include the proportion of sequences with a gap, the level of residue similarity and, if several alignments for the same set of sequences are provided, the consistency of columns among the alignments. In addition, trimAl lets the user manually specify a set of columns to be removed from the alignment.
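To make the first of these parameters concrete, the sketch below computes the proportion of sequences with a gap for every column of a FASTA alignment. It is an illustration only, not trimAl's own code: the function name `gap_fraction` is hypothetical, only `-` is counted as a gap, and all sequences are assumed to have the same length. Columns are numbered from 0, matching the column ranges given in the `trimal -h` output further down.

```shell
# gap_fraction FILE (hypothetical helper, not part of trimAl):
# for each column of a FASTA alignment, print
# "<0-based column index> <fraction of sequences with a gap in that column>".
gap_fraction() {
    awk '
        /^>/ { n++; next }                            # header line: start sequence number n
        { gsub(/[ \t\r]/, ""); seq[n] = seq[n] $0 }   # join wrapped sequence lines
        END {
            len = length(seq[1])                      # alignment length, taken from the first sequence
            for (c = 1; c <= len; c++) {
                gaps = 0
                for (s = 1; s <= n; s++)
                    if (substr(seq[s], c, 1) == "-") gaps++
                printf "%d %.2f\n", c - 1, gaps / n
            }
        }
    ' "$1"
}
```

Running `gap_fraction alignment.fasta` on an alignment of your own lists one column per line. According to the first example in the help text, `-gt 0.9` corresponds to removing columns in which this fraction is 10% or more.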

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive -c 16 --mem 45g --gres=lscratch:20
[user@cn3144 ~]$ module load trimAl
[+] Loading singularity  3.8.5-1  on cn3144
[+] Loading trimAl  1.5.0   
[user@cn3144 ~]$ ls $TRIMAL_BIN 
readal  statal  trimal
[user@cn3144 ~]$ readal -h 
...
[user@cn3144 ~]$ statal -h 
... 
[user@cn3144 ~]$ trimal -h 

trimAl v1.5.rev0 build[2024-05-27]. 2009-2020. Victor Fernández-Rodríguez, Toni Gabaldón, and Salvador Capella
trimAl webpage: https://trimal.readthedocs.io

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, the last available version.

Please cite:
                trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.
                Salvador Capella-Gutierrez; Jose M. Silla-Martinez; Toni Gabaldon.
                Bioinformatics 2009, 25:1972-1973.

Basic usage
        trimal -in <inputfile> -out <outputfile> -(other options).

Common options (for a complete list please see the User Guide or visit https://trimal.readthedocs.io):

    -h                          Print this information and show some examples.
    --version                   Print the trimAl version.

    -in <inputfile>             Input file in several formats (clustal, fasta, NBRF/PIR, nexus, phylip3.2, phylip).

    -compareset <inputfile>     Input list of paths for the files containing the alignments to compare.
    -forceselect <inputfile>    Force selection of the given input file in the files comparison method.

    -backtrans <inputfile>      Use a Coding Sequences file to get a backtranslation for a given AA alignment
    -ignorestopcodon            Ignore stop codons in the input coding sequences
    -splitbystopcodon           Split input coding sequences up to first stop codon appearance

    -matrix <inpufile>          Input file for user-defined similarity matrix (default is Blosum62).
    --alternative_matrix <name> Select an alternative similarity matrix already loaded.
                                Only available 'degenerated_nt_identity'

    -out <outputfile>           Output alignment in the same input format (default stdout). (default input format)
    -htmlout <outputfile>       Get a summary of trimal's work in an HTML file.

    -keepheader                 Keep original sequence header including non-alphanumeric characters.
                                Only available for input FASTA format files. (future versions will extend this feature)

    -nbrf                       Output file in NBRF/PIR format
    -mega                       Output file in MEGA format
    -nexus                      Output file in NEXUS format
    -clustal                    Output file in CLUSTAL format

    -fasta                      Output file in FASTA format
    -fasta_m10                  Output file in FASTA format. Sequences name length up to 10 characters.

    -phylip                     Output file in PHYLIP/PHYLIP4 format
    -phylip_m10                 Output file in PHYLIP/PHYLIP4 format. Sequences name length up to 10 characters.
    -phylip_paml                Output file in PHYLIP format compatible with PAML
    -phylip_paml_m10            Output file in PHYLIP format compatible with PAML. Sequences name length up to 10 characters.
    -phylip3.2                  Output file in PHYLIP3.2 format
    -phylip3.2_m10              Output file in PHYLIP3.2 format. Sequences name length up to 10 characters.

    -complementary              Get the complementary alignment.
    -colnumbering               Get the relationship between the columns in the old and new alignment.

    -selectcols { n,l,m-k }     Selection of columns to be removed from the alignment. Range: [0 - (Number of Columns - 1)]. (see User Guide).
    -selectseqs { n,l,m-k }     Selection of sequences to be removed from the alignment. Range: [0 - (Number of Sequences - 1)]. (see User Guide).

    -gt -gapthreshold <n>       1 - (fraction of sequences with a gap allowed). Range: [0 - 1]
    -st -simthreshold <n>       Minimum average similarity allowed. Range: [0 - 1]
    -ct -conthreshold <n>       Minimum consistency value allowed.Range: [0 - 1]
    -cons <n>                   Minimum percentage of the positions in the original alignment to conserve. Range: [0 - 100]

    -nogaps                     Remove all positions with gaps in the alignment.
    -noallgaps                  Remove columns composed only by gaps.
    -keepseqs                   Keep sequences even if they are composed only by gaps.

    -gappyout                   Use automated selection on "gappyout" mode. This method only uses information based on gaps' distribution. (see User Guide).
    -strict                     Use automated selection on "strict" mode. (see User Guide).
    -strictplus                 Use automated selection on "strictplus" mode. (see User Guide).
                               (Optimized for Neighbour Joining phylogenetic tree reconstruction).

    -automated1                 Use a heuristic selection of the automatic method based on similarity statistics. (see User Guide). (Optimized for Maximum Likelihood phylogenetic tree reconstruction).

    -terminalonly               Only columns out of internal boundaries (first and last column without gaps) are
                                candidates to be trimmed depending on the selected method
    --set_boundaries { l,r }    Set manually left (l) and right (r) boundaries - only columns out of these boundaries are
                                candidates to be trimmed depending on the selected method. Range: [0 - (Number of Columns - 1)]
    -block <n>                  Minimum column block size to be kept in the trimmed alignment. Available with manual and automatic (gappyout) methods

    -resoverlap                 Minimum overlap of a positions with other positions in the column to be considered a "good position". Range: [0 - 1]. (see User Guide).
    -seqoverlap                 Minimum percentage of "good positions" that a sequence must have in order to be conserved. Range: [0 - 100](see User Guide).

    -clusters <n>               Get the most Nth representatives sequences from a given alignment. Range: [1 - (Number of sequences)]
    -maxidentity <n>            Get the representatives sequences for a given identity threshold. Range: [0 - 1].

    -w <n>                      (half) Window size, score of position i is the average of the window (i - n) to (i + n).
    -gw <n>                     (half) Window size only applies to statistics/methods based on Gaps.
    -sw <n>                     (half) Window size only applies to statistics/methods based on Similarity.
    -cw <n>                     (half) Window size only applies to statistics/methods based on Consistency.

    -sgc                        Print gap scores for each column in the input alignment.
    -sgt                        Print accumulated gap scores for the input alignment.
    -ssc                        Print similarity scores for each column in the input alignment.
    -sst                        Print accumulated similarity scores for the input alignment.
    -sfc                        Print sum-of-pairs scores for each column from the selected alignment
    -sft                        Print accumulated sum-of-pairs scores for the selected alignment
    -sident                     Print identity scores matrix for all sequences in the input alignment. (see User Guide).
    -soverlap                   Print overlap scores matrix for all sequences in the input alignment. (see User Guide).

Some Examples:

1) Removes all positions in the alignment with gaps in 10% or more of
   the sequences, unless this leaves less than 60% of original alignment.
   In such case, print the 60% best (with less gaps) positions.

   trimal -in <inputfile> -out <outputfile> -gt 0.9 -cons 60

2) As above but, the gap score is averaged over a window starting
   3 positions before and ending 3 positions after each column.

   trimal -in <inputfile> -out <outputfile> -gt 0.9 -cons 60 -w 3

3) Use an automatic method to decide optimal thresholds, based in the gap scores
   from input alignment. (see User Guide for details).

   trimal -in <inputfile> -out <outputfile> -gappyout

4) Use automatic methods to decide optimal thresholds, based on the combination
   of gap and similarity scores. (see User Guide for details).

   trimal -in <inputfile> -out <outputfile> -strictplus

5) Use an heuristic to decide the optimal method for trimming the alignment.
   (see User Guide for details).

   trimal -in <inputfile> -out <outputfile> -automated1

6) Use residues and sequences overlap thresholds to delete some sequences from the
   alignemnt. (see User Guide for details).

   trimal -in <inputfile> -out <outputfile> -resoverlap 0.8 -seqoverlap 75

7) Selection of columns to be deleted from the alignment. The selection can
   be a column number or a column number interval. Start from 0

   trimal -in <inputfile> -out <outputfile> -selectcols { 0,2,3,10,45-60,68,70-78 }

8) Get the complementary alignment from the alignment previously trimmed.

   trimal -in <inputfile> -out <outputfile> -selectcols { 0,2,3,10,45-60,68,70-78 } -complementary

9) Selection of sequences to be deleted from the alignment. Start in 0

   trimal -in <inputfile> -out <outputfile> -selectseqs { 2,4,8-12 }

10) Select the 5 most representative sequences from the alignment

   trimal -in <inputfile> -out <outputfile> -clusters 5
[user@cn3144 ~]$ cp -r $TRIMAL_DATA/* .
Remove all positions in the alignment with gaps in 10% or more of the sequences, unless this leaves less than 60% of the original alignment. In that case, keep the best 60% of positions (those with the fewest gaps):
[user@cn3144 ~]$ trimal -in example.037.AA.bctoNOG.ENOG41099XP.fasta -out example.037.AA.bctoNOG.ENOG41099XP_trimmed.fasta -gt 0.9 -cons 60
[user@cn3144 ~]$ ls -l example.037.AA.bctoNOG.ENOG41099XP.fasta  example.037.AA.bctoNOG.ENOG41099XP_trimmed.fasta
-rw-r--r-- 1 user staff 230484 Nov 26 10:34 example.037.AA.bctoNOG.ENOG41099XP.fasta
-rw-r--r-- 1 user staff 139498 Nov 26 10:36 example.037.AA.bctoNOG.ENOG41099XP_trimmed.fasta
Use an automatic method to decide optimal thresholds, based on the gap scores of the input alignment:
[user@cn3144 ~]$ trimal -in example.019.AA.bctoNOG.ENOG41099HI.fasta -out example.019.AA.bctoNOG.ENOG41099HI_trimmed.fasta -gappyout
[user@cn3144 ~]$ ls -l example.019.AA.bctoNOG.ENOG41099HI.fasta  example.019.AA.bctoNOG.ENOG41099HI_trimmed.fasta
-rw-r--r-- 1 user staff 377416 Nov 26 10:34 example.019.AA.bctoNOG.ENOG41099HI.fasta
-rw-r--r-- 1 user staff 137830 Nov 26 10:57 example.019.AA.bctoNOG.ENOG41099HI_trimmed.fasta
etc.
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
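
The `ls -l` listings in the session above compare only file sizes in bytes. As a rough complement, the following sketch reports the number of sequences and the alignment length of a FASTA file, so that an input alignment and its trimmed output can be compared column-wise. The function name `aln_dims` is hypothetical and not part of trimAl; it assumes FASTA input in which all sequences have the same length, and it measures only the first sequence.

```shell
# aln_dims FILE (hypothetical helper, not part of trimAl):
# print "<number of sequences> <alignment length>" for a FASTA alignment.
aln_dims() {
    awk '
        /^>/ { n++; next }                                   # header line: count one more sequence
        n == 1 { gsub(/[ \t\r]/, ""); len += length($0) }    # add up the (possibly wrapped) first sequence
        END { print n + 0, len + 0 }
    ' "$1"
}
```

For example, `aln_dims example.037.AA.bctoNOG.ENOG41099XP.fasta` followed by the same call on the `_trimmed.fasta` file shows how many columns were removed. For a column-by-column mapping between the old and new alignment, trimAl's own `-colnumbering` option (listed in the help output above) is the more direct route.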