sortmeRNA: next-generation sequence filtering and alignment tool

SortMeRNA is a local sequence alignment tool for filtering, mapping and clustering. The core algorithm is based on approximate seeds and allows for sensitive analysis of NGS reads. The main application of SortMeRNA is filtering rRNA from metatranscriptomic data. SortMeRNA takes as input a file of reads (fasta or fastq format) and one or multiple rRNA database file(s), and sorts apart aligned and rejected reads into two files specified by the user.

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=lscratch:10 --mem=8g 

[user@cn3329 ~]$ module load sortmeRNA  
[+] Loading singularity  3.10.0  on cn3329
[+] Loading sortmeRNA 4.3.6  ...
[user@cn3329 ~]$ sortmerna -h 

  Program:      SortMeRNA version 4.3.6
  Copyright:    2016-2020 Clarity Genomics BVBA:
                Turnhoutseweg 30, 2340 Beerse, Belgium
                2014-2016 Knight Lab:
                Department of Pediatrics, UCSD, La Jolla
                2012-2014 Bonsai Bioinformatics Research Group:
                LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
  Disclaimer:   SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
                implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
                See the GNU Lesser General Public License for more details.
  Contributors: Jenya Kopylova   jenya.kopylov@gmail.com
                Laurent Noé      laurent.noe@lifl.fr
                Pierre Pericard  pierre.pericard@lifl.fr
                Daniel McDonald  wasade@gmail.com
                Mikaël Salson    mikael.salson@lifl.fr
                Hélène Touzet    helene.touzet@lifl.fr
                Rob Knight       robknight@ucsd.edu

  Usage:   sortmerna -ref FILE [-ref FILE] -reads FWD_READS [-reads REV_READS] [OPTIONS]:
  -------------------------------------------------------------------------------------------------------------
  | option            type-format           description                                          default      |
  -------------------------------------------------------------------------------------------------------------

    [REQUIRED]
    --ref             PATH        Required  Reference file (FASTA) absolute or relative path.

       Use mutliple times, once per a reference file


    --reads           PATH        Required  Raw reads file (FASTA/FASTQ/FASTA.GZ/FASTQ.GZ).

       Use twice for files with paired reads.
       The file extensions are Not important. The program automatically
       recognizes the file format as flat/compressed, fasta/fastq



    [COMMON]
    --workdir         PATH        Optional  Workspace directory                         USRDIR/sortmerna/run/

       Default structure: WORKDIR/
                              idx/   (References index)
                              kvdb/  (Key-value storage for alignments)
                              out/   (processing output)
                              readb/ (pre-processed reads/index)


    --kvdb            PATH        Optional  Directory for Key-value database            WORKDIR/kvdb

       KVDB is used for storing the alignment results.


    --idx-dir         PATH        Optional  Directory for storing Reference index.      WORKDIR/idx


    --readb           PATH        Optional  Storage for pre-processed reads             WORKDIR/readb/

       Directory storing the split reads, or the random access index of compressed reads


    --fastx           BOOL        Optional  Output aligned reads into FASTA/FASTQ file
    --sam             BOOL        Optional  Output SAM alignment for aligned reads.


    --SQ              BOOL        Optional  Add SQ tags to the SAM file


    --blast           STR         Optional  output alignments in various Blast-like formats

       Sample values: '0'                    - pairwise
                      '1'                    - tabular (Blast - m 8 format)
                      '1 cigar'              - tabular + column for CIGAR
                      '1 cigar qcov'         - tabular + columns for CIGAR and query coverage
                      '1 cigar qcov qstrand' - tabular + columns for CIGAR, query coverage,
                                               and strand


    --aligned         STR/BOOL    Optional  Aligned reads file prefix [dir/][pfx]       WORKDIR/out/aligned

       Directory and file prefix for aligned output i.e. each
       output file goes into the specified directory with the given prefix.
       The appropriate extension: (fasta|fastq|blast|sam|etc) is automatically added.
       Both 'dir' and 'pfx' are optional.
       The 'dir' can be a relative or an absolute path.
       If 'dir' is not specified, the output is created in the WORKDIR/out/
       If 'pfx' is not specified, the prefix 'aligned' is used
       Examples:
       '-aligned $MYDIR/dir_1/dir_2/1' -> $MYDIR/dir_1/dir_2/1.fasta
       '-aligned dir_1/apfx'           -> $PWD/dir_1/apfx.fasta
       '-aligned dir_1/'               -> $PWD/aligned.fasta
       '-aligned apfx'                 -> $PWD/apfx.fasta
       '-aligned  (no argument)'       -> WORKDIR/out/aligned.fasta


    --other           STR/BOOL    Optional  Non-aligned reads file prefix [dir/][pfx]   WORKDIR/out/other

       Directory and file prefix for non-aligned output i.e. each
       output file goes into the specified directory with the given prefix.
       The appropriate extension: (fasta|fastq|blast|sam|etc) is automatically added.
       Must be used with 'fastx'.
       Both 'dir' and 'pfx' are optional.
       The 'dir' can be a relative or an absolute path.
       If 'dir' is not specified, the output is created in the WORKDIR/out/
       If 'pfx' is not specified, the prefix 'other' is used
       Examples:
       '-other $MYDIR/dir_1/dir_2/1' -> $MYDIR/dir_1/dir_2/1.fasta
       '-other dir_1/apfx'           -> $PWD/dir_1/apfx.fasta
       '-other dir_1/'               -> $PWD/dir_1/other.fasta
       '-other apfx'                 -> $PWD/apfx.fasta
       '-other  (no argument)'       -> aligned_out/other.fasta
                                        i.e. the same output directory
                                        as used for aligned output


    --num_alignments  INT         Optional  Positive integer (INT >=0).

       If used with '-no-best' reports first INT alignments per read reaching
       E-value threshold, which allows to lower the CPU time and memory use.
       Otherwise outputs INT best alignments.
       If INT = 0, all alignments are output


    --no-best         BOOL        Optional  Disable best alignments search                          False

       The 'best' alignment is the highest scoring alignment out of All alignments of a read,
       and the read can potentially be aligned (reaching E-value threshold) to multiple reference
       sequences.
       By default the program searches for best alignments i.e. performs an exhaustive search
       over all references. Using '-no-best' will make the program to search just
       the first N alignments, where N is set using '-num_alignments' i.e. 1 by default.


    --min_lis         INT         Optional  Search only alignments that have the LIS                2
                                            of at least N seeds long

       LIS stands for Longest Increasing Subsequence. It is computed using seeds, which
       are k-mers common to the read and the reference sequence. Sorted sequences of such seeds
       are used to filter the candidate references prior performing the Smith-Waterman alignment.


    --print_all_reads BOOL        Optional  Output null alignment strings for non-aligned reads     False
                                            to SAM and/or BLAST tabular files

    --paired          BOOL        Optional  Flags paired reads                                      False

        If a single reads file is provided, use this option to indicate
        the file contains interleaved paired reads when neither
        'paired_in' | 'paired_out' | 'out2' | 'sout' are specified.


    --paired_in       BOOL        Optional  Flags the paired-end reads as Aligned,                  False
                                            when either of them is Aligned.

        With this option both reads are output into Aligned FASTA/Q file
        Must be used with 'fastx'.
        Mutually exclusive with 'paired_out'.


    --paired_out      BOOL        Optional  Flags the paired-end reads as Non-aligned,              False
                                            when either of them is non-aligned.

        With this option both reads are output into Non-Aligned FASTA/Q file
        Must be used with 'fastx'.
        Mutually exclusive with 'paired_in'.


    --out2            BOOL        Optional  Output paired reads into separate files.                False

       Must be used with 'fastx'.
       If a single reads file is provided, this options implies interleaved paired reads
       When used with 'sout', four (4) output files for aligned reads will be generated:
       'aligned-paired-fwd, aligned-paired-rev, aligned-singleton-fwd, aligned-singleton-rev'.
       If 'other' option is also used, eight (8) output files will be generated.


    --sout            BOOL        Optional  Separate paired and singleton aligned reads.            False

       To be used with 'fastx'.
       If a single reads file is provided, this options implies interleaved paired reads
       Cannot be used with 'paired_in' | 'paired_out'


    --zip-out         STR/BOOL    Optional  Controls the output compression                        '-1'

       By default the report files are produced in the same format as the input i.e.
       if the reads files are compressed (gz), the output is also compressed.
       The default behaviour can be overriden by using '-zip-out'.
       The possible values: '1/true/t/yes/y'
                            '0/false/f/no/n'
                            '-1' (the same format as input - default)
       The values are Not case sensitive i.e. 'Yes, YES, yEs, Y, y' are all OK
       Examples:
       '-reads freads.gz -zip-out n' : generate flat output when the input is compressed
       '-reads freads.flat -zip-out' : compress the output when the input files are flat


    --match           INT         Optional  SW score (positive integer) for a match.                2

    --mismatch        INT         Optional  SW penalty (negative integer) for a mismatch.          -3

    --gap_open        INT         Optional  SW penalty (positive integer) for introducing a gap.    5

    --gap_ext         INT         Optional  SW penalty (positive integer) for extending a gap.      2

    -e                DOUBLE      Optional  E-value threshold.                                      1

       Defines the 'statistical significance' of a local alignment.
       Exponentially correllates with the Minimal Alignment score.
       Higher E-values (100, 1000, ...) cause More reads to Pass the alignment threshold


    -F                BOOL        Optional  Search only the forward strand.                         False

    -N                BOOL        Optional  SW penalty for ambiguous letters (N's) scored
                                            as --mismatch

    -R                BOOL        Optional  Search only the reverse-complementary strand.           False


    [OTU_PICKING]
    --id              INT         Optional  %%id similarity threshold (the alignment                0.97
                                            must still pass the E-value threshold).

    --coverage        INT         Optional  %%query coverage threshold (the alignment must          0.97
                                            still pass the E-value threshold)

    --de_novo_otu     BOOL        Optional  Output FASTA file with 'de novo' reads                  False

       Read is 'de novo' if its alignment score passes E-value threshold, but both the identity
       '-id', and the '-coverage' are below their corresponding thresholds
       i.e. ID < %%id and COV < %%cov


    --otu_map         BOOL        Optional  Output OTU map (input to QIIMEs make_otu_table.py).    False
                                            Cannot be used with 'no-best because
                                            the grouping is done around the best alignment


    [ADVANCED]
    --passes          INT,INT,INT Optional  Three intervals at which to place the seed on           L,L/2,3
                                             the read (L is the seed length)

    --edges           INT         Optional  Number (or percent if INT followed by %% sign) of       4
                                            nucleotides to add to each edge of the read
                                            prior to SW local alignment

    --num_seeds       BOOL        Optional  Number of seeds matched before searching                2
                                            for candidate LIS

    --full_search     INT         Optional  Search for all 0-error and 1-error seed                 False
                                            matches in the index rather than stopping
                                            after finding a 0-error match (<1%% gain in
                                            sensitivity with up four-fold decrease in speed)

    --pid             BOOL        Optional  Add pid to output file names.                           False

    -a                INT         Optional  DEPRECATED in favour of '-threads'. Number of           numCores
                                            processing threads to use.
                                            Automatically redirects to '-threads'

    --threads         INT         Optional  Number of Processing threads to use                     2


    [INDEXING]
    --index           INT         Optional  Build reference database index                          2

       By default when this option is not used, the program checks the reference index and
       builds it if not already existing.
       This can be changed by using '-index' as follows:
       '-index 0' - skip indexing. If the index does not exist, the program will terminate
                                and warn to build the index prior performing the alignment
       '-index 1' - only perform the indexing and terminate
       '-index 2' - the default behaviour, the same as when not using this option at all


    -L                DOUBLE      Optional  Indexing: seed length.                                  18

    -m                DOUBLE      Optional  Indexing: the amount of memory (in Mbytes) for          3072
                                            building the index.

    -v                BOOL        Optional  Produce verbose output when building the index          True

    --interval        INT         Optional  Indexing: Positive integer: index every Nth L-mer in    1
                                            the reference database e.g. '-interval 2'.

    --max_pos         INT         Optional  Indexing: maximum (integer) number of positions to      1000
                                            store for each unique L-mer.
                                            If 0 - all positions are stored.


    [HELP]
    -h                BOOL        Optional  Print help information

    --version         BOOL        Optional  Print SortMeRNA version number


    [DEVELOPER]
    --dbg_put_db      BOOL        Optional
    --cmd             BOOL        Optional  Launch an interactive session (command prompt)          False

    --task            INT         Optional  Processing Task                                         4

       Possible values: 0 - align. Only perform alignment
                        1 - post-processing (log writing)
                        2 - generate reports
                        3 - align and post-process
                        4 - all


    --dbg-level       INT         Optional  Debug level                                             0

      Controls verbosity of the execution trace. Default value of 0 corresponds to
      the least verbose output.
      The highest value currently is 2.
Running a test example:
[user@cn3329 user]$ cp $SORTMERNA_DATA/* . 
[user@cn3329 user]$ sortmerna -ref ref_GQ099317_forward_and_rc.fasta   -reads set2_environmental_study_550_amplicon.fasta -workdir out
[process:1393] === Options processing starts ... ===
Found value: /opt/conda/envs/sortmerna/bin/sortmerna
Found flag: -ref
Found value: ref_GQ099317_forward_and_rc.fasta of previous flag: -ref
Found flag: -reads
Found value: set2_environmental_study_550_amplicon.fasta of previous flag: -reads
Found flag: -workdir
Found value: out of previous flag: -workdir
[opt_workdir:995] Using WORKDIR: "/gpfs/gsfs7/users/user/sortmeRNA/out" as specified
[process:1483] Processing option: reads with value: set2_environmental_study_550_amplicon.fasta
[opt_reads:98] Processing reads file [1] out of total [1] files
[process:1483] Processing option: ref with value: ref_GQ099317_forward_and_rc.fasta
[opt_ref:158] Processing reference [1] out of total [1] references
[opt_ref:206] File "/gpfs/gsfs7/users/user/sortmeRNA/ref_GQ099317_forward_and_rc.fasta" exists and is readable
[process:1503] === Options processing done ===
[process:1504] Alignment type: [best:1 num_alignments:1 min_lis:2 seeds:2]
[validate_kvdbdir:1248] Key-value DB location "/gpfs/gsfs7/users/user/sortmeRNA/out/kvdb"
[validate_kvdbdir:1284] Creating KVDB directory: "/gpfs/gsfs7/users/user/sortmeRNA/out/kvdb"
[validate_idxdir:1214] Using index directory: "/gpfs/gsfs7/users/user/sortmeRNA/out/idx"
[validate_idxdir:1222] Created index directory - OK
[validate_readb_dir:1306] Using split reads directory : "/gpfs/gsfs7/users/user/sortmeRNA/out/readb"
[validate_readb_dir:1314] Created split reads directory - OK
[validate_aligned_pfx:1335] Checking output directory: "/gpfs/gsfs7/users/user/sortmeRNA/out/out"
[validate:1533] No output format has been chosen (fastx|sam|blast|otu_map). Using default 'blast'
[main:62] Running command:
/opt/conda/envs/sortmerna/bin/sortmerna -ref ref_GQ099317_forward_and_rc.fasta -reads set2_environmental_study_550_amplicon.fasta -workdir out
[build_index:1127] ==== Index building started ====
[build_index:1190] Begin indexing file ref_GQ099317_forward_and_rc.fasta of size: 244 under index name out/idx/10850222056975404418
  done.

[build_index:2108] ==== Done index building in 0.0130122 sec ====

[init:108] Readfeed init started
[define_format:877] file: "set2_environmental_study_550_amplicon.fasta" is FASTA flat ASCII
[count_reads:915] started count  ...
[next:322] EOF FWD reached. Total reads: 100000
[count_reads:945] done count. Elapsed time: 0.0979533 sec. Total reads: 100000
[init_split_files:967] added file: out/readb/fwd_0.fa
[init_split_files:967] added file: out/readb/fwd_1.fa
[split:605] start splitting. Using number of splits equals number of processing threads: 2
[next:322] EOF FWD reached. Total reads: 100000
[split:717] Done splitting. Reads count: 100000 Runtime sec: 0.0964581

[init:135] Readfeed init done in sec [0.220833]
[store_to_db:292] Stored Reads statistics to DB:
     all_reads_count= 100000 all_reads_len= 12815500 min_read_len= 75 max_read_len= 152 total_aligned= 0 total_aligned_id= 0 total_aligned_cov= 0 total_aligned_id_cov= 0 total_denovo= 0 num_short= 0 reads_matched_per_db= TODO is_stats_calc= 0 is_total_reads_mapped_cov= 0

[align:143] ==== Starting alignment ====
[align:146] Number of cores: 64
[align:163] Using number of Processor threads: 2
[Refstats:60] Index Statistics calculation starts ... done in: 0.273445 sec
[align:185] Loading index: 0 part: 1/1 Memory KB: 14 ...
[align:190] done in [0.0122706] sec Memory KB: 23
[align:193] Loading references ...
[align:197] done in [1.4727e-05] sec. Memory KB: 23
[align2:70] Processor 1 thread 23456112350976 started
[align2:70] Processor 0 thread 23456110249728 started
[next:455] EOF REV reached. Total reads: 50000
[align2:133] Processor 1 thread 23456112350976 done. Processed 50000 reads. Skipped already processed: 0 reads Aligned reads (passing E-value): 0 Runtime sec: 0.446079
[next:455] EOF FWD reached. Total reads: 50000
[align2:133] Processor 0 thread 23456110249728 done. Processed 50000 reads. Skipped already processed: 0 reads Aligned reads (passing E-value): 0 Runtime sec: 0.45168
[align:220] done index: 0 part: 1 in 0.451754 sec Memory KB: 23
[align:227] Index and References unloaded in 0.00035275 sec. Memory KB: 23
[align:237] ==== Done alignment in 0.464565 sec ====

[store_to_db:292] Stored Reads statistics to DB:
     all_reads_count= 100000 all_reads_len= 12815500 min_read_len= 75 max_read_len= 152 total_aligned= 0 total_aligned_id= 0 total_aligned_cov= 0 total_aligned_id_cov= 0 total_denovo= 0 num_short= 0 reads_matched_per_db= TODO is_stats_calc= 0 is_total_reads_mapped_cov= 0

[writeSummary:179] ==== Starting summary of alignment statistics ====
[Refstats:60] Index Statistics calculation starts ... done in: 0.271057 sec
[write:62] Using summary file: out/out/aligned.log
[writeSummary:185] ==== Done summary in sec [0.271671] ====

[writeReports:169] === Report generation starts ===
[writeReports:184] Restored Readstats from DB: 1
[Refstats:60] Index Statistics calculation starts ... done in: 0.275999 sec
[writeReports:199] loading reference 0 part 1/1 ... done in 2.4086e-05 sec
[report:93] Report Processor: 1 thread: 23456110249728 started. Memory KB: 23
[report:93] Report Processor: 0 thread: 23456112350976 started. Memory KB: 23
[next:455] EOF FWD reached. Total reads: 50000
[report:152] Report processor: 0 thread: 23456112350976 done. Processed reads: 50000 Invalid reads: 0 Memory KB: 23
[next:455] EOF REV reached. Total reads: 50000
[report:152] Report processor: 1 thread: 23456110249728 done. Processed reads: 50000 Invalid reads: 0 Memory KB: 23
[writeReports:220] done reference 0 part: 1 in 0.157432 sec
[writeReports:226] references unloaded in 5.41e-07 sec Memory KB: 23
[merge:90] deleted out/out/aligned_1.blast
[strip_path_sfx:169] moving out/out/aligned_0.blast -> "out/out/aligned.blast"
[writeReports:268] === done Reports in 0.433945 sec ===
End the interactive session:
[user@cn3329 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226