Biowulf High Performance Computing at the NIH
sortmeRNSA: next-generation sequence filtering and alignment tool

SortMeRNA is a local sequence alignment tool for filtering, mapping and clustering. The core algorithm is based on approximate seeds and allows for sensitive analysis of NGS reads. The main application of SortMeRNA is filtering rRNA from metatranscriptomic data. SortMeRNA takes as input a file of reads (fasta or fastq format) and one or multiple rRNA database file(s), and sorts apart aligned and rejected reads into two files specified by the user.

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=lscratch:10 --mem=8g 

[user@cn3329 ~]$ module load sortmeRNA  
[+] Loading python 3.6  ...
[+] Loading sortmeRNA 3.0.3  ... 
[user@cn3329 ~]$ sortmerna -h 

  Program:      SortMeRNA version 3.0.3
  Copyright:    2016-2019 Clarity Genomics BVBA:
                Turnhoutseweg 30, 2340 Beerse, Belgium
                2014-2016 Knight Lab:
                Department of Pediatrics, UCSD, La Jolla
                2012-2014 Bonsai Bioinformatics Research Group:
                LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
  Disclaimer:   SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
                implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
                See the GNU Lesser General Public License for more details.
  Contributors: Jenya Kopylova   jenya.kopylov@gmail.com
                Laurent Noé      laurent.noe@lifl.fr
                Pierre Pericard  pierre.pericard@lifl.fr
                Daniel McDonald  wasade@gmail.com
                Mikaël Salson    mikael.salson@lifl.fr
                Hélène Touzet    helene.touzet@lifl.fr
                Rob Knight       robknight@ucsd.edu


  usage:   ./sortmerna --ref db.fasta,db.idx --reads file.fa --aligned base_name_output [OPTIONS]:
  OR
  usage:   ./sortmerna --ref db.fasta,db.idx --reads-gz file.fa.gz --aligned base_name_output [OPTIONS]:

  -------------------------------------------------------------------------------------------------------------
  | option              type-format       description                                              default    |
  -------------------------------------------------------------------------------------------------------------
  [REQUIRED OPTIONS]:
    --ref              STRING,STRING   FASTA reference file:index file                           mandatory
                                         If passing multiple reference files, separate them
                                         using the delimiter ':' (Linux) or ';' (Windows),
                 (ex. --ref /path/to/file1.fasta,/path/to/index1:/path/to/file2.fasta,path/to/index2)
    --reads            STRING          FASTA/FASTQ raw reads file                                mandatory
        OR
    --reads-gz         STRING          FASTA/FASTQ compressed (with gzip) reads file             mandatory
    --aligned          STRING          aligned reads filepath + base file name                   mandatory
                                         (appropriate extension will be added)

  [COMMON OPTIONS]:
    --other           STRING           rejected reads filepath + base file name
                                         (appropriate extension will be added)                   
    --fastx           BOOL             output FASTA/FASTQ file                                   off
                                         (for aligned and/or rejected reads)
    --sam             BOOL             output SAM alignment                                      off
                                         (for aligned reads only)
    --SQ              BOOL             add SQ tags to the SAM file                               off
    --blast           STRING           output alignments in various Blast-like formats
                                           '0' - pairwise
                                           '1' - tabular (Blast -m 8 format)
                                           '1 cigar' - tabular + column for CIGAR
                                           '1 cigar qcov' - tabular + columns for CIGAR
                                                      and query coverage
                                           '1 cigar qcov qstrand' - tabular + columns for CIGAR,
                                                                query coverage and strand
    --log             BOOL             output overall statistics                                 off
    --num_alignments  INT              report first INT alignments per read reaching E-value     -1
                                        (--num_alignments 0 signifies all alignments will be output)
       OR (default)
    --best            INT              report INT best alignments per read reaching E-value      1
                                         by searching --min_lis INT candidate alignments
                                        (--best 0 signifies all candidate alignments will be searched)
    --min_lis         INT              search all alignments having the first INT longest LIS    2
                                         LIS stands for Longest Increasing Subsequence, it is
                                         computed using seeds' positions to expand hits into
                                         longer matches prior to Smith-Waterman alignment.
   --print_all_reads  BOOL             output null alignment strings for non-aligned reads       off
                                         to SAM and/or BLAST tabular files
    --paired_in       BOOL             both paired-end reads go in --aligned fasta/q file        off
                                        (interleaved reads only, see Section 4.2.4 of User Manual)
    --paired_out       BOOL            both paired-end reads go in --other fasta/q file          off
                                       (interleaved reads only, see Section 4.2.4 of User Manual)
    --match           INT              SW score (positive integer) for a match                   2
    --mismatch        INT              SW penalty (negative integer) for a mismatch              -3
    --gap_open        INT              SW penalty (positive integer) for introducing a gap       5
    --gap_ext         INT              SW penalty (positive integer) for extending a gap         2
    -N                INT              SW penalty for ambiguous letters (N's)
                                         scored as --mismatch
    -F                BOOL             search only the forward strand                            off
    -R                BOOL             search only the reverse-complementary strand              off
    -e                DOUBLE           E-value threshold                                         1
    -v                BOOL             verbose                                                   off

  [OTU PICKING OPTIONS]:
    --id              DOUBLE           %%id similarity threshold (the alignment must             0.97
                                         still pass the E-value threshold)

    --coverage        DOUBLE           %%query coverage threshold (the alignment must            0.97
                                         still pass the E-value threshold)

    --de_novo_otu     BOOL             FASTA/FASTQ file for reads matching database < %%id       off
                                         (set using --id) and < %%cov (set using --coverage)
                                         (alignment must still pass the E-value threshold)
    --otu_map         BOOL             output OTU map (input to QIIME’s make_otu_table.py)       off

  [ADVANCED OPTIONS] (see SortMeRNA user manual for more details):
    --passes          INT,INT,INT      three intervals at which to place the seed on the read    L,L/2,3
                                         (L is the seed length set in ./indexdb_rna)
    --edges           INT              number (or percent if INT followed by %% sign) of         4
                                         nucleotides to add to each edge of the read
                                         prior to SW local alignment
    --num_seeds       INT              number of seeds matched before searching                  2
                                         for candidate LIS
    --full_search     BOOL             search for all 0-error and 1-error seed                   off
                                         matches in the index rather than stopping
                                         after finding a 0-error match (<1%% gain in
                                         sensitivity with up four-fold decrease in speed)
    --pid             BOOL             add pid to output file names                              off

   [HELP]:
    -h                BOOL             help
    --version         BOOL             SortMeRNA version number

  [DEVELOPER OPTIONS]:
    --cmd             BOOL             launch an interactive session (command prompt)
    --task            INT              Processing Task
                                           0 - align Only perform alignment
                                           1 - post-processing (log writing)
                                           2 - generate reports
                                           3 - align and post−process
                                           4 - all (default)

    -d                STRING           key-value datastore FULL folder path                      USERDIR/kvdb
    -a                INT              number of threads to use                                  numCores
    --threads         INT:INT:INT      number of Read:Write:Process threads to use               1:1:numCores
    --thpp            INT:INT:INT      number of Post-Processing Read:Process threads to use     1:1
    --threp           INT:INT:INT      number of Report Read:Process threads to use              1:1
Running a test example:
[user@cn3329 user]$ cp $SORTMERNA_TESTS/* . 
[user@cn3329 user]$ python test_sortmerna.py
test_blast_format_0
test_blast_format_0: ['indexdb', '--ref', '/tmp/temp_subject_7zj42gxo.fasta,/tmp/tmpcse54q49/subject_str', '-v']
b'\n  Program:      SortMeRNA version 3.0.3\n  Copyright:    2016-2019 Clarity Genomics BVBA:\n                Turnhoutseweg 30, 2340 Beerse, Belgium\n                2014-2016 Knight Lab:\n                Department of Pediatrics, UCSD, La Jolla\n                2012-2014 Bonsai Bioinformatics Research Group:\n                LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe\n  Disclaimer:   SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the\n                implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n                See the GNU Lesser General Public License for more details.\n  Contributors: Jenya Kopylova   jenya.kopylov@gmail.com \n                Laurent No\xc3\xa9      laurent.noe@lifl.fr\n                Pierre Pericard  pierre.pericard@lifl.fr\n                Daniel McDonald  wasade@gmail.com\n                Mika\xc3\xabl Salson    mikael.salson@lifl.fr\n                H\xc3\xa9l\xc3\xa8ne Touzet    helene.touzet@lifl.fr\n                Rob Knight       robknight@ucsd.edu\n\n\n  Parameters summary: \n    K-mer size: 19\n    K-mer interval: 1\n    Maximum positions to store per unique K-mer: 10000\n\n  Total number of databases to index: 1\n\n  Begin indexing file \x1b[0;34m/tmp/temp_subject_7zj42gxo.fasta\x1b[0m under index name \x1b[0;34m/tmp/tmpcse54q49/subject_str\x1b[0m: \n  Collecting sequence distribution statistics ..  done  [0.000021 sec]\n\n  start index part # 0: \n    (1/3) building burst tries .. done  [0.002107 sec]\n    (2/3) building CMPH hash .. done  [0.001686 sec]\n    (3/3) building position lookup tables .. done [0.000413 sec]\n    total number of sequences in this part = 1\n      temporary file was here: /tmp/sortmerna_keys_40577.txt\n      writing kmer data to /tmp/tmpcse54q49/subject_str.kmer_0.dat\n      writing burst tries to /tmp/tmpcse54q49/subject_str.bursttrie_0.dat\n      writing position lookup table to /tmp/tmpcse54q49/subject_str.pos_0.dat\n      writing nucleotide distribution statistics to /tmp/tmpcse54q49/subject_str.stats\n    done.\n\n'
test_blast_format_0: ['sortmerna', '--ref', '/tmp/temp_subject_7zj42gxo.fasta,/tmp/tmpcse54q49/subject_str', '--reads', '/tmp/temp_query_6g4gqdd5.fasta', '--aligned', '/tmp/tmpcse54q49/aligned', '--sam', '--blast', '0', '-v', '-d', '/tmp/tmpcse54q49/kvdb', '--task', '4']
b'test_kvdb_path: Using Key-value DB location: /tmp/tmpcse54q49/kvdb\ndirExists: Path does not exist: /tmp/tmpcse54q49/kvdb\ntest_kvdb_path: Directory /tmp/tmpcse54q49/kvdb does not exists - will attempt to create\n  Program:      SortMeRNA version 3.0.3\n  Copyright:    2016-2019 Clarity Genomics BVBA:\n                Turnhoutseweg 30, 2340 Beerse, Belgium\n                2014-2016 Knight Lab:\n                Department of Pediatrics, UCSD, La Jolla\n                2012-2014 Bonsai Bioinformatics Research Group:\n                LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe\n  Disclaimer:   SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the\n                implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n                See the GNU Lesser General Public License for more details.\n  Contributors: Jenya Kopylova   jenya.kopylov@gmail.com \n                Laurent No\xc3\xa9      laurent.noe@lifl.fr\n                Pierre Pericard  pierre.pericard@lifl.fr\n                Daniel McDonald  wasade@gmail.com\n                Mika\xc3\xabl Salson    mikael.salson@lifl.fr\n                H\xc3\xa9l\xc3\xa8ne Touzet    helene.touzet@lifl.fr\n                Rob Knight       robknight@ucsd.edu\n\nmain:55 Running task ALIGN_REPORT: 4\nReadstats::calculate starting ...   Readstats::calculate done. Elapsed time: 0.00 sec. Reads processed: 1\nalign:375 Using default number of Processor threads equals num CPU cores: 56\nNumber of cores: 56 Read threads:  1 Write threads: 1 Processor threads: 56\nThreadPool:36 initialized Pool with: [58] threads\nread_queue created\nwrite_queue created\nRefstats:32 Index Statistics calculation Start ... Done. Time elapsed: 0.54 sec\nalign:414 Loading index 0 part 1/1 ... done [0.02] sec\nalign:425 Loading references  ... done [0.00] sec\nWriter writer_0 thread 46912526321408 started\nreader_0 thread: 46912520017664 started\nProcessor proc_0 thread 46912530523904 started\nProcessor proc_1 thread 46912524220160 
...
test_blast_format_0: Run time: 1.743835210800171
...
test_simulated_amplicon_generic_buffer: Run time: 71.58072853088379
.
----------------------------------------------------------------------
Ran 20 tests in 424.694s

OK (skipped=1)
End the interactive session:
[user@cn3329 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226