From the Documentation:
MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets. MMseqs2 is open source GPL-licensed software implemented in C++ for Linux, MacOS, and (as beta version, via cygwin) Windows. The software is designed to run on multiple cores and servers and exhibits very good scalability. MMseqs2 can run 10000 times faster than BLAST. At 100 times its speed it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.
$MMSEQS_TEST_DATAAllocate an interactive session to follow the mmseqs metagenomic pathogen detection tutorial. Note that for larger databases you would want more memory.
[user@biowulf]$ sinteractive --gres=lscratch:10 -c2 --mem=6g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load mmseqs
[user@cn3144]$ cp ${MMSEQS_TEST_DATA:-none}/* .
Create a taxonomically annotated sequence database from a subset of uniprot
[user@cn3144]$ gunzip uniprot_sprot_2018_03_mapping.tsv.gz
[user@cn3144]$ mmseqs createdb uniprot_sprot_2018_03.fasta.gz uniprot_sprot
createdb uniprot_sprot_2018_03.fasta.gz uniprot_sprot
MMseqs Version:         aabc78c298f35cbc7a4136206c1a83adaa68695f
Database type           0
Shuffle input database  true
Createdb mode           0
Write lookup file       1
Offset of numeric ids   0
Compressed              0
Verbosity               3
Converting sequences
[556915] 2s 838ms
Time for merging to uniprot_sprot_h: 0h 0m 0s 174ms
Time for merging to uniprot_sprot: 0h 0m 0s 358ms
Database type: Aminoacid
Time for processing: 0h 0m 4s 278ms
[user@cn3144]$ mmseqs createtaxdb uniprot_sprot tmp --tax-mapping-file uniprot_sprot_2018_03_mapping.tsv
Create directory tmp
createtaxdb uniprot_sprot tmp --tax-mapping-file uniprot_sprot_2018_03_mapping.tsv
MMseqs Version:         aabc78c298f35cbc7a4136206c1a83adaa68695f
NCBI tax dump directory
Taxonomy mapping file   uniprot_sprot_2018_03_mapping.tsv
Taxonomy mapping mode   0
Taxonomy db mode        1
Threads                 72
Verbosity               3
Download taxdump.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 56.1M  100 56.1M    0     0   128M      0 --:--:-- --:--:-- --:--:--  129M
Loading nodes file ... Done, got 2442778 nodes
Loading merged file ... Done, added 68544 merged nodes.
Loading names file ... Done
Init RMQ ...Done
Similarity search to transfer taxonomic information to the unknown reads
[user@cn3144]$ mmseqs createdb mystery_reads.fasta reads createdb mystery_reads.fasta reads MMseqs Version: aabc78c298f35cbc7a4136206c1a83adaa68695f Database type 0 Shuffle input database true Createdb mode 0 Write lookup file 1 Offset of numeric ids 0 Compressed 0 Verbosity 3 Converting sequences [410667] 0s 417ms Time for merging to reads_h: 0h 0m 0s 57ms Time for merging to reads: 0h 0m 0s 114ms Database type: Nucleotide Time for processing: 0h 0m 1s 111ms [user@cn3144]$ mmseqs taxonomy reads uniprot_sprot lca_result tmp -s 2 taxonomy reads uniprot_sprot lca_result tmp -s 2 MMseqs Version: aabc78c298f35cbc7a4136206c1a83adaa68695f ORF filter 1 ORF filter e-value 100 ORF filter sensitivity 2 LCA mode 3 [...snip...] [user@cn3144]$ mmseqs createtsv reads lca_result lca.tsv # convert results to readable .tsv file [user@cn3144]$ mmseqs taxonomyreport uniprot_sprot lca_result report.txt # create a summary report [user@cn3144]$ head report.txt 1.7143 278 278 no rank 0 unclassified 98.2857 15939 8 no rank 1 root 88.6909 14383 254 no rank 131567 cellular organisms 79.9346 12963 335 superkingdom 2 Bacteria 73.8793 11981 607 phylum 1224 Proteobacteria 49.3741 8007 208 class 28216 Betaproteobacteria 46.7287 7578 452 order 80840 Burkholderiales 40.6857 6598 593 family 119060 Burkholderiaceae 31.4916 5107 893 genus 48736 Ralstonia 16.0449 2602 1 species 305 Ralstonia solanacearum [user@cn3144]$ mmseqs taxonomyreport uniprot_sprot lca_result report.html --report-mode 1 # krona plot [user@cn3144]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf]$
Create a batch input file (e.g. mmseqs.sh), which uses the input file 'mmseqs.in'. For example:
#!/bin/bash
module load mmseqs/2-13-45111-219-gaabc78c
cd /lscratch/$SLURM_JOB_ID
module load mmseqs
cp ${MMSEQS_TEST_DATA:-none}/* .
gunzip uniprot_sprot_2018_03_mapping.tsv.gz
mmseqs createdb uniprot_sprot_2018_03.fasta.gz uniprot_sprot
mmseqs createtaxdb uniprot_sprot tmp --tax-mapping-file uniprot_sprot_2018_03_mapping.tsv
mmseqs createdb mystery_reads.fasta reads
mmseqs taxonomy reads uniprot_sprot lca_result tmp -s 2
mmseqs createtsv reads lca_result lca.tsv
mmseqs taxonomyreport uniprot_sprot lca_result report.txt
mmseqs taxonomyreport uniprot_sprot lca_result report.html --report-mode 1
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=2 --mem=6g mmseqs.sh