From the Documentation:
MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets. MMseqs2 is open source GPL-licensed software implemented in C++ for Linux, MacOS, and (as beta version, via cygwin) Windows. The software is designed to run on multiple cores and servers and exhibits very good scalability. MMseqs2 can run 10000 times faster than BLAST. At 100 times its speed it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.
$MMSEQS_TEST_DATA
Allocate an interactive session to follow the mmseqs metagenomic pathogen detection tutorial. Note that for larger databases you would want more memory.
[user@biowulf]$ sinteractive --gres=lscratch:10 -c2 --mem=6g salloc.exe: Pending job allocation 46116226 salloc.exe: job 46116226 queued and waiting for resources salloc.exe: job 46116226 has been allocated resources salloc.exe: Granted job allocation 46116226 salloc.exe: Waiting for resource configuration salloc.exe: Nodes cn3144 are ready for job [user@cn3144]$ cd /lscratch/$SLURM_JOB_ID [user@cn3144]$ module load mmseqs [user@cn3144]$ cp ${MMSEQS_TEST_DATA:-none}/* .
Create a taxonomically annotated sequence database from a subset of uniprot
[user@cn3144]$ gunzip uniprot_sprot_2018_03_mapping.tsv.gz [user@cn3144]$ mmseqs createdb uniprot_sprot_2018_03.fasta.gz uniprot_sprot createdb uniprot_sprot_2018_03.fasta.gz uniprot_sprot MMseqs Version: aabc78c298f35cbc7a4136206c1a83adaa68695f Database type 0 Shuffle input database true Createdb mode 0 Write lookup file 1 Offset of numeric ids 0 Compressed 0 Verbosity 3 Converting sequences [556915] 2s 838ms Time for merging to uniprot_sprot_h: 0h 0m 0s 174ms Time for merging to uniprot_sprot: 0h 0m 0s 358ms Database type: Aminoacid Time for processing: 0h 0m 4s 278ms [user@cn3144]$ mmseqs createtaxdb uniprot_sprot tmp --tax-mapping-file uniprot_sprot_2018_03_mapping.tsv Create directory tmp createtaxdb uniprot_sprot tmp --tax-mapping-file uniprot_sprot_2018_03_mapping.tsv MMseqs Version: aabc78c298f35cbc7a4136206c1a83adaa68695f NCBI tax dump directory Taxonomy mapping file uniprot_sprot_2018_03_mapping.tsv Taxonomy mapping mode 0 Taxonomy db mode 1 Threads 72 Verbosity 3 Download taxdump.tar.gz % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 56.1M 100 56.1M 0 0 128M 0 --:--:-- --:--:-- --:--:-- 129M Loading nodes file ... Done, got 2442778 nodes Loading merged file ... Done, added 68544 merged nodes. Loading names file ... Done Init RMQ ...Done
Similarity search to transfer taxonomic information to the unknown reads
[user@cn3144]$ mmseqs createdb mystery_reads.fasta reads createdb mystery_reads.fasta reads MMseqs Version: aabc78c298f35cbc7a4136206c1a83adaa68695f Database type 0 Shuffle input database true Createdb mode 0 Write lookup file 1 Offset of numeric ids 0 Compressed 0 Verbosity 3 Converting sequences [410667] 0s 417ms Time for merging to reads_h: 0h 0m 0s 57ms Time for merging to reads: 0h 0m 0s 114ms Database type: Nucleotide Time for processing: 0h 0m 1s 111ms [user@cn3144]$ mmseqs taxonomy reads uniprot_sprot lca_result tmp -s 2 taxonomy reads uniprot_sprot lca_result tmp -s 2 MMseqs Version: aabc78c298f35cbc7a4136206c1a83adaa68695f ORF filter 1 ORF filter e-value 100 ORF filter sensitivity 2 LCA mode 3 [...snip...] [user@cn3144]$ mmseqs createtsv reads lca_result lca.tsv # convert results to readable .tsv file [user@cn3144]$ mmseqs taxonomyreport uniprot_sprot lca_result report.txt # create a summary report [user@cn3144]$ head report.txt 1.7143 278 278 no rank 0 unclassified 98.2857 15939 8 no rank 1 root 88.6909 14383 254 no rank 131567 cellular organisms 79.9346 12963 335 superkingdom 2 Bacteria 73.8793 11981 607 phylum 1224 Proteobacteria 49.3741 8007 208 class 28216 Betaproteobacteria 46.7287 7578 452 order 80840 Burkholderiales 40.6857 6598 593 family 119060 Burkholderiaceae 31.4916 5107 893 genus 48736 Ralstonia 16.0449 2602 1 species 305 Ralstonia solanacearum [user@cn3144]$ mmseqs taxonomyreport uniprot_sprot lca_result report.html --report-mode 1 # krona plot [user@cn3144]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf]$
Create a batch input file (e.g. mmseqs.sh), which uses the input file 'mmseqs.in'. For example:
#!/bin/bash module load mmseqs/2-13-45111-219-gaabc78c cd /lscratch/$SLURM_JOB_ID module load mmseqs cp ${MMSEQS_TEST_DATA:-none}/* . gunzip uniprot_sprot_2018_03_mapping.tsv.gz mmseqs createdb uniprot_sprot_2018_03.fasta.gz uniprot_sprot mmseqs createtaxdb uniprot_sprot tmp --tax-mapping-file uniprot_sprot_2018_03_mapping.tsv mmseqs createdb mystery_reads.fasta reads mmseqs taxonomy reads uniprot_sprot lca_result tmp -s 2 mmseqs createtsv reads lca_result lca.tsv mmseqs taxonomyreport uniprot_sprot lca_result report.txt mmseqs taxonomyreport uniprot_sprot lca_result report.html --report-mode 1
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=2 --mem=6g mmseqs.sh