shmlast: an improved implementation of Conditional Reciprocal Best Hits with LAST and Python
shmlast is a reimplementation of the Conditional Reciprocal Best Hits algorithm for finding potential orthologs between a transcriptome and a species-specific protein database. It uses the LAST aligner and the pydata stack to achieve much better performance while staying in the Python ecosystem.
References:
- Bo Li, Nathanael Fillmore, Yongsheng Bai, Mike Collins, James A. Thomson, Ron Stewart, and Colin N. Dewey.
Evaluation of de novo transcriptome assemblies from RNA-Seq data
Genome Biology 2014, 15:553
Documentation
Important Notes
- Module Name: shmlast (see the modules page for more information)
- Unusual environment variables set
- SHMLAST_HOME installation directory
- SHMLAST_BIN executable directory
- SHMLAST_DATA sample data directory
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=4g [user@@cn3200 ~]$module load shmlast [+] Loading singularity 3.10.0 on cn1113 [+] Loading shmlast 1.6 [user@biowulf]$ shmlast usage: shmlast [-h] [--version] {rbl,crbl} ... shmlast is a reimplementation of the Conditional Reciprocal Best Hits algorithm for finding potential orthologs between a transcriptome and a species-specific protein database. It uses the LAST aligner and the pydata stack to achieve much better performance while staying in the Python ecosystem. positional arguments: {rbl,crbl} optional arguments: -h, --help show this help message and exit --version show program's version number and exit [user@biowulf]$End the interactive session:shmlast 1.2.1 -- Camille Scott, 2016 ------------------------------------ subcommand: Conditional Reciprocal Best LAST doit action: run --- Begin Task Execution --- . rename:/usr/local/apps/shmlast/1.2.1/sample_data/test-transcript.fa: * Python: rename_input . rename:/usr/local/apps/shmlast/1.2.1/sample_data/test-protein.fa: * Python: rename_input . translate:.test-transcript.fa: * Python: function translate_fastx . lastdb:.test-transcript.fa.pep: * Cmd: `/usr/local/apps/shmlast/1.2.1/bin/lastdb -p -w3 .test-transcript.fa.pep .test-transcript.fa.pep` . lastdb:.test-protein.fa: * Cmd: `/usr/local/apps/shmlast/1.2.1/bin/lastdb -p -w3 .test-protein.fa .test-protein.fa` . lastal:.test-protein.fa.x.test-transcript.fa.pep.maf: * Cmd: `cat .test-protein.fa | /usr/local/apps/parallel/20171222/bin/parallel --round-robin --pipe -L 2 -N 10000 --gnu -j 1 -a .test-protein.fa /usr/local/apps/shmlast/1.2.1/bin/lastal -D100000.0 .test-transcript.fa.pep > .test-protein.fa.x.test-transcript.fa.pep.maf` . lastal:.test-transcript.fa.pep.x.test-protein.fa.maf: * Cmd: `cat .test-transcript.fa.pep | /usr/local/apps/parallel/20171222/bin/parallel --round-robin --pipe -L 2 -N 10000 --gnu -j 1 -a .test-transcript.fa.pep /usr/local/apps/shmlast/1.2.1/bin/lastal -D100000.0 .test-protein.fa > .test-transcript.fa.pep.x.test-protein.fa.maf` . fit_and_filter_crbl_hits: * Python: do_crbl_fit_and_filter
[user@cn3200 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. shmlast.sh). For example:
#!/bin/bash #SBATCH --mem=4g module load shmlast cd /data/$USER shmlast crbl -q $SHMLAST_DATA/test-transcript.fa -d $SHMLAST_DATA/test-protein.fa
Submit this job using the Slurm sbatch command.
sbatch shmlast.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.
Create a swarmfile (e.g. shmlast.swarm). For example:
#!/bin/bash module load shmlast cd /data/$USER shmlast crbl -q $SHMLAST_DATA/test-transcript.fa -d $SHMLAST_DATA/test-protein.fa
Submit this job using the swarm command.
swarm -f shmlast.swarm -g 4