DeepLoc2 on Biowulf
DeepLoc2 uses deep learning to predict the subcellular localization(s) of eukaryotic proteins.
DeepLoc 2.0 is a multi-label predictor, meaning it can predict one or more localizations for any given protein. It differentiates between 10 localizations: Nucleus, Cytoplasm, Extracellular, Mitochondrion, Cell membrane, Endoplasmic reticulum, Chloroplast, Golgi apparatus, Lysosome/Vacuole, and Peroxisome. Additionally, DeepLoc 2.0 can predict the presence of the sorting signal(s) that influenced the prediction of the subcellular localization(s).
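Input is one or more protein sequences in standard FASTA format (passed with -f, as in the examples below). A minimal illustrative input file, with a made-up header and sequence:

>example_protein hypothetical test sequence
MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPF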
References:
- Vineet Thumuluri, Jose Juan Almagro Armenteros, Alexander Rosenberg Johansen, Henrik Nielsen, Ole Winther. DeepLoc 2.0: multi-label subcellular localization prediction using protein language models. Nucleic Acids Research, Web server issue 2022.
Documentation
Important Notes
- Module Name: deeploc (see the modules page for more information)
- singlethreaded
- Environment variables set
- DEEPLOC_TRAIN_DATA
- DEEPLOC_TEST_DATA
- Example files in $DEEPLOC_TEST_DATA or /usr/local/apps/deeploc/TEST_DATA
- The first time you run deeploc2, it downloads checkpoint data. Because of the size of the checkpoint files, deeploc2 on Biowulf is configured to download them to /data/$USER/.cache/torch/hub/checkpoints. To skip the download, you can instead copy the files from $DEEPLOC_TRAIN_DATA/checkpoints before running deeploc2; note that they must be placed in exactly /data/$USER/.cache/torch/hub/checkpoints, as shown in the commands after this list.
- Memory Considerations: We recommend at least 24G of memory for your runs. Please run test jobs to benchmark your dataset.
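To pre-copy the checkpoint files (the same commands appear in the interactive session below):

[user@biowulf]$ module load deeploc
[user@biowulf]$ mkdir -p /data/$USER/.cache/torch/hub
[user@biowulf]$ cp -r $DEEPLOC_TRAIN_DATA/checkpoints /data/$USER/.cache/torch/hub/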
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program on the test data, then compare the output against the reference results in $DEEPLOC_TEST_DATA using diff.
Sample session (user input in bold):
[user@biowulf]$ sinteractive --mem=24G --gres=lscratch:5
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load deeploc
[user@cn3144 ~]$ mkdir -p /data/$USER/.cache/torch/hub
[user@cn3144 ~]$ cp -r $DEEPLOC_TRAIN_DATA/checkpoints /data/$USER/.cache/torch/hub/
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ cp $DEEPLOC_TEST_DATA/test.fasta .
[user@cn3144 ~]$ deeploc2 -f test.fasta
[user@cn3144 ~]$ diff outputs/results_20230101-000000.csv $DEEPLOC_TEST_DATA/results_test.csv
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226

[user@biowulf ~]$
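The run above uses the default settings. deeploc2 accepts additional options: -o (used in the swarm example below) sets the output directory, and, according to the upstream DeepLoc 2.0 documentation, some versions also support -m to select the Fast or Accurate model and -p to generate per-protein importance plots. Treat the following as an illustration and check deeploc2 --help for the options supported on Biowulf:

[user@cn3144 ~]$ deeploc2 -f test.fasta -o myresults -m Accurate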
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. deeploc.sh). For example:
#!/bin/bash
set -e
module load deeploc
cd /data/$USER
deeploc2 -f input.fasta
Submit this job using the Slurm sbatch command.
sbatch [--mem=#] deeploc.sh
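For example, following the memory recommendation above:

sbatch --mem=24g deeploc.sh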
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.
Create a swarmfile (e.g. deeploc.swarm). For example:
deeploc2 -f 01.fasta -o results_01
deeploc2 -f 02.fasta -o results_02
deeploc2 -f 03.fasta -o results_03
deeploc2 -f 04.fasta -o results_04
Submit this job using the swarm command.
swarm -f deeploc.swarm [-g #] --module deeploc

where
-g #               Number of gigabytes of memory required for each process (1 line in the swarm command file)
--module deeploc   Loads the deeploc module for each subjob in the swarm
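For example, following the 24G memory recommendation above:

swarm -f deeploc.swarm -g 24 --module deeploc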