TomoTwin on Biowulf
TomoTwin is an application that enables particle picking for cryo-ET using deep metric learning.
TomoTwin comes pre-trained on 120 different proteins. By embedding tomograms in an information-rich, high-dimensional space that separates macromolecules according to their 3-dimensional structure, TomoTwin allows users to identify proteins in tomograms de novo without manually creating training data or retraining the network each time a new protein is to be located. That means you can simply run it on your specific sample without much additional effort.
Reference:
- Rice, G., Wagner, T., Stabrin, M. et al. TomoTwin: generalized 3D localization of macromolecules in cryo-electron tomograms with structural data mining. Nat Methods (2023).
Documentation
Important Notes
This application uses napari to visualize tomograms. napari requires a graphical connection using NX.
- Module Name: tomotwin (see the modules page for more information)
- GPU enabled
- Environment variables set
    - TOMOTWIN_TEST_DATA
    - TOMOTWIN_MODEL
- Example files in $TOMOTWIN_TEST_DATA
- Model file in $TOMOTWIN_MODEL
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program.
Sample session (user input in bold):
[user@biowulf]$ sinteractive --mem=16G --gres=lscratch:50,gpu:v100x:2
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load tomotwin
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 46116226]$ cp -r $TOMOTWIN_TEST_DATA/* .
[user@cn3144 46116226]$ CUDA_VISIBLE_DEVICES=0,1 tomotwin_embed.py tomogram \
    -m tomotwin_model_p120_052022_loss.pth \
    -v tomo/tomo.mrc \
    -o out/embed/tomo/ \
    -b 400
Latest version of TomoTwin is installed :-)
reading tomotwin_model_p120_052022_loss.pth
Model config: {'identifier': 'SiameseNet', 'network_config': ...}
...
UserWarning: This DataLoader will create 12 worker processes in total. Our suggested max number of worker in current system is 4
...
Embeddings have shape: (5083356, 35)
Wrote embeddings to disk to out/embed/tomo/tomo_embeddings.temb
Done.
[user@cn3144 46116226]$ tomotwin_tools.py extractref \
    --tomo tomo/tomo.mrc \
    --coords ref.coords \
    --out out/extracted_ref/
1it [00:00, 97.91it/s]
wrote subvolume reference to out/extracted_ref/
[user@cn3144 46116226]$ CUDA_VISIBLE_DEVICES=0,1 tomotwin_embed.py subvolumes \
    -m tomotwin_model_p120_052022_loss.pth \
    -v out/extracted_ref/reference_0.mrc \
    -o out/embed/ref
Latest version of TomoTwin is installed :-)
reading tomotwin_model_p120_052022_loss.pth
Model config: {'identifier': 'SiameseNet', 'network_config': ...}
UserWarning: This DataLoader will create 12 worker processes in total. Our suggested max number of worker in current system is 4
...
Done.
Wrote results to out/embed/ref/embeddings.temb
[user@cn3144 46116226]$ tomotwin_map.py distance \
    -r out/embed/ref/embeddings.temb \
    -v out/embed/tomo/tomo_embeddings.temb \
    -o out/map/
Latest version of TomoTwin is installed :-)
Read embeddings
Map references: 100%|████████████████████████████████████| 1/1 [00:04<00:00, 4.55s/it]
Prepare output...
Wrote output to out/map/map.tmap
[user@cn3144 46116226]$ tomotwin_locate.py findmax -m out/map/map.tmap -o out/locate/
Latest version of TomoTwin is installed :-)
start locate reference_0.mrc
effective global min: 0.5
Locate class 0: 100%|███████████████████████████████| 31071/31071 [00:06<00:00, 4604.62it/s]
Call get_avg_pos done 0
Located reference_0.mrc 847
Non-maximum-supression: 100%|███████████████████████| 847/847 [00:00<00:00, 3489.54it/s]
Particles of class reference_0.mrc: 844 (before NMS: 847)
[user@cn3144 46116226]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. tomotwin.sh). For example:
#!/bin/bash
set -e

cd /lscratch/$SLURM_JOB_ID
module load tomotwin
cp -r $TOMOTWIN_TEST_DATA/* .

CUDA_VISIBLE_DEVICES=0,1 tomotwin_embed.py tomogram \
    -m tomotwin_model_p120_052022_loss.pth \
    -v tomo/tomo.mrc \
    -o out/embed/tomo/ \
    -b 400
tomotwin_tools.py extractref \
    --tomo tomo/tomo.mrc \
    --coords ref.coords \
    --out out/extracted_ref/
CUDA_VISIBLE_DEVICES=0,1 tomotwin_embed.py subvolumes \
    -m tomotwin_model_p120_052022_loss.pth \
    -v out/extracted_ref/reference_0.mrc \
    -o out/embed/ref
tomotwin_map.py distance \
    -r out/embed/ref/embeddings.temb \
    -v out/embed/tomo/tomo_embeddings.temb \
    -o out/map/
tomotwin_locate.py findmax -m out/map/map.tmap -o out/locate/
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] [--gres=lscratch:#,gpu:type:2] tomotwin.sh
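For a first run with the test data, a concrete submission could look like the following. The resource values are illustrative, not requirements; two GPUs match the CUDA_VISIBLE_DEVICES=0,1 setting in the script. The `echo` prints the full command as a dry run so you can check the flags; drop it to actually submit.

```shell
# Illustrative resource requests for the example pipeline; adjust to your data.
RESOURCES="--cpus-per-task=12 --mem=32g --gres=lscratch:50,gpu:v100x:2"
# Dry run: print the command for inspection. Remove 'echo' to submit.
echo sbatch $RESOURCES tomotwin.sh
```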
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.
Create a swarmfile (e.g. tomotwin.swarm). For example:
CUDA_VISIBLE_DEVICES=0,1 tomotwin_embed.py tomogram -m model.pth -v tomo1.mrc -o out1 -b 400
CUDA_VISIBLE_DEVICES=0,1 tomotwin_embed.py tomogram -m model.pth -v tomo2.mrc -o out2 -b 400
CUDA_VISIBLE_DEVICES=0,1 tomotwin_embed.py tomogram -m model.pth -v tomo3.mrc -o out3 -b 400
CUDA_VISIBLE_DEVICES=0,1 tomotwin_embed.py tomogram -m model.pth -v tomo4.mrc -o out4 -b 400
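If your tomograms follow a numeric naming pattern, the swarmfile can be generated with a loop instead of being written line by line. The tomo${i}.mrc and out${i} names below are placeholders for your own input and output paths:

```shell
# Write one embedding command per tomogram into the swarmfile.
for i in 1 2 3 4; do
    echo "CUDA_VISIBLE_DEVICES=0,1 tomotwin_embed.py tomogram -m model.pth -v tomo${i}.mrc -o out${i} -b 400"
done > tomotwin.swarm
```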
Submit this job using the swarm command.
swarm -f tomotwin.swarm [-g #] [--gres=gpu:type:2] --module tomotwin

where

-g #                Number of Gigabytes of memory required for each process (1 line in the swarm command file)
--gres=gpu:type:2   2 GPUs required for each process (1 line in the swarm command file). Replace type with an available GPU type such as v100, v100x, p100, or a100.
--module tomotwin   Loads the tomotwin module for each subjob in the swarm
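A filled-in submission for the four-line swarmfile above might look like this (the 32 GB and v100x values are illustrative; `echo` makes it a dry run, remove it to submit):

```shell
# Dry run of an example swarm submission: 32 GB and 2 v100x GPUs per line.
echo swarm -f tomotwin.swarm -g 32 --gres=gpu:v100x:2 --module tomotwin
```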