This module provides the batch scripts of the ColabFold implementation of AlphaFold.
$COLABFOLD_TEST_DATA
/fdb/colabfold/
Allocate an interactive session for generating the multiple sequence alignments (MSAs). This session requires at least 128GB of memory. Note that colabfold_search is optimized for running many query sequences in a single job. Never run single sequences in the search part of the analysis: it is inefficient and, if you run many such jobs, it will strain the file system. For example, below we create alignments for 17 polymerase-related proteins of the mimivirus genome, which takes about 100 minutes. Creating MSAs for all 979 proteins of the mimivirus genome takes about 285 minutes, or only about 2.8x longer for more than 50x as many proteins. Note that colabfold_search treats each protein in a FASTA file as a separate monomer.
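Because colabfold_search works best with many queries per job, sequences stored in separate files can be merged into one FASTA file before searching. The following is a minimal sketch, assuming a hypothetical directory with one `.fa` file per protein; the `combine_fasta` function and all file names are illustrative and not part of the module.

```shell
#!/bin/bash
# combine_fasta DIR OUT: concatenate every *.fa file in DIR into OUT and
# print the number of sequences (header lines starting with '>') in OUT.
# DIR and OUT are hypothetical names; OUT should live outside DIR.
combine_fasta() {
    local dir=$1 out=$2 f
    : > "$out"                          # start from an empty output file
    for f in "$dir"/*.fa; do
        [ -e "$f" ] || continue         # the glob matched nothing
        cat "$f" >> "$out"
        # add a newline if the file just appended did not end with one
        if [ -n "$(tail -c 1 "$out")" ]; then
            echo >> "$out"
        fi
    done
    grep -c '^>' "$out"
}

# Usage (hypothetical paths):
#   combine_fasta queries all_queries.fasta
```

The printed count can be compared with the expected number of proteins before the combined file is passed to colabfold_search.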
[user@biowulf]$ sinteractive --mem=128G --cpus-per-task=16 --gres=lscratch:100
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load colabfold
[user@cn3144]$ colabfold_search --threads $SLURM_CPUS_PER_TASK \
    $COLABFOLD_TEST_DATA/mimi_poly.fa $COLABFOLD_DB mimi_poly
[...much output...]
[user@cn3144]$ ls -lh mimi_poly
total 66M
-rw-r--r-- 1 user group  17M Sep 22 16:14 0.a3m
-rw-r--r-- 1 user group 2.7M Sep 22 16:14 10.a3m
-rw-r--r-- 1 user group  92K Sep 22 16:14 11.a3m
-rw-r--r-- 1 user group 296K Sep 22 16:14 12.a3m
-rw-r--r-- 1 user group  33K Sep 22 16:14 13.a3m
-rw-r--r-- 1 user group 6.4M Sep 22 16:14 14.a3m
-rw-r--r-- 1 user group  42K Sep 22 16:14 15.a3m
-rw-r--r-- 1 user group 7.1K Sep 22 16:14 16.a3m
-rw-r--r-- 1 user group  23M Sep 22 16:14 1.a3m
-rw-r--r-- 1 user group  12M Sep 22 16:14 2.a3m
-rw-r--r-- 1 user group 797K Sep 22 16:14 3.a3m
-rw-r--r-- 1 user group 200K Sep 22 16:14 4.a3m
-rw-r--r-- 1 user group 580K Sep 22 16:14 5.a3m
-rw-r--r-- 1 user group 273K Sep 22 16:14 6.a3m
-rw-r--r-- 1 user group 160K Sep 22 16:14 7.a3m
-rw-r--r-- 1 user group 464K Sep 22 16:14 8.a3m
-rw-r--r-- 1 user group 2.6M Sep 22 16:14 9.a3m
[user@cn3144]$ exit
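Since the search writes one `.a3m` file per query, a quick sanity check is to compare the number of sequences in the input with the number of alignments produced. This is only a sketch: `check_msa_count` is a hypothetical helper, not something provided by the colabfold module.

```shell
#!/bin/bash
# check_msa_count FASTA MSA_DIR: succeed only if MSA_DIR contains exactly
# one .a3m file per sequence in FASTA. Hypothetical helper for illustration.
check_msa_count() {
    local fasta=$1 msa_dir=$2 n_query n_msa
    n_query=$(grep -c '^>' "$fasta")
    n_msa=$(find "$msa_dir" -maxdepth 1 -name '*.a3m' | wc -l | tr -d ' ')
    echo "queries: $n_query, alignments: $n_msa"
    [ "$n_query" -eq "$n_msa" ]
}

# Usage with the example data from this page:
#   check_msa_count $COLABFOLD_TEST_DATA/mimi_poly.fa mimi_poly
```

The function returns a non-zero exit status on a mismatch, so it can guard the model-building step in a script.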
Now we can build the structure predictions for the 17 proteins above. This step requires a GPU. Each of the 17 proteins in this example is modeled as a monomer; see below for multimer predictions. With the default settings on an A100, model generation for the 17 proteins takes approximately 4 hours.
[user@biowulf]$ sinteractive --mem=48G --cpus-per-task=8 --gres=lscratch:100,gpu:a100:1
salloc.exe: Pending job allocation 46116227
salloc.exe: job 46116227 queued and waiting for resources
salloc.exe: job 46116227 has been allocated resources
salloc.exe: Granted job allocation 46116227
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn2144 are ready for job

[user@cn2144]$ module load colabfold
[user@cn2144]$ colabfold_batch --amber --use-gpu-relax \
    mimi_poly mimi_poly_models
2022-09-26 09:54:12,026 Running colabfold 1.3.0 (22671664ac2c9dcb30086c3e654414d950ccb297)
2022-09-26 09:54:19,458 Found 6 citations for tools or databases
2022-09-26 09:54:24,671 Query 1/17: 12 (length 73)
2022-09-26 09:54:24,720 Running model_3
2022-09-26 09:54:50,065 model_3 took 23.3s (3 recycles) with pLDDT 86.1 and ptmscore 0.622
2022-09-26 09:55:15,397 Relaxation took 15.9s
2022-09-26 09:55:15,400 Running model_4
2022-09-26 09:55:24,884 model_4 took 8.6s (3 recycles) with pLDDT 89 and ptmscore 0.668
2022-09-26 09:55:31,969 Relaxation took 5.7s
[...snip...]
[user@cn2144]$ ls -lh mimi_poly_models | head
total 570M
-rw-r--r-- 1 user group  17M Sep 26 15:36 0.a3m
-rw-r--r-- 1 user group 142K Sep 26 16:19 0_coverage.png
-rw-r--r-- 1 user group    0 Sep 26 16:19 0.done.txt
-rw-r--r-- 1 user group 743K Sep 26 16:19 0_PAE.png
-rw-r--r-- 1 user group 196K Sep 26 16:19 0_plddt.png
-rw-r--r-- 1 user group  18M Sep 26 16:19 0_predicted_aligned_error_v1.json
-rw-r--r-- 1 user group 1.5M Sep 26 16:19 0_relaxed_rank_1_model_3.pdb
-rw-r--r-- 1 user group 1.5M Sep 26 16:19 0_relaxed_rank_2_model_4.pdb
-rw-r--r-- 1 user group 1.5M Sep 26 16:19 0_relaxed_rank_3_model_2.pdb
[user@cn2144]$ exit
salloc.exe: Relinquishing job allocation 46116227
[user@biowulf]$
For each protein, the output includes diagnostic plots, the predicted aligned error (PAE) of the ranked models in JSON format, and the ranked relaxed and unrelaxed models in PDB format. As an example, the figure below shows the best ranked model (rank 1) for YP_003986740 (DNA-dependent RNA polymerase subunit B) along with some of the diagnostic plots.
Figure: (A) Best ranked model (0_relaxed_rank_1_model_3.pdb) of protein YP_003986740 colored by pLDDT. (B) pLDDT for all 5 models. (C) Coverage of YP_003986740 in the MSA generated by colabfold_search with MMseqs2. (D) Predicted aligned error (PAE) for all 5 models.
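When many proteins are modeled at once, it can be convenient to gather only the top-ranked relaxed structure of each query. The sketch below assumes the file naming seen in the directory listing above (`<query>_relaxed_rank_1_model_<n>.pdb`); the `collect_top_models` function and the destination directory are hypothetical.

```shell
#!/bin/bash
# collect_top_models MODEL_DIR DEST: copy the rank 1 relaxed model of every
# query from MODEL_DIR into DEST and list what was copied.
# Hypothetical helper; the name pattern is taken from the listing above.
collect_top_models() {
    local model_dir=$1 dest=$2 pdb
    mkdir -p "$dest"
    for pdb in "$model_dir"/*_relaxed_rank_1_model_*.pdb; do
        [ -e "$pdb" ] || continue       # the glob matched nothing
        cp "$pdb" "$dest"/
    done
    ls "$dest"
}

# Usage (hypothetical destination directory):
#   collect_top_models mimi_poly_models mimi_poly_top
```

The pattern requires an underscore directly before `relaxed`, so the unrelaxed models are not picked up.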
Multimers are predicted by concatenating all sequences of a multimer, separated by ':', into a single sequence.
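As an illustration of the ':' convention, the following sketch writes one FASTA record for a complex from its individual chains. The `make_multimer_record` function, the record name, and the toy sequences are placeholders invented for this example.

```shell
#!/bin/bash
# make_multimer_record NAME SEQ1 SEQ2 ...: print a single FASTA record whose
# sequence line joins all chains with ':'.
# Function name, record name, and sequences are hypothetical placeholders.
make_multimer_record() {
    local name=$1
    shift
    local IFS=':'                       # "$*" joins arguments with ':'
    printf '>%s\n%s\n' "$name" "$*"
}

# Usage with two toy chains:
#   make_multimer_record complex1 MKTAYIAK GSHMLEDP > complex1.fa
# complex1.fa then contains:
#   >complex1
#   MKTAYIAK:GSHMLEDP
```

For a homo-oligomer, the same chain sequence would be passed once per copy.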
Create a batch input file (e.g. colabfold-search.sh) to create the MSAs:
#!/bin/bash
module load colabfold
colabfold_search --threads $SLURM_CPUS_PER_TASK \
    $COLABFOLD_TEST_DATA/mimi_poly.fa $COLABFOLD_DB mimi_poly
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=16 --mem=128g --gres=lscratch:100 colabfold-search.sh
Then create a second batch input file (e.g. colabfold-batch.sh) to build the corresponding models:
#!/bin/bash
module load colabfold
colabfold_batch --amber --use-gpu-relax mimi_poly mimi_poly_models
Submit this job using the Slurm sbatch command.
sbatch --cpus-per-task=8 --mem=48g --gres=lscratch:100,gpu:a100:1 --partition=gpu colabfold-batch.sh
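The two jobs can also be chained so that model building starts only after the MSA job has completed successfully. The sketch below uses the standard Slurm options `--parsable` and `--dependency=afterok`; the `submit_pipeline` wrapper is hypothetical, and it assumes that `sbatch --parsable` prints only the job id (on multi-cluster setups the output may also carry a cluster name).

```shell
#!/bin/bash
# submit_pipeline: submit the MSA search job, then submit the model-building
# job with a dependency on the successful completion of the search.
# Hypothetical wrapper; script names and resources follow the examples above.
submit_pipeline() {
    local search_id
    # --parsable asks sbatch to print just the job id
    search_id=$(sbatch --parsable --cpus-per-task=16 --mem=128g \
        --gres=lscratch:100 colabfold-search.sh) || return 1
    # afterok: start only if the search job exits with status 0
    sbatch --parsable --dependency=afterok:"$search_id" \
        --cpus-per-task=8 --mem=48g --gres=lscratch:100,gpu:a100:1 \
        --partition=gpu colabfold-batch.sh
}

# Usage:
#   submit_pipeline
```

With this arrangement the GPU allocation is not requested until the CPU-only search has finished, and a failed search leaves the second job unstarted.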