ProteinMPNN: robust deep learning–based protein sequence design
ProteinMPNN is a deep learning–based protein sequence design method. Unlike AlphaFold and RoseTTAFold, which predict a protein's structure from its sequence, ProteinMPNN addresses the inverse problem: finding an amino acid sequence that folds into a given protein backbone.
References:
- J. Dauparas, I. Anishchenko, N. Bennett, H. Bai, R. J. Ragotte, L. F. Milles, B. I. M. Wicky,
A. Courbet, R. J. de Haas, N. Bethel, P. J. Y. Leung, T. F. Huddy, S. Pellock, D. Tischer,
F. Chan, B. Koepnick, H. Nguyen, A. Kang, B. Sankaran, A. K. Bera, N. P. King, D. Baker.
Robust deep learning–based protein sequence design using ProteinMPNN.
Science 378, 49–56 (2022).
Documentation
- ProteinMPNN Github page
- ProteinMPNN tutorial
- A Step-by-Step Guide to Deploying a Protein Binder Design pipeline
- dl_binder_design tutorial
Important Notes
- Module Name: ProteinMPNN (see the modules page for more information)
- Unusual environment variables set:
  - PROTEINMPNN_HOME: installation directory
  - PROTEINMPNN_BIN: executable directory
  - PROTEINMPNN_SRC: source code directory
  - PROTEINMPNN_DATA: sample data directory
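As a quick sanity check after loading the module, a minimal Python sketch such as the following reports where these variables point. Only the four variable names come from the list above; the helper name `proteinmpnn_paths` is illustrative and not part of ProteinMPNN.

```python
import os

# Variable names taken from the module notes above.
PROTEINMPNN_VARS = (
    "PROTEINMPNN_HOME",
    "PROTEINMPNN_BIN",
    "PROTEINMPNN_SRC",
    "PROTEINMPNN_DATA",
)


def proteinmpnn_paths(environ=None):
    """Return {variable name: value or None} for the ProteinMPNN module variables.

    `environ` defaults to the current process environment; a plain dict can be
    passed instead (hypothetical helper, useful for testing).
    """
    if environ is None:
        environ = os.environ
    return {name: environ.get(name) for name in PROTEINMPNN_VARS}


if __name__ == "__main__":
    for name, value in proteinmpnn_paths().items():
        # A missing value usually means the module has not been loaded yet.
        print(f"{name}={value if value is not None else '(not set)'}")
```

If every variable prints as `(not set)`, the module has most likely not been loaded in the current shell.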
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ interactive --gres=gpu:a100:1,lscratch:10 --mem=24g -c8
[user@cn0071 ~]$ module load ProteinMPNN
[+] Loading singularity 4.1.5 on cn0071
[+] Loading ProteinMPNN 1.0.1
[user@cn0071 ~]$ protein_mpnn_run.py -h
usage: protein_mpnn_run.py [-h] [--ca_only]
                           [--path_to_model_weights PATH_TO_MODEL_WEIGHTS]
                           [--model_name MODEL_NAME] [--seed SEED]
                           [--save_score SAVE_SCORE] [--save_probs SAVE_PROBS]
                           [--score_only SCORE_ONLY]
                           [--conditional_probs_only CONDITIONAL_PROBS_ONLY]
                           [--conditional_probs_only_backbone CONDITIONAL_PROBS_ONLY_BACKBONE]
                           [--unconditional_probs_only UNCONDITIONAL_PROBS_ONLY]
                           [--backbone_noise BACKBONE_NOISE]
                           [--num_seq_per_target NUM_SEQ_PER_TARGET]
                           [--batch_size BATCH_SIZE] [--max_length MAX_LENGTH]
                           [--sampling_temp SAMPLING_TEMP]
                           [--out_folder OUT_FOLDER] [--pdb_path PDB_PATH]
                           [--pdb_path_chains PDB_PATH_CHAINS]
                           [--jsonl_path JSONL_PATH]
                           [--chain_id_jsonl CHAIN_ID_JSONL]
                           [--fixed_positions_jsonl FIXED_POSITIONS_JSONL]
                           [--omit_AAs OMIT_AAS]
                           [--bias_AA_jsonl BIAS_AA_JSONL]
                           [--bias_by_res_jsonl BIAS_BY_RES_JSONL]
                           [--omit_AA_jsonl OMIT_AA_JSONL]
                           [--pssm_jsonl PSSM_JSONL] [--pssm_multi PSSM_MULTI]
                           [--pssm_threshold PSSM_THRESHOLD]
                           [--pssm_log_odds_flag PSSM_LOG_ODDS_FLAG]
                           [--pssm_bias_flag PSSM_BIAS_FLAG]
                           [--tied_positions_jsonl TIED_POSITIONS_JSONL]

options:
  -h, --help            show this help message and exit
  --ca_only             Parse CA-only structures and use CA-only models (default: false) (default: False)
  --path_to_model_weights PATH_TO_MODEL_WEIGHTS
                        Path to model weights folder; (default: )
  --model_name MODEL_NAME
                        ProteinMPNN model name: v_48_002, v_48_010, v_48_020, v_48_030; v_48_010=version with 48 edges 0.10A noise (default: v_48_020)
  --seed SEED           If set to 0 then a random seed will be picked; (default: 0)
  --save_score SAVE_SCORE
                        0 for False, 1 for True; save score=-log_prob to npy files (default: 0)
  --save_probs SAVE_PROBS
                        0 for False, 1 for True; save MPNN predicted probabilites per position (default: 0)
  --score_only SCORE_ONLY
                        0 for False, 1 for True; score input backbone-sequence pairs (default: 0)
  --conditional_probs_only CONDITIONAL_PROBS_ONLY
                        0 for False, 1 for True; output conditional probabilities p(s_i given the rest of the sequence and backbone) (default: 0)
  --conditional_probs_only_backbone CONDITIONAL_PROBS_ONLY_BACKBONE
                        0 for False, 1 for True; if true output conditional probabilities p(s_i given backbone) (default: 0)
  --unconditional_probs_only UNCONDITIONAL_PROBS_ONLY
                        0 for False, 1 for True; output unconditional probabilities p(s_i given backbone) in one forward pass (default: 0)
  --backbone_noise BACKBONE_NOISE
                        Standard deviation of Gaussian noise to add to backbone atoms (default: 0.0)
  --num_seq_per_target NUM_SEQ_PER_TARGET
                        Number of sequences to generate per target (default: 1)
  --batch_size BATCH_SIZE
                        Batch size; can set higher for titan, quadro GPUs, reduce this if running out of GPU memory (default: 1)
  --max_length MAX_LENGTH
                        Max sequence length (default: 200000)
  --sampling_temp SAMPLING_TEMP
                        A string of temperatures, 0.2 0.25 0.5. Sampling temperature for amino acids. Suggested values 0.1, 0.15, 0.2, 0.25, 0.3. Higher values will lead to more diversity. (default: 0.1)
  --out_folder OUT_FOLDER
                        Path to a folder to output sequences, e.g. /home/out/ (default: None)
  --pdb_path PDB_PATH   Path to a single PDB to be designed (default: )
  --pdb_path_chains PDB_PATH_CHAINS
                        Define which chains need to be designed for a single PDB (default: )
  --jsonl_path JSONL_PATH
                        Path to a folder with parsed pdb into jsonl (default: None)
  --chain_id_jsonl CHAIN_ID_JSONL
                        Path to a dictionary specifying which chains need to be designed and which ones are fixed, if not specied all chains will be designed. (default: )
  --fixed_positions_jsonl FIXED_POSITIONS_JSONL
                        Path to a dictionary with fixed positions (default: )
  --omit_AAs OMIT_AAS   Specify which amino acids should be omitted in the generated sequence, e.g. 'AC' would omit alanine and cystine. (default: X)
  --bias_AA_jsonl BIAS_AA_JSONL
                        Path to a dictionary which specifies AA composion bias if neededi, e.g. {A: -1.1, F: 0.7} would make A less likely and F more likely. (default: )
  --bias_by_res_jsonl BIAS_BY_RES_JSONL
                        Path to dictionary with per position bias. (default: )
  --omit_AA_jsonl OMIT_AA_JSONL
                        Path to a dictionary which specifies which amino acids need to be omited from design at specifi chain indices (default: )
  --pssm_jsonl PSSM_JSONL
                        Path to a dictionary with pssm (default: )
  --pssm_multi PSSM_MULTI
                        A value between [0.0, 1.0], 0.0 means do not use pssm, 1.0 ignore MPNN predictions (default: 0.0)
  --pssm_threshold PSSM_THRESHOLD
                        A value between -inf + inf to restric per position AAs (default: 0.0)
  --pssm_log_odds_flag PSSM_LOG_ODDS_FLAG
                        0 for False, 1 for True (default: 0)
  --pssm_bias_flag PSSM_BIAS_FLAG
                        0 for False, 1 for True (default: 0)
  --tied_positions_jsonl TIED_POSITIONS_JSONL
                        Path to a dictionary with tied positions (default: )
[user@cn0071 ~]$ git clone https://github.com/dauparas/ProteinMPNN
[user@cn0071 ~]$ protein_mpnn_run.py \
    --pdb_path ProteinMPNN/inputs/PDB_complexes/pdbs/3HTN.pdb \
    --out_folder ./3HTN_designs/ \
    --num_seq_per_target 10 \
    --sampling_temp "0.1" \
    --seed 0 \
    --batch_size 1 \
    --model_name v_48_020
----------------------------------------
chain_id_jsonl is NOT loaded
----------------------------------------
fixed_positions_jsonl is NOT loaded
----------------------------------------
pssm_jsonl is NOT loaded
----------------------------------------
omit_AA_jsonl is NOT loaded
----------------------------------------
bias_AA_jsonl is NOT loaded
----------------------------------------
tied_positions_jsonl is NOT loaded
----------------------------------------
bias by residue dictionary is not loaded, or not provided
----------------------------------------
----------------------------------------
Number of edges: 48
Training noise level: 0.2A
Generating sequences for: 3HTN
10 sequences of length 429 generated in 9.4706 seconds
[user@cn0071 ~]$ dl_interface_design.py -h
┌──────────────────────────────────────────────────────────────────────────────┐
│                                 PyRosetta-4                                  │
│              Created in JHU by Sergey Lyskov and PyRosetta Team              │
│              (C) Copyright Rosetta Commons Member Institutions               │
│                                                                              │
│ NOTE: USE OF PyRosetta FOR COMMERCIAL PURPOSES REQUIRE PURCHASE OF A LICENSE │
│         See LICENSE.PyRosetta.md or email license@uw.edu for details         │
└──────────────────────────────────────────────────────────────────────────────┘
PyRosetta-4 2024 [Rosetta PyRosetta4.Release.python39.linux 2024.39+release.59628fbc5bc09f1221e1642f1f8d157ce49b1410 2024-09-23T07:49:48] retrieved from: http://www.pyrosetta.org
usage: dl_interface_design.py [-h] [-pdbdir PDBDIR] [-silent SILENT]
                              [-outpdbdir OUTPDBDIR] [-outsilent OUTSILENT]
                              [-runlist RUNLIST]
                              [-checkpoint_name CHECKPOINT_NAME] [-debug]
                              [-relax_cycles RELAX_CYCLES]
                              [-output_intermediates]
                              [-seqs_per_struct SEQS_PER_STRUCT]
                              [-checkpoint_path CHECKPOINT_PATH]
                              [-temperature TEMPERATURE]
                              [-augment_eps AUGMENT_EPS]
                              [-protein_features PROTEIN_FEATURES]
                              [-omit_AAs OMIT_AAS]
                              [-bias_AA_jsonl BIAS_AA_JSONL]
                              [-num_connections NUM_CONNECTIONS]

optional arguments:
  -h, --help            show this help message and exit
  -pdbdir PDBDIR        The name of a directory of pdbs to run through the model
  -silent SILENT        The name of a silent file to run through the model
  -outpdbdir OUTPDBDIR  The directory to which the output PDB files will be written, used if the -pdbdir arg is active
  -outsilent OUTSILENT  The name of the silent file to which output structs will be written, used if the -silent arg is active
  -runlist RUNLIST      The path of a list of pdb tags to run, only active when the -pdbdir arg is active (default: ''; Run all PDBs)
  -checkpoint_name CHECKPOINT_NAME
                        The name of a file where tags which have finished will be written (default: check.point)
  -debug                When active, errors will cause the script to crash and the error message to be printed out (default: False)
  -relax_cycles RELAX_CYCLES
                        The number of relax cycles to perform on each structure (default: 1)
  -output_intermediates
                        Whether to write all intermediate sequences from the relax cycles to disk (default: False)
  -seqs_per_struct SEQS_PER_STRUCT
                        The number of sequences to generate for each structure (default: 1)
  -checkpoint_path CHECKPOINT_PATH
                        The path to the ProteinMPNN weights you wish to use, default /opt/dl_binder_design/mpnn_fr/ProteinMPNN/vanilla_model_weights/v_48_020.pt
  -temperature TEMPERATURE
                        The sampling temperature to use when running ProteinMPNN (default: 0.000001)
  -augment_eps AUGMENT_EPS
                        The variance of random noise to add to the atomic coordinates (default 0)
  -protein_features PROTEIN_FEATURES
                        What type of protein features to input to ProteinMPNN (default: full)
  -omit_AAs OMIT_AAS    A string of all residue types (one letter case-insensitive) that you would not like to use for design. Letters not corresponding to residue types will be ignored (default: CX)
  -bias_AA_jsonl BIAS_AA_JSONL
                        The path to a JSON file containing a dictionary mapping residue one-letter names to the bias for that residue eg. {A: -1.1, F: 0.7} (default: ; no bias)
  -num_connections NUM_CONNECTIONS
                        Number of neighbors each residue is connected to. Do not mess around with this argument unless you have a specific set of ProteinMPNN weights which expects a different number of connections. (default: 48)
[user@cn0071 ~]$ dl_interface_design.py -silent ./dl_binder_design/examples/inputs/proteinmpnn_output.silent
┌──────────────────────────────────────────────────────────────────────────────┐
│                                 PyRosetta-4                                  │
│              Created in JHU by Sergey Lyskov and PyRosetta Team              │
│              (C) Copyright Rosetta Commons Member Institutions               │
│                                                                              │
│ NOTE: USE OF PyRosetta FOR COMMERCIAL PURPOSES REQUIRE PURCHASE OF A LICENSE │
│         See LICENSE.PyRosetta.md or email license@uw.edu for details         │
└──────────────────────────────────────────────────────────────────────────────┘
PyRosetta-4 2024 [Rosetta PyRosetta4.Release.python39.linux 2024.39+release.59628fbc5bc09f1221e1642f1f8d157ce49b1410 2024-09-23T07:49:48] retrieved from: http://www.pyrosetta.org
Found GPU will run ProteinMPNN on GPU
Attempting pose: design_ppi_0_dldesign_0
ProteinMPNN generated 1 sequences in 4 seconds
Running FastRelax
Completed one cycle of FastRelax in 44.01047921180725 seconds
ProteinMPNN generated 1 sequences in 0 seconds
Struct: design_ppi_0_dldesign_0 reported success in 48 seconds
Attempting pose: design_ppi_0_dldesign_1
ProteinMPNN generated 1 sequences in 0 seconds
Running FastRelax
Completed one cycle of FastRelax in 42.61138868331909 seconds
ProteinMPNN generated 1 sequences in 0 seconds
Struct: design_ppi_0_dldesign_1 reported success in 43 seconds
Attempting pose: design_ppi_0_dldesign_2
ProteinMPNN generated 1 sequences in 0 seconds
Running FastRelax
Completed one cycle of FastRelax in 45.7072536945343 seconds
ProteinMPNN generated 1 sequences in 0 seconds
...
Struct: design_ppi_9_dldesign_2 reported success in 81 seconds
Attempting pose: design_ppi_9_dldesign_3
ProteinMPNN generated 1 sequences in 0 seconds
Running FastRelax
Completed one cycle of FastRelax in 63.77409362792969 seconds
ProteinMPNN generated 1 sequences in 0 seconds
Struct: design_ppi_9_dldesign_3 reported success in 65 seconds
[user@cn0071 ~]$ af2_predict.py -h
┌──────────────────────────────────────────────────────────────────────────────┐
│                                 PyRosetta-4                                  │
│              Created in JHU by Sergey Lyskov and PyRosetta Team              │
│              (C) Copyright Rosetta Commons Member Institutions               │
│                                                                              │
│ NOTE: USE OF PyRosetta FOR COMMERCIAL PURPOSES REQUIRE PURCHASE OF A LICENSE │
│         See LICENSE.PyRosetta.md or email license@uw.edu for details         │
└──────────────────────────────────────────────────────────────────────────────┘
PyRosetta-4 2024 [Rosetta PyRosetta4.Release.python311.linux 2024.39+release.59628fbc5bc09f1221e1642f1f8d157ce49b1410 2024-09-23T07:49:48] retrieved from: http://www.pyrosetta.org
usage: predict.py [-h] [-pdbdir PDBDIR] [-silent SILENT]
                  [-outpdbdir OUTPDBDIR] [-outsilent OUTSILENT]
                  [-runlist RUNLIST] [-checkpoint_name CHECKPOINT_NAME]
                  [-scorefilename SCOREFILENAME] [-maintain_res_numbering]
                  [-debug] [-max_amide_dist MAX_AMIDE_DIST]
                  [-recycle RECYCLE] [-no_initial_guess] [-force_monomer]

options:
  -h, --help            show this help message and exit
  -pdbdir PDBDIR        The name of a directory of pdbs to run through the model
  -silent SILENT        The name of a silent file to run through the model
  -outpdbdir OUTPDBDIR  The directory to which the output PDB files will be written. Only used when -pdbdir is active
  -outsilent OUTSILENT  The name of the silent file to which output structs will be written. Only used when -silent is active
  -runlist RUNLIST      The path of a list of pdb tags to run. Only used when -pdbdir is active (default: ''; Run all PDBs)
  -checkpoint_name CHECKPOINT_NAME
                        The name of a file where tags which have finished will be written (default: check.point)
  -scorefilename SCOREFILENAME
                        The name of a file where scores will be written (default: out.sc)
  -maintain_res_numbering
                        When active, the model will not renumber the residues when bad inputs are encountered (default: False)
  -debug                When active, errors will cause the script to crash and the error message to be printed out (default: False)
  -max_amide_dist MAX_AMIDE_DIST
                        The maximum distance between an amide bond's carbon and nitrogen (default: 3.0)
  -recycle RECYCLE      The number of AF2 recycles to perform (default: 3)
  -no_initial_guess     When active, the model will not use an initial guess (default: False)
  -force_monomer        When active, the model will predict the structure of a monomer (default: False)
[user@cn0071 ~]$ af2_predict.py -silent ./dl_binder_design/examples/inputs/proteinmpnn_output.silent
...
End the interactive session:
[user@cn0071 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
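The --bias_AA_jsonl option listed in the protein_mpnn_run.py help output takes a path to a dictionary such as {A: -1.1, F: 0.7}. The following is a minimal sketch of preparing such a file. It assumes the file holds a single JSON object on one line with quoted one-letter keys, which is an assumption about the expected layout and not something the help text states. The function name `write_bias_jsonl`, the file name `bias_aa.jsonl`, and the check against the 20 standard residues are illustrative only.

```python
import json
from pathlib import Path

# The 20 standard one-letter amino acid codes (validation is an illustrative choice).
STANDARD_AAS = set("ACDEFGHIKLMNPQRSTVWY")


def write_bias_jsonl(bias, path):
    """Write an amino-acid bias dictionary as one JSON object on a single line.

    Negative values are meant to make a residue less likely and positive values
    more likely, following the example in the help text.
    """
    unknown = sorted(set(bias) - STANDARD_AAS)
    if unknown:
        raise ValueError(f"not standard one-letter amino acid codes: {unknown}")
    # Coerce values to float so the file always contains plain JSON numbers.
    cleaned = {aa: float(value) for aa, value in bias.items()}
    path = Path(path)
    path.write_text(json.dumps(cleaned) + "\n")
    return path


if __name__ == "__main__":
    # Bias values copied from the example in the help text.
    out = write_bias_jsonl({"A": -1.1, "F": 0.7}, "bias_aa.jsonl")
    print(out.read_text(), end="")
```

Under that assumption, the resulting file could be passed to protein_mpnn_run.py as --bias_AA_jsonl bias_aa.jsonl alongside the options used in the sample session.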