ProteinMPNN: robust deep learning–based protein sequence design

ProteinMPNN is a deep learning–based protein sequence design method. Unlike AlphaFold and RoseTTAFold, which predict a protein's structure from its sequence, ProteinMPNN addresses the inverse problem: finding a sequence that folds into a given protein backbone.

References:

J. Dauparas, I. Anishchenko, N. Bennett, H. Bai, R. J. Ragotte, L. F. Milles, B. I. M. Wicky, A. Courbet, R. J. de Haas, N. Bethel, P. J. Y. Leung, T. F. Huddy, S. Pellock, D. Tischer, F. Chan, B. Koepnick, H. Nguyen, A. Kang, B. Sankaran, A. K. Bera, N. P. King, D. Baker
Robust deep learning–based protein sequence design using ProteinMPNN
Science 378, 49–56 (2022).

Important Notes

Module Name: ProteinMPNN (see the modules page for more information)
Unusual environment variables set:
- PROTEINMPNN_HOME: installation directory
- PROTEINMPNN_BIN: executable directory
- PROTEINMPNN_SRC: source code directory
- PROTEINMPNN_DATA: sample data directory
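
As a quick sanity check after loading the module, the variables listed above can be inspected from Python. This is a minimal sketch: only the four variable names come from this page, and the helper name missing_module_vars is hypothetical.

```python
import os

# Variable names taken from the module notes above.
PROTEINMPNN_VARS = (
    "PROTEINMPNN_HOME",
    "PROTEINMPNN_BIN",
    "PROTEINMPNN_SRC",
    "PROTEINMPNN_DATA",
)


def missing_module_vars(env=None):
    """Return the ProteinMPNN variables that are unset or empty in env.

    Hypothetical helper for illustration; env defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in PROTEINMPNN_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_module_vars()
    if missing:
        print("Not set (is the module loaded?):", ", ".join(missing))
    else:
        for name in PROTEINMPNN_VARS:
            print(f"{name}={os.environ[name]}")
```

An empty result from missing_module_vars() suggests the module environment is in place. It does not verify that the directories themselves exist.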

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf ~]$ interactive --gres=gpu:a100:1,lscratch:10 --mem=24g -c8
[user@cn0071 ~]$ module load ProteinMPNN
[+] Loading singularity  4.1.5  on cn0071
[+] Loading ProteinMPNN  1.0.1

[user@cn0071 ~]$ protein_mpnn_run.py -h 
usage: protein_mpnn_run.py [-h] [--ca_only] [--path_to_model_weights PATH_TO_MODEL_WEIGHTS] [--model_name MODEL_NAME]
                           [--seed SEED] [--save_score SAVE_SCORE] [--save_probs SAVE_PROBS]
                           [--score_only SCORE_ONLY] [--conditional_probs_only CONDITIONAL_PROBS_ONLY]
                           [--conditional_probs_only_backbone CONDITIONAL_PROBS_ONLY_BACKBONE]
                           [--unconditional_probs_only UNCONDITIONAL_PROBS_ONLY] [--backbone_noise BACKBONE_NOISE]
                           [--num_seq_per_target NUM_SEQ_PER_TARGET] [--batch_size BATCH_SIZE]
                           [--max_length MAX_LENGTH] [--sampling_temp SAMPLING_TEMP] [--out_folder OUT_FOLDER]
                           [--pdb_path PDB_PATH] [--pdb_path_chains PDB_PATH_CHAINS] [--jsonl_path JSONL_PATH]
                           [--chain_id_jsonl CHAIN_ID_JSONL] [--fixed_positions_jsonl FIXED_POSITIONS_JSONL]
                           [--omit_AAs OMIT_AAS] [--bias_AA_jsonl BIAS_AA_JSONL]
                           [--bias_by_res_jsonl BIAS_BY_RES_JSONL] [--omit_AA_jsonl OMIT_AA_JSONL]
                           [--pssm_jsonl PSSM_JSONL] [--pssm_multi PSSM_MULTI] [--pssm_threshold PSSM_THRESHOLD]
                           [--pssm_log_odds_flag PSSM_LOG_ODDS_FLAG] [--pssm_bias_flag PSSM_BIAS_FLAG]
                           [--tied_positions_jsonl TIED_POSITIONS_JSONL]

options:
  -h, --help            show this help message and exit
  --ca_only             Parse CA-only structures and use CA-only models (default: false) (default: False)
  --path_to_model_weights PATH_TO_MODEL_WEIGHTS
                        Path to model weights folder; (default: )
  --model_name MODEL_NAME
                        ProteinMPNN model name: v_48_002, v_48_010, v_48_020, v_48_030; v_48_010=version with 48
                        edges 0.10A noise (default: v_48_020)
  --seed SEED           If set to 0 then a random seed will be picked; (default: 0)
  --save_score SAVE_SCORE
                        0 for False, 1 for True; save score=-log_prob to npy files (default: 0)
  --save_probs SAVE_PROBS
                        0 for False, 1 for True; save MPNN predicted probabilites per position (default: 0)
  --score_only SCORE_ONLY
                        0 for False, 1 for True; score input backbone-sequence pairs (default: 0)
  --conditional_probs_only CONDITIONAL_PROBS_ONLY
                        0 for False, 1 for True; output conditional probabilities p(s_i given the rest of the
                        sequence and backbone) (default: 0)
  --conditional_probs_only_backbone CONDITIONAL_PROBS_ONLY_BACKBONE
                        0 for False, 1 for True; if true output conditional probabilities p(s_i given backbone)
                        (default: 0)
  --unconditional_probs_only UNCONDITIONAL_PROBS_ONLY
                        0 for False, 1 for True; output unconditional probabilities p(s_i given backbone) in one
                        forward pass (default: 0)
  --backbone_noise BACKBONE_NOISE
                        Standard deviation of Gaussian noise to add to backbone atoms (default: 0.0)
  --num_seq_per_target NUM_SEQ_PER_TARGET
                        Number of sequences to generate per target (default: 1)
  --batch_size BATCH_SIZE
                        Batch size; can set higher for titan, quadro GPUs, reduce this if running out of GPU memory
                        (default: 1)
  --max_length MAX_LENGTH
                        Max sequence length (default: 200000)
  --sampling_temp SAMPLING_TEMP
                        A string of temperatures, 0.2 0.25 0.5. Sampling temperature for amino acids. Suggested
                        values 0.1, 0.15, 0.2, 0.25, 0.3. Higher values will lead to more diversity. (default: 0.1)
  --out_folder OUT_FOLDER
                        Path to a folder to output sequences, e.g. /home/out/ (default: None)
  --pdb_path PDB_PATH   Path to a single PDB to be designed (default: )
  --pdb_path_chains PDB_PATH_CHAINS
                        Define which chains need to be designed for a single PDB (default: )
  --jsonl_path JSONL_PATH
                        Path to a folder with parsed pdb into jsonl (default: None)
  --chain_id_jsonl CHAIN_ID_JSONL
                        Path to a dictionary specifying which chains need to be designed and which ones are fixed, if
                        not specied all chains will be designed. (default: )
  --fixed_positions_jsonl FIXED_POSITIONS_JSONL
                        Path to a dictionary with fixed positions (default: )
  --omit_AAs OMIT_AAS   Specify which amino acids should be omitted in the generated sequence, e.g. 'AC' would omit
                        alanine and cystine. (default: X)
  --bias_AA_jsonl BIAS_AA_JSONL
                        Path to a dictionary which specifies AA composion bias if neededi, e.g. {A: -1.1, F: 0.7}
                        would make A less likely and F more likely. (default: )
  --bias_by_res_jsonl BIAS_BY_RES_JSONL
                        Path to dictionary with per position bias. (default: )
  --omit_AA_jsonl OMIT_AA_JSONL
                        Path to a dictionary which specifies which amino acids need to be omited from design at
                        specifi chain indices (default: )
  --pssm_jsonl PSSM_JSONL
                        Path to a dictionary with pssm (default: )
  --pssm_multi PSSM_MULTI
                        A value between [0.0, 1.0], 0.0 means do not use pssm, 1.0 ignore MPNN predictions (default:
                        0.0)
  --pssm_threshold PSSM_THRESHOLD
                        A value between -inf + inf to restric per position AAs (default: 0.0)
  --pssm_log_odds_flag PSSM_LOG_ODDS_FLAG
                        0 for False, 1 for True (default: 0)
  --pssm_bias_flag PSSM_BIAS_FLAG
                        0 for False, 1 for True (default: 0)
  --tied_positions_jsonl TIED_POSITIONS_JSONL
                        Path to a dictionary with tied positions (default: )
[user@cn0071 ~]$ git clone https://github.com/dauparas/ProteinMPNN
[user@cn0071 ~]$ protein_mpnn_run.py  \
                    --pdb_path ProteinMPNN/inputs/PDB_complexes/pdbs/3HTN.pdb \
                    --out_folder ./3HTN_designs/   \
                    --num_seq_per_target 10   \
                    --sampling_temp "0.1"  \
                    --seed 0  \
                    --batch_size 1 \
                    --model_name v_48_020
----------------------------------------
chain_id_jsonl is NOT loaded
----------------------------------------
fixed_positions_jsonl is NOT loaded
----------------------------------------
pssm_jsonl is NOT loaded
----------------------------------------
omit_AA_jsonl is NOT loaded
----------------------------------------
bias_AA_jsonl is NOT loaded
----------------------------------------
tied_positions_jsonl is NOT loaded
----------------------------------------
bias by residue dictionary is not loaded, or not provided
----------------------------------------
----------------------------------------
Number of edges: 48
Training noise level: 0.2A
Generating sequences for: 3HTN
10 sequences of length 429 generated in 9.4706 seconds
[user@cn0071 ~]$ dl_interface_design.py -h
┌──────────────────────────────────────────────────────────────────────────────┐
│                                 PyRosetta-4                                  │
│              Created in JHU by Sergey Lyskov and PyRosetta Team              │
│              (C) Copyright Rosetta Commons Member Institutions               │
│                                                                              │
│ NOTE: USE OF PyRosetta FOR COMMERCIAL PURPOSES REQUIRE PURCHASE OF A LICENSE │
│         See LICENSE.PyRosetta.md or email license@uw.edu for details         │
└──────────────────────────────────────────────────────────────────────────────┘
PyRosetta-4 2024 [Rosetta PyRosetta4.Release.python39.linux 2024.39+release.59628fbc5bc09f1221e1642f1f8d157ce49b1410 2024-09-23T07:49:48] retrieved from: http://www.pyrosetta.org
usage: dl_interface_design.py [-h] [-pdbdir PDBDIR] [-silent SILENT] [-outpdbdir OUTPDBDIR] [-outsilent OUTSILENT]
                              [-runlist RUNLIST] [-checkpoint_name CHECKPOINT_NAME] [-debug]
                              [-relax_cycles RELAX_CYCLES] [-output_intermediates] [-seqs_per_struct SEQS_PER_STRUCT]
                              [-checkpoint_path CHECKPOINT_PATH] [-temperature TEMPERATURE]
                              [-augment_eps AUGMENT_EPS] [-protein_features PROTEIN_FEATURES] [-omit_AAs OMIT_AAS]
                              [-bias_AA_jsonl BIAS_AA_JSONL] [-num_connections NUM_CONNECTIONS]

optional arguments:
  -h, --help            show this help message and exit
  -pdbdir PDBDIR        The name of a directory of pdbs to run through the model
  -silent SILENT        The name of a silent file to run through the model
  -outpdbdir OUTPDBDIR  The directory to which the output PDB files will be written, used if the -pdbdir arg is
                        active
  -outsilent OUTSILENT  The name of the silent file to which output structs will be written, used if the -silent arg
                        is active
  -runlist RUNLIST      The path of a list of pdb tags to run, only active when the -pdbdir arg is active (default:
                        ''; Run all PDBs)
  -checkpoint_name CHECKPOINT_NAME
                        The name of a file where tags which have finished will be written (default: check.point)
  -debug                When active, errors will cause the script to crash and the error message to be printed out
                        (default: False)
  -relax_cycles RELAX_CYCLES
                        The number of relax cycles to perform on each structure (default: 1)
  -output_intermediates
                        Whether to write all intermediate sequences from the relax cycles to disk (default: False)
  -seqs_per_struct SEQS_PER_STRUCT
                        The number of sequences to generate for each structure (default: 1)
  -checkpoint_path CHECKPOINT_PATH
                        The path to the ProteinMPNN weights you wish to use, default
                        /opt/dl_binder_design/mpnn_fr/ProteinMPNN/vanilla_model_weights/v_48_020.pt
  -temperature TEMPERATURE
                        The sampling temperature to use when running ProteinMPNN (default: 0.000001)
  -augment_eps AUGMENT_EPS
                        The variance of random noise to add to the atomic coordinates (default 0)
  -protein_features PROTEIN_FEATURES
                        What type of protein features to input to ProteinMPNN (default: full)
  -omit_AAs OMIT_AAS    A string of all residue types (one letter case-insensitive) that you would not like to use
                        for design. Letters not corresponding to residue types will be ignored (default: CX)
  -bias_AA_jsonl BIAS_AA_JSONL
                        The path to a JSON file containing a dictionary mapping residue one-letter names to the bias
                        for that residue eg. {A: -1.1, F: 0.7} (default: ; no bias)
  -num_connections NUM_CONNECTIONS
                        Number of neighbors each residue is connected to. Do not mess around with this argument
                        unless you have a specific set of ProteinMPNN weights which expects a different number of
                        connections. (default: 48)

[user@cn0071 ~]$ dl_interface_design.py -silent ./dl_binder_design/examples/inputs/proteinmpnn_output.silent
┌──────────────────────────────────────────────────────────────────────────────┐
│                                 PyRosetta-4                                  │
│              Created in JHU by Sergey Lyskov and PyRosetta Team              │
│              (C) Copyright Rosetta Commons Member Institutions               │
│                                                                              │
│ NOTE: USE OF PyRosetta FOR COMMERCIAL PURPOSES REQUIRE PURCHASE OF A LICENSE │
│         See LICENSE.PyRosetta.md or email license@uw.edu for details         │
└──────────────────────────────────────────────────────────────────────────────┘
PyRosetta-4 2024 [Rosetta PyRosetta4.Release.python39.linux 2024.39+release.59628fbc5bc09f1221e1642f1f8d157ce49b1410 2024-09-23T07:49:48] retrieved from: http://www.pyrosetta.org
Found GPU will run ProteinMPNN on GPU
Attempting pose: design_ppi_0_dldesign_0
ProteinMPNN generated 1 sequences in 4 seconds
Running FastRelax
Completed one cycle of FastRelax in 44.01047921180725 seconds
ProteinMPNN generated 1 sequences in 0 seconds
Struct: design_ppi_0_dldesign_0 reported success in 48 seconds
Attempting pose: design_ppi_0_dldesign_1
ProteinMPNN generated 1 sequences in 0 seconds
Running FastRelax
Completed one cycle of FastRelax in 42.61138868331909 seconds
ProteinMPNN generated 1 sequences in 0 seconds
Struct: design_ppi_0_dldesign_1 reported success in 43 seconds
Attempting pose: design_ppi_0_dldesign_2
ProteinMPNN generated 1 sequences in 0 seconds
Running FastRelax
Completed one cycle of FastRelax in 45.7072536945343 seconds
ProteinMPNN generated 1 sequences in 0 seconds
...
Struct: design_ppi_9_dldesign_2 reported success in 81 seconds
Attempting pose: design_ppi_9_dldesign_3
ProteinMPNN generated 1 sequences in 0 seconds
Running FastRelax
Completed one cycle of FastRelax in 63.77409362792969 seconds
ProteinMPNN generated 1 sequences in 0 seconds
Struct: design_ppi_9_dldesign_3 reported success in 65 seconds

[user@cn0071 ~]$ af2_predict.py -h
┌──────────────────────────────────────────────────────────────────────────────┐
│                                 PyRosetta-4                                  │
│              Created in JHU by Sergey Lyskov and PyRosetta Team              │
│              (C) Copyright Rosetta Commons Member Institutions               │
│                                                                              │
│ NOTE: USE OF PyRosetta FOR COMMERCIAL PURPOSES REQUIRE PURCHASE OF A LICENSE │
│         See LICENSE.PyRosetta.md or email license@uw.edu for details         │
└──────────────────────────────────────────────────────────────────────────────┘
PyRosetta-4 2024 [Rosetta PyRosetta4.Release.python311.linux 2024.39+release.59628fbc5bc09f1221e1642f1f8d157ce49b1410 2024-09-23T07:49:48] retrieved from: http://www.pyrosetta.org
usage: predict.py [-h] [-pdbdir PDBDIR] [-silent SILENT] [-outpdbdir OUTPDBDIR] [-outsilent OUTSILENT]
                  [-runlist RUNLIST] [-checkpoint_name CHECKPOINT_NAME] [-scorefilename SCOREFILENAME]
                  [-maintain_res_numbering] [-debug] [-max_amide_dist MAX_AMIDE_DIST] [-recycle RECYCLE]
                  [-no_initial_guess] [-force_monomer]

options:
  -h, --help            show this help message and exit
  -pdbdir PDBDIR        The name of a directory of pdbs to run through the model
  -silent SILENT        The name of a silent file to run through the model
  -outpdbdir OUTPDBDIR  The directory to which the output PDB files will be written. Only used when -pdbdir is active
  -outsilent OUTSILENT  The name of the silent file to which output structs will be written. Only used when -silent
                        is active
  -runlist RUNLIST      The path of a list of pdb tags to run. Only used when -pdbdir is active (default: ''; Run all
                        PDBs)
  -checkpoint_name CHECKPOINT_NAME
                        The name of a file where tags which have finished will be written (default: check.point)
  -scorefilename SCOREFILENAME
                        The name of a file where scores will be written (default: out.sc)
  -maintain_res_numbering
                        When active, the model will not renumber the residues when bad inputs are encountered
                        (default: False)
  -debug                When active, errors will cause the script to crash and the error message to be printed out
                        (default: False)
  -max_amide_dist MAX_AMIDE_DIST
                        The maximum distance between an amide bond's carbon and nitrogen (default: 3.0)
  -recycle RECYCLE      The number of AF2 recycles to perform (default: 3)
  -no_initial_guess     When active, the model will not use an initial guess (default: False)
  -force_monomer        When active, the model will predict the structure of a monomer (default: False)
[user@cn0071 ~]$ af2_predict.py -silent ./dl_binder_design/examples/inputs/proteinmpnn_output.silent
...
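
The --sampling_temp help text in the session above notes that higher temperatures lead to more diversity. The sketch below illustrates that effect with a generic temperature-scaled softmax. It is not ProteinMPNN's own sampling code, and the logits are made-up values for three hypothetical amino acid choices at one position.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical logits for three amino acids at one position (illustrative only).
logits = [2.0, 1.0, 0.5]
for t in (0.1, 0.3, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all probability mass falls on the highest-scoring choice, so repeated samples tend to agree. Raising the temperature flattens the distribution and makes the sampled sequences more varied.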

End the interactive session:

[user@cn0071 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
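
The --bias_AA_jsonl option shown in the help output takes a path to a dictionary such as {A: -1.1, F: 0.7}, which makes A less likely and F more likely. A minimal sketch for producing such a file follows. It assumes the file holds a single JSON object on one line. The function names and the file name bias_aa.jsonl are hypothetical.

```python
import json

# The 20 standard amino acids in one-letter code.
STANDARD_AAS = set("ACDEFGHIKLMNPQRSTVWY")


def format_bias_aa(bias):
    """Return a one-line JSON string mapping one-letter codes to bias values."""
    unknown = sorted(set(bias) - STANDARD_AAS)
    if unknown:
        raise ValueError(f"not standard one-letter codes: {unknown}")
    return json.dumps(bias)


def write_bias_aa_jsonl(path, bias):
    """Write the bias dictionary to path as a single JSON line.

    Assumption: one JSON object per file, as described in the lead-in.
    """
    with open(path, "w") as handle:
        handle.write(format_bias_aa(bias) + "\n")


# Values from the help text: make A less likely and F more likely.
print(format_bias_aa({"A": -1.1, "F": 0.7}))
```

Calling write_bias_aa_jsonl("bias_aa.jsonl", {"A": -1.1, "F": 0.7}) would create a file whose path can then be passed to protein_mpnn_run.py via --bias_AA_jsonl.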