ProteinMPNN is a deep learning–based protein sequence design method. Unlike AlphaFold and RoseTTAFold, which predict protein structure from sequence, ProteinMPNN addresses the inverse problem: finding a sequence that is compatible with a given protein backbone.
Allocate an interactive session and run the program. The sample session below runs protein_mpnn_run.py, dl_interface_design.py, and af2_predict.py in turn:
[user@biowulf]$ interactive --gres=gpu:a100:1,lscratch:10 --mem=24g -c8
[user@cn0071 ~]$ module load ProteinMPNN
[+] Loading singularity 4.1.5 on cn0071
[+] Loading ProteinMPNN 1.0.1
[user@cn0071 ~]$ protein_mpnn_run.py -h
usage: protein_mpnn_run.py [-h] [--ca_only] [--path_to_model_weights PATH_TO_MODEL_WEIGHTS] [--model_name MODEL_NAME]
[--seed SEED] [--save_score SAVE_SCORE] [--save_probs SAVE_PROBS]
[--score_only SCORE_ONLY] [--conditional_probs_only CONDITIONAL_PROBS_ONLY]
[--conditional_probs_only_backbone CONDITIONAL_PROBS_ONLY_BACKBONE]
[--unconditional_probs_only UNCONDITIONAL_PROBS_ONLY] [--backbone_noise BACKBONE_NOISE]
[--num_seq_per_target NUM_SEQ_PER_TARGET] [--batch_size BATCH_SIZE]
[--max_length MAX_LENGTH] [--sampling_temp SAMPLING_TEMP] [--out_folder OUT_FOLDER]
[--pdb_path PDB_PATH] [--pdb_path_chains PDB_PATH_CHAINS] [--jsonl_path JSONL_PATH]
[--chain_id_jsonl CHAIN_ID_JSONL] [--fixed_positions_jsonl FIXED_POSITIONS_JSONL]
[--omit_AAs OMIT_AAS] [--bias_AA_jsonl BIAS_AA_JSONL]
[--bias_by_res_jsonl BIAS_BY_RES_JSONL] [--omit_AA_jsonl OMIT_AA_JSONL]
[--pssm_jsonl PSSM_JSONL] [--pssm_multi PSSM_MULTI] [--pssm_threshold PSSM_THRESHOLD]
[--pssm_log_odds_flag PSSM_LOG_ODDS_FLAG] [--pssm_bias_flag PSSM_BIAS_FLAG]
[--tied_positions_jsonl TIED_POSITIONS_JSONL]
options:
-h, --help show this help message and exit
--ca_only Parse CA-only structures and use CA-only models (default: false) (default: False)
--path_to_model_weights PATH_TO_MODEL_WEIGHTS
Path to model weights folder; (default: )
--model_name MODEL_NAME
ProteinMPNN model name: v_48_002, v_48_010, v_48_020, v_48_030; v_48_010=version with 48
edges 0.10A noise (default: v_48_020)
--seed SEED If set to 0 then a random seed will be picked; (default: 0)
--save_score SAVE_SCORE
0 for False, 1 for True; save score=-log_prob to npy files (default: 0)
--save_probs SAVE_PROBS
0 for False, 1 for True; save MPNN predicted probabilites per position (default: 0)
--score_only SCORE_ONLY
0 for False, 1 for True; score input backbone-sequence pairs (default: 0)
--conditional_probs_only CONDITIONAL_PROBS_ONLY
0 for False, 1 for True; output conditional probabilities p(s_i given the rest of the
sequence and backbone) (default: 0)
--conditional_probs_only_backbone CONDITIONAL_PROBS_ONLY_BACKBONE
0 for False, 1 for True; if true output conditional probabilities p(s_i given backbone)
(default: 0)
--unconditional_probs_only UNCONDITIONAL_PROBS_ONLY
0 for False, 1 for True; output unconditional probabilities p(s_i given backbone) in one
forward pass (default: 0)
--backbone_noise BACKBONE_NOISE
Standard deviation of Gaussian noise to add to backbone atoms (default: 0.0)
--num_seq_per_target NUM_SEQ_PER_TARGET
Number of sequences to generate per target (default: 1)
--batch_size BATCH_SIZE
Batch size; can set higher for titan, quadro GPUs, reduce this if running out of GPU memory
(default: 1)
--max_length MAX_LENGTH
Max sequence length (default: 200000)
--sampling_temp SAMPLING_TEMP
A string of temperatures, 0.2 0.25 0.5. Sampling temperature for amino acids. Suggested
values 0.1, 0.15, 0.2, 0.25, 0.3. Higher values will lead to more diversity. (default: 0.1)
--out_folder OUT_FOLDER
Path to a folder to output sequences, e.g. /home/out/ (default: None)
--pdb_path PDB_PATH Path to a single PDB to be designed (default: )
--pdb_path_chains PDB_PATH_CHAINS
Define which chains need to be designed for a single PDB (default: )
--jsonl_path JSONL_PATH
Path to a folder with parsed pdb into jsonl (default: None)
--chain_id_jsonl CHAIN_ID_JSONL
Path to a dictionary specifying which chains need to be designed and which ones are fixed, if
not specied all chains will be designed. (default: )
--fixed_positions_jsonl FIXED_POSITIONS_JSONL
Path to a dictionary with fixed positions (default: )
--omit_AAs OMIT_AAS Specify which amino acids should be omitted in the generated sequence, e.g. 'AC' would omit
alanine and cystine. (default: X)
--bias_AA_jsonl BIAS_AA_JSONL
Path to a dictionary which specifies AA composion bias if neededi, e.g. {A: -1.1, F: 0.7}
would make A less likely and F more likely. (default: )
--bias_by_res_jsonl BIAS_BY_RES_JSONL
Path to dictionary with per position bias. (default: )
--omit_AA_jsonl OMIT_AA_JSONL
Path to a dictionary which specifies which amino acids need to be omited from design at
specifi chain indices (default: )
--pssm_jsonl PSSM_JSONL
Path to a dictionary with pssm (default: )
--pssm_multi PSSM_MULTI
A value between [0.0, 1.0], 0.0 means do not use pssm, 1.0 ignore MPNN predictions (default:
0.0)
--pssm_threshold PSSM_THRESHOLD
A value between -inf + inf to restric per position AAs (default: 0.0)
--pssm_log_odds_flag PSSM_LOG_ODDS_FLAG
0 for False, 1 for True (default: 0)
--pssm_bias_flag PSSM_BIAS_FLAG
0 for False, 1 for True (default: 0)
--tied_positions_jsonl TIED_POSITIONS_JSONL
Path to a dictionary with tied positions (default: )
[user@cn0071 ~]$ git clone https://github.com/dauparas/ProteinMPNN
[user@cn0071 ~]$ protein_mpnn_run.py \
--pdb_path ProteinMPNN/inputs/PDB_complexes/pdbs/3HTN.pdb \
--out_folder ./3HTN_designs/ \
--num_seq_per_target 10 \
--sampling_temp "0.1" \
--seed 0 \
--batch_size 1 \
--model_name v_48_020
----------------------------------------
chain_id_jsonl is NOT loaded
----------------------------------------
fixed_positions_jsonl is NOT loaded
----------------------------------------
pssm_jsonl is NOT loaded
----------------------------------------
omit_AA_jsonl is NOT loaded
----------------------------------------
bias_AA_jsonl is NOT loaded
----------------------------------------
tied_positions_jsonl is NOT loaded
----------------------------------------
bias by residue dictionary is not loaded, or not provided
----------------------------------------
----------------------------------------
Number of edges: 48
Training noise level: 0.2A
Generating sequences for: 3HTN
10 sequences of length 429 generated in 9.4706 seconds
[user@cn0071 ~]$ dl_interface_design.py -h
┌──────────────────────────────────────────────────────────────────────────────┐
│ PyRosetta-4 │
│ Created in JHU by Sergey Lyskov and PyRosetta Team │
│ (C) Copyright Rosetta Commons Member Institutions │
│ │
│ NOTE: USE OF PyRosetta FOR COMMERCIAL PURPOSES REQUIRE PURCHASE OF A LICENSE │
│ See LICENSE.PyRosetta.md or email license@uw.edu for details │
└──────────────────────────────────────────────────────────────────────────────┘
PyRosetta-4 2024 [Rosetta PyRosetta4.Release.python39.linux 2024.39+release.59628fbc5bc09f1221e1642f1f8d157ce49b1410 2024-09-23T07:49:48] retrieved from: http://www.pyrosetta.org
usage: dl_interface_design.py [-h] [-pdbdir PDBDIR] [-silent SILENT] [-outpdbdir OUTPDBDIR] [-outsilent OUTSILENT]
[-runlist RUNLIST] [-checkpoint_name CHECKPOINT_NAME] [-debug]
[-relax_cycles RELAX_CYCLES] [-output_intermediates] [-seqs_per_struct SEQS_PER_STRUCT]
[-checkpoint_path CHECKPOINT_PATH] [-temperature TEMPERATURE]
[-augment_eps AUGMENT_EPS] [-protein_features PROTEIN_FEATURES] [-omit_AAs OMIT_AAS]
[-bias_AA_jsonl BIAS_AA_JSONL] [-num_connections NUM_CONNECTIONS]
optional arguments:
-h, --help show this help message and exit
-pdbdir PDBDIR The name of a directory of pdbs to run through the model
-silent SILENT The name of a silent file to run through the model
-outpdbdir OUTPDBDIR The directory to which the output PDB files will be written, used if the -pdbdir arg is
active
-outsilent OUTSILENT The name of the silent file to which output structs will be written, used if the -silent arg
is active
-runlist RUNLIST The path of a list of pdb tags to run, only active when the -pdbdir arg is active (default:
''; Run all PDBs)
-checkpoint_name CHECKPOINT_NAME
The name of a file where tags which have finished will be written (default: check.point)
-debug When active, errors will cause the script to crash and the error message to be printed out
(default: False)
-relax_cycles RELAX_CYCLES
The number of relax cycles to perform on each structure (default: 1)
-output_intermediates
Whether to write all intermediate sequences from the relax cycles to disk (default: False)
-seqs_per_struct SEQS_PER_STRUCT
The number of sequences to generate for each structure (default: 1)
-checkpoint_path CHECKPOINT_PATH
The path to the ProteinMPNN weights you wish to use, default
/opt/dl_binder_design/mpnn_fr/ProteinMPNN/vanilla_model_weights/v_48_020.pt
-temperature TEMPERATURE
The sampling temperature to use when running ProteinMPNN (default: 0.000001)
-augment_eps AUGMENT_EPS
The variance of random noise to add to the atomic coordinates (default 0)
-protein_features PROTEIN_FEATURES
What type of protein features to input to ProteinMPNN (default: full)
-omit_AAs OMIT_AAS A string of all residue types (one letter case-insensitive) that you would not like to use
for design. Letters not corresponding to residue types will be ignored (default: CX)
-bias_AA_jsonl BIAS_AA_JSONL
The path to a JSON file containing a dictionary mapping residue one-letter names to the bias
for that residue eg. {A: -1.1, F: 0.7} (default: ; no bias)
-num_connections NUM_CONNECTIONS
Number of neighbors each residue is connected to. Do not mess around with this argument
unless you have a specific set of ProteinMPNN weights which expects a different number of
connections. (default: 48)
[user@cn0071 ~]$ dl_interface_design.py -silent ./dl_binder_design/examples/inputs/proteinmpnn_output.silent
┌──────────────────────────────────────────────────────────────────────────────┐
│ PyRosetta-4 │
│ Created in JHU by Sergey Lyskov and PyRosetta Team │
│ (C) Copyright Rosetta Commons Member Institutions │
│ │
│ NOTE: USE OF PyRosetta FOR COMMERCIAL PURPOSES REQUIRE PURCHASE OF A LICENSE │
│ See LICENSE.PyRosetta.md or email license@uw.edu for details │
└──────────────────────────────────────────────────────────────────────────────┘
PyRosetta-4 2024 [Rosetta PyRosetta4.Release.python39.linux 2024.39+release.59628fbc5bc09f1221e1642f1f8d157ce49b1410 2024-09-23T07:49:48] retrieved from: http://www.pyrosetta.org
Found GPU will run ProteinMPNN on GPU
Attempting pose: design_ppi_0_dldesign_0
ProteinMPNN generated 1 sequences in 4 seconds
Running FastRelax
Completed one cycle of FastRelax in 44.01047921180725 seconds
ProteinMPNN generated 1 sequences in 0 seconds
Struct: design_ppi_0_dldesign_0 reported success in 48 seconds
Attempting pose: design_ppi_0_dldesign_1
ProteinMPNN generated 1 sequences in 0 seconds
Running FastRelax
Completed one cycle of FastRelax in 42.61138868331909 seconds
ProteinMPNN generated 1 sequences in 0 seconds
Struct: design_ppi_0_dldesign_1 reported success in 43 seconds
Attempting pose: design_ppi_0_dldesign_2
ProteinMPNN generated 1 sequences in 0 seconds
Running FastRelax
Completed one cycle of FastRelax in 45.7072536945343 seconds
ProteinMPNN generated 1 sequences in 0 seconds
...
Struct: design_ppi_9_dldesign_2 reported success in 81 seconds
Attempting pose: design_ppi_9_dldesign_3
ProteinMPNN generated 1 sequences in 0 seconds
Running FastRelax
Completed one cycle of FastRelax in 63.77409362792969 seconds
ProteinMPNN generated 1 sequences in 0 seconds
Struct: design_ppi_9_dldesign_3 reported success in 65 seconds
[user@cn0071 ~]$ af2_predict.py -h
┌──────────────────────────────────────────────────────────────────────────────┐
│ PyRosetta-4 │
│ Created in JHU by Sergey Lyskov and PyRosetta Team │
│ (C) Copyright Rosetta Commons Member Institutions │
│ │
│ NOTE: USE OF PyRosetta FOR COMMERCIAL PURPOSES REQUIRE PURCHASE OF A LICENSE │
│ See LICENSE.PyRosetta.md or email license@uw.edu for details │
└──────────────────────────────────────────────────────────────────────────────┘
PyRosetta-4 2024 [Rosetta PyRosetta4.Release.python311.linux 2024.39+release.59628fbc5bc09f1221e1642f1f8d157ce49b1410 2024-09-23T07:49:48] retrieved from: http://www.pyrosetta.org
usage: predict.py [-h] [-pdbdir PDBDIR] [-silent SILENT] [-outpdbdir OUTPDBDIR] [-outsilent OUTSILENT]
[-runlist RUNLIST] [-checkpoint_name CHECKPOINT_NAME] [-scorefilename SCOREFILENAME]
[-maintain_res_numbering] [-debug] [-max_amide_dist MAX_AMIDE_DIST] [-recycle RECYCLE]
[-no_initial_guess] [-force_monomer]
options:
-h, --help show this help message and exit
-pdbdir PDBDIR The name of a directory of pdbs to run through the model
-silent SILENT The name of a silent file to run through the model
-outpdbdir OUTPDBDIR The directory to which the output PDB files will be written. Only used when -pdbdir is active
-outsilent OUTSILENT The name of the silent file to which output structs will be written. Only used when -silent
is active
-runlist RUNLIST The path of a list of pdb tags to run. Only used when -pdbdir is active (default: ''; Run all
PDBs)
-checkpoint_name CHECKPOINT_NAME
The name of a file where tags which have finished will be written (default: check.point)
-scorefilename SCOREFILENAME
The name of a file where scores will be written (default: out.sc)
-maintain_res_numbering
When active, the model will not renumber the residues when bad inputs are encountered
(default: False)
-debug When active, errors will cause the script to crash and the error message to be printed out
(default: False)
-max_amide_dist MAX_AMIDE_DIST
The maximum distance between an amide bond's carbon and nitrogen (default: 3.0)
-recycle RECYCLE The number of AF2 recycles to perform (default: 3)
-no_initial_guess When active, the model will not use an initial guess (default: False)
-force_monomer When active, the model will predict the structure of a monomer (default: False)
[user@cn0071 ~]$ cp -r $PDB_DATA .
[user@cn0071 ~]$ mkdir -p output_dir
[user@cn0071 ~]$ af2_predict.py -pdbdir $PROTEINMPNN_DATA -outpdbdir output_dir
...
[user@cn0071 ~]$
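Before ending the session, the designs produced by the protein_mpnn_run.py run above can be tabulated. The sketch below assumes that the sequences were written to ./3HTN_designs/seqs/3HTN.fa and that each designed record has a header of the form ">T=0.1, sample=1, score=..., global_score=..., seq_recovery=...". Neither the path nor the header layout appears in the session output, so check both against the actual file. summarize_designs is a hypothetical helper name, not part of ProteinMPNN.

```shell
# summarize_designs FASTA
# Print one tab-separated line per designed sequence:
#   sample <TAB> score <TAB> seq_recovery <TAB> sequence
# Assumptions: designed records have "sample=" in the header, and each
# sequence occupies a single line. Records without "sample=" (such as the
# input sequence) are skipped.
summarize_designs() {
    awk '
        /^>/ {
            keep = 0
            if ($0 ~ /sample=/) {
                keep = 1
                sample = score = recovery = "NA"
                # Split "key=value, key=value, ..." after the leading ">"
                n = split(substr($0, 2), fields, /, */)
                for (i = 1; i <= n; i++) {
                    split(fields[i], kv, "=")
                    if (kv[1] == "sample") sample = kv[2]
                    else if (kv[1] == "score") score = kv[2]
                    else if (kv[1] == "seq_recovery") recovery = kv[2]
                }
            }
            next
        }
        keep { printf "%s\t%s\t%s\t%s\n", sample, score, recovery, $0 }
    ' "$1"
}
```

For example, "summarize_designs ./3HTN_designs/seqs/3HTN.fa | sort -k2,2n" would list the designs starting with the lowest score. Since the help text defines the score as -log_prob, a lower score corresponds to a higher model probability.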
End the interactive session:
[user@cn0071 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
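To design sequences for several structures, the single-PDB command from the sample session can be wrapped in a loop. The following is a minimal sketch, not a documented workflow: design_all and the DRY_RUN switch are hypothetical names introduced here, the per-structure output folders are an arbitrary layout choice, and the flags are copied unchanged from the 3HTN example above.

```shell
# design_all PDB_DIR OUT_DIR
# Run protein_mpnn_run.py once per *.pdb file in PDB_DIR, writing each
# result under OUT_DIR/<pdb basename>. With DRY_RUN=1 the commands are
# printed instead of executed. Variables are not function-local, to keep
# the sketch portable to plain POSIX sh.
design_all() {
    pdb_dir=$1
    out_dir=$2
    for pdb in "$pdb_dir"/*.pdb; do
        # If nothing matches, the glob stays literal; skip it
        [ -e "$pdb" ] || continue
        name=$(basename "$pdb" .pdb)
        # Build the command as positional parameters so quoting is preserved
        set -- protein_mpnn_run.py \
            --pdb_path "$pdb" \
            --out_folder "$out_dir/$name" \
            --num_seq_per_target 10 \
            --sampling_temp "0.1" \
            --seed 0 \
            --batch_size 1 \
            --model_name v_48_020
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    done
}
```

Inside a session or job where "module load ProteinMPNN" has already been run, a call such as "design_all ProteinMPNN/inputs/PDB_complexes/pdbs ./designs" would process every PDB file in the cloned example directory. Setting DRY_RUN=1 first prints the commands so they can be reviewed before any GPU time is spent.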