MPSAN-MP: Message Passing and Self-Attention-based Networks for Molecular Property Prediction
The MPSAN_MP application employs neural networks implementing the Message Passing or Self-Attention
mechanism for molecular property prediction. It takes as input a set of molecules represented by
either graphs or SMILES strings. For each molecule, it predicts a property value (PV), which may be either
a discrete/binary label (classification task) or a continuous value (regression task).
This application is used as a biological example in class #7 of the course
"Deep Learning by Example on Biowulf".
References:
- Alexander Kensert,
  Message-passing neural network (MPNN) for molecular property prediction
  https://keras.io/examples/graph/mpnn-molecular-graphs/
- Esben Jannik Bjerrum,
  SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules
  arXiv:1703.07076 [cs.LG]
- Sheng Wang, Yuzhi Guo, Yuhong Wang, Hongmao Sun and Junzhou Huang,
  SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction
  BCB '19: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pages 429-436
- Xiao-Chen Zhang, Cheng-Kun Wu, Zhi-Jiang Yang, Zhen-Xing Wu, Jia-Cai Yi, Chang-Yu Hsieh, Ting-Jun Hou and Dong-Sheng Cao,
  MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction
  Briefings in Bioinformatics, 22(6), 2021, 1-14
Documentation
- MPNN on Keras website
- SMILES-enumeration on GitHub
- SMILES-BERT on GitHub
- Molecular-graph-BERT on GitHub
Important Notes
- Module Name: MPSAN_MP (see the modules page for more information)
- Unusual environment variables set
- MPSAN_MP_HOME installation directory
- MPSAN_MP_BIN executable directory
- MPSAN_MP_SRC source code directory
- MPSAN_MP_DATA data folder
- MPSAN_MP_CHECKPOINTS checkpoints folder
- MPSAN_MP_RESULTS results folder
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=64g --gres=gpu:a100:1,lscratch:100 -c24
[user@cn2893 ~]$ module load mpsan_mp
[+] Loading singularity 4.0.1 on cn2893
[+] Loading mpsan_mp 20240817
(a lower-case name, mpsan_mp, will also work).
The key challenge in predicting the property values (PVs) of drug molecules, such as their biological activity or solubility, is the limited amount of training data with ground-truth labels/PVs, since measuring the PVs is a laborious and expensive procedure. As a consequence, supervised training of full-scale models on the available data may result in over-fitting, which prevents the model from generalizing, i.e. making accurate predictions on data that were not part of the training dataset. The MPSAN_MP application explores two practical approaches to addressing the challenge of over-fitting:
(1) increasing the effective size of a training dataset through augmentation/enumeration of the input SMILES strings (see the enumeration sketch after this list); and
(2) decreasing the effective number of adjustable parameters in the model by exploiting the concept of transfer learning.
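SMILES enumeration exploits the fact that the same molecule can be written as many different, equally valid SMILES strings. The sketch below illustrates the idea with RDKit; it is not the MPSAN_MP implementation (which follows the Bjerrum reference above), and the aspirin SMILES string and the property value are used only as an example:

```python
from rdkit import Chem

def enumerate_smiles(smiles, n=10):
    """Return up to n distinct random SMILES spellings of the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    # doRandom=True makes RDKit traverse the atoms in a random order,
    # producing a different but equivalent SMILES string each time.
    variants = {Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(n)}
    return sorted(variants)

# Augment a single (SMILES, property-value) pair roughly 10-fold,
# keeping the property value unchanged for every enumerated spelling.
for s in enumerate_smiles("CC(=O)Oc1ccccc1C(=O)O", n=10):   # aspirin, an arbitrary example
    print(s, 1.19)                                          # 1.19 is a made-up property value
```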
The data used by this application are an extended version of the data used by the ReLeaSE application, discussed in Deep Learning class #5:
- jak2 dataset, comprising ~2K SMILES strings, each with a continuous property value: the Janus protein kinase 2 (JAK2) inhibition coefficient;
- logp dataset, comprising ~14K SMILES strings representing biologically active molecules, each with a corresponding (continuous) property value: the n-octanol-water partition coefficient, logP, a measure of the lipophilicity of a molecule;
- bbbp dataset, comprising ~2K SMILES strings, each with a (discrete) binary label indicating whether or not the molecule can penetrate the blood-brain barrier (BBB) membrane;
- blogp dataset of size ~14K, a "binarized" version of the logp dataset, comprising the same SMILES strings with binary labels: 1 if the logP property value is above the threshold of 1.88, and 0 otherwise (see the thresholding sketch after this list);
- chembl dataset, comprising >2.4M SMILES strings from the ChEMBL-33 database, representing molecules with drug-like properties; and
- zinc dataset, a random subset of 25M SMILES strings out of the several hundred million in the ZINC database.
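For the blogp dataset, binarization amounts to simple thresholding of the logP values at 1.88. A minimal pandas sketch of the idea follows; the column names smiles and logp are assumptions made for illustration, and the actual layout of the files handled by preprocess.py may differ:

```python
import pandas as pd

THRESHOLD = 1.88   # logP threshold used to turn the logp dataset into blogp

# Hypothetical column names; adjust to the real layout of the CSV file.
df = pd.read_csv("logp_natoms,ntokens.csv")
df["label"] = (df["logp"] > THRESHOLD).astype(int)   # 1 if logP above threshold, 0 otherwise
df[["smiles", "label"]].to_csv("blogp_natoms,ntokens.csv", index=False)
```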
After adaptation/reimplementation in Keras, the code comprises four executables:
- preprocess.py
- with appropriate command line arguments, can operate in three modes: reformat, augment and binarize;
- in the reformat mode, converts any of the datasets listed above, except blogp, from its original format available on the Web to the standard format accepted by the MPSAN_MP application;
- in the augment mode, enumerates the SMILES string of each entry in the reformatted dataset, keeping its property value unchanged, thus increasing the size of the dataset approximately by the factor specified by the augmentation_coefficient (default = 10); and
- in the binarize mode, transforms the reformatted logp dataset into the blogp dataset.
- train.py
- depending on the model and data used, can operate in three modes: training, pretraining and finetuning
- in the training mode, employs either the mpn or the san model and takes as input data of type jak2, jak2a, logp, logpa, bbbp, bbbpa, blogp or blogpa, stored in CSV format, i.e. *.csv, and comprising SMILES strings together with the corresponding property values; outputs a checkpoint file in the HDF5 format, i.e. *.h5;
- in the pretraining mode, employs the san_bert model and takes as input either chembl or zinc data stored in a CSV file, i.e. *.csv, and comprising SMILES strings (without property values); outputs checkpoint files in the HDF5 format, i.e. *.h5;
- in the finetuning mode, employs the san_bert model and takes as input (1) one of the checkpoint files in HDF5 format produced at the pretraining step, and (2) data in CSV format comprising SMILES strings together with the corresponding property values; outputs a checkpoint file in the HDF5 format, i.e. *.h5 (a minimal transfer-learning sketch is shown after this list of executables)
- predict.py
- takes as input (1) the checkpoint file(s) in HDF5 format produced either at the training step or at the pretraining and finetuning steps, and (2) testing data in CSV format comprising SMILES strings together with the corresponding property values; outputs a results file in the TSV format
- visualize.R
- takes as input results files in the TSV format
- outputs an image file in PNG format, i.e. *.png
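The finetuning mode of train.py is an instance of transfer learning: an encoder pretrained on unlabelled SMILES is reused, largely frozen, and only a small task-specific head is trained on the labelled data. The Keras sketch below illustrates this idea only; the real MPSAN_MP models are defined in $MPSAN_MP_SRC, and all layer names, sizes and the commented-out file names here are assumptions:

```python
import keras
from keras import layers

# Toy stand-in for the pretrained BERT-style SMILES encoder (shapes are made up).
inputs = keras.Input(shape=(64,), dtype="int32", name="tokens")
x = layers.Embedding(input_dim=100, output_dim=128)(inputs)
x = layers.GlobalAveragePooling1D()(x)
encoder = keras.Model(inputs, x, name="encoder")

# encoder.load_weights("checkpoints/bert_encoder.chembl.san_bert.weights.h5")  # pretrained weights
encoder.trainable = False          # freeze the pretrained part (transfer learning)

# Small trainable head for a downstream binary task such as bbbp.
h = layers.Dense(64, activation="relu")(encoder.output)
outputs = layers.Dense(1, activation="sigmoid")(h)
model = keras.Model(encoder.input, outputs, name="finetuned_model")

model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_tokens, train_labels, epochs=10, batch_size=256)
```

Freezing the encoder drastically reduces the number of adjustable parameters, which is what makes training on a few thousand labelled SMILES feasible without severe over-fitting.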
To display a usage message of a python executable, type its name followed by the option "-h". For example:
[user@cn2893 ~]$ preprocess.py -h
usage: preprocess.py [-h] [-a] [-A augm_coeff] [-b] -d data_name -i input_file -o output_file [-r]

optional arguments:
  -h, --help                                  show this help message and exit
  -a, --augment                               augment the data by enumerating SMILES
  -A augm_coeff, --augm_coeff augm_coeff      augmentation_coefficient; fault = 10
  -b, --binarize                              binarize the logp data
  -r, --reformat                              reformat the data to make them acceptable by MPTN_MP software

required arguments:
  -d data_name, --data_name data_name         Data type: jak2 | logp | bbbp | chembl | zinc
  -i input_file, --input_file input_file      input file; fault = None
  -o output_file, --output_file output_file   output file
The executable expects that the input file has been downloaded and/or is available in the CSV format.
Examples of usage:
- to reformat the chembl dataset:
[user@cn2893 ~]$ wget https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_33/chembl_33_chemreps.txt.gz
[user@cn2893 ~]$ gunzip chembl_33_chemreps.txt.gz
[user@cn2893 ~]$ preprocess.py -r -d chembl -i chembl_33_chemreps.txt -o chembl_natoms,ntokens.csv
- to augment the jak2 data:
[user@cn2893 ~]$ cp $MPSAN_MP_DATA/jak2_natoms,ntokens.csv .
[user@cn2893 ~]$ preprocess.py -a -d jak2 -i jak2_natoms,ntokens.csv -o jak2a_natoms,ntokens.csv
- to binarize the logp data:
[user@cn2893 ~]$ cp $MPSAN_MP_DATA/logp_natoms,ntokens.csv .
[user@cn2893 ~]$ preprocess.py -b -d logp -i logp_natoms,ntokens.csv -o blogp_natoms,ntokens.csv
Before using other executables, create a local folder "data" and copy one or more original data files into that folder:
[user@cn2893 ~]$ mkdir -p data
[user@cn2893 ~]$ cp $MPSAN_MP_DATA/* data
The mode of usage of the executable train.py depends on the (required) data and model names passed to it through command line options:
[user@cn2893 ~]$ train.py -h
Using TensorFlow backend
usage: train.py [-h] [-A num_heads] [-B num_ln_layers] [-D dropout_rate] [--data_path data_path]
                [--debug DEBUG] [--embed_dim EMBED_DIM] [--frac_valid FRAC_VALID] [--frac_test FRAC_TEST]
                [-l learning_rate] [-i input_checkpoint_file] [-I encoder_checkpoint_file] [--LF L]
                [-E num_encoder_layers] [--num_ft_layers num_ft_layers] [--num_t_layers num_t_layers]
                -m model_name [-M M] [--max_length MAX_LENGTH] [--mp_iterations mp_iterations] [--NF N]
                [-e num_epochs] [-O optimizer] [-p pdata_nane] [-P POSITION_EMBEDDING] [-T] [-t TOKENIZER]
                [--trainable_encoder] [-U num_dense_units] [-v] [-w] [-b batch_size]
                [--best_accuracy best_accuracy] [--best_loss best_loss] -d DATA_NAME [-n NUM_DATA]
                [--es_patience patience] [-S smiles]

optional arguments:
  -h, --help                                  show this help message and exit
  -A num_heads, --num_heads num_heads         number of attention heads; default=4
  -B num_ln_layers, --num_ln_layers num_ln_layers
                                              number of LayerNormalization layers; max = 2; default=2
  -D dropout_rate, --dropout_rate dropout_rate
                                              dropout_rate; default = 0.1
  --data_path data_path                       path to data file (overrides the option -d
  --debug DEBUG                               use a small subset of data for training
  --embed_dim EMBED_DIM                       the dimension of the embedding space; default: 1024
  --frac_valid FRAC_VALID                     fraction of data to be used for validation; default=0.05
  --frac_test FRAC_TEST                       fraction of data to be used for testing; default=0.15
  -l learning_rate, --lr learning_rate        learning rate; default=1.e-4
  -i input_checkpoint_file, --input_checkpoint_file input_checkpoint_file
                                              input checkpoint file; fault = None
  -I encoder_checkpoint_file, --input_encoder checkpoint_file encoder_checkpoint_file
                                              input encoder checkpoint file; fault = None
  --LF L                                      num of link/edge features in synthetic data; default=2
  -E num_encoder_layers, --num_enc_layers num_encoder_layers
                                              number of layers in BERT Encoder; default=6
  --num_ft_layers num_ft_layers               number of dense layers in finetuner: 1, 2 or 3; default=1
  --num_t_layers num_t_layers                 number of transformer layers in the san model; default=1
  -M M                                        num of nodes/atoms in synthetic data; default=5
  --max_length MAX_LENGTH                     max # tokens of input SMILES to be used for training; default=64
  --mp_iterations mp_iterations               # of MP iterations; default=4
  --NF N                                      num of node/atom features in synthetic data; default=3
  -e num_epochs, --num_epochs num_epochs      # of epochs; default=40 if data_name=="chembl", =3000 otherwise
  -O optimizer, --optimizer optimizer         optimizer: adam | rmsprop
  -p pdata_nane, --pretraining_data_name pdata_nane
                                              name of the data trhat were used at pretraining step
  -P POSITION_EMBEDDING, --position_embedding POSITION_EMBEDDING
                                              0=none(default);1=regular,2=sine;3=regular_concat;4=sine_concat
  -T, --transpose                             transpose/permute data tensor after embedding
  -t TOKENIZER, --tokenizer TOKENIZER         tokenizer used by san-based model: smiles_pe or atom_symbols; default=smiles_pe
  --trainable_encoder                         Encoder is trainable even if not data=chembl; default=False
  -U num_dense_units, --num_dense_units num_dense_units
                                              number of dense units; default=1024
  -v, --verbose                               increase the verbosity level of output
  -w, --load_weights                          read weights from a checkpoint file
  -b batch_size, --bs batch_size              batch size; default=256
  --best_accuracy best_accuracy               initial value of best_accuracy; default = 0
  --best_loss best_loss                       initial value of best_loss; default = 10000
  -n NUM_DATA, --num_data NUM_DATA            number of input SMILES strings to be used; default=all
  --es_patience patience                      patience for early stopping of training; default=0 (no early stopping)
  -S smiles, --smiles smiles                  (Comma-separated list of) smiles to be used as input for debugging

required arguments:
  -m model_name, --model_name model_name      model name: mpn | san | san_bert; default = mpn

required arguments:
  -d DATA_NAME, --data_name DATA_NAME         Data type: jak2(a) | (b)logp(a) | bbbp(a) | chembl | zinc
Examples of usage of the train.py executable:
- to train the mpn model on jak2 data under default values of all hyperparameters:
[user@cn2893 ~]$ train.py -m mpn -d jak2                 # training mode
- a similar command to train the san model on the augmented logp data:
[user@cn2893 ~]$ train.py -m san -d logpa                # training mode
...
- to pretrain the san_bert model on the chembl data:
[user@cn2893 ~]$ train.py -m san_bert -d chembl          # pretraining mode
- to finetune the san_bert model on the bbbp data, using the source model pretrained on the zinc data:
[user@cn2893 ~]$ train.py -d bbbp -m san_bert -p zinc \  # finetuning mode
    -i checkpoints/san_bert.zinc.bbbpa.weights.h5 \
    -I checkpoints/bert_encoder.zinc.san_bert.weights.h5
...
Correct= 243 / 268 %error= 9.328358208955223
- a similar command, but using chembl as the source dataset:
[user@cn2893 ~]$ train.py -d bbbp -m san_bert -p chembl \  # finetuning mode
    -i checkpoints/san_bert.chembl.bbbpa.weights.h5 \
    -I checkpoints/bert_encoder.chembl.san_bert.weights.h5
...
Correct= 251 / 268 %error= 6.343283582089554
The command line options for the executable predict.py are similar to, and should be consistent with, the options that were passed to the training command:
[user@cn2893 ~]$ predict.py -h
Using TensorFlow backend
usage: predict.py [-h] [-A num_heads] [-B num_ln_layers] [-D dropout_rate] [--data_path data_path]
                  [--debug DEBUG] [--embed_dim EMBED_DIM] [--frac_valid FRAC_VALID] [--frac_test FRAC_TEST]
                  [-l learning_rate] [-i input_checkpoint_file] [-I encoder_checkpoint_file] [--LF L]
                  [-E num_encoder_layers] [--num_ft_layers num_ft_layers] [--num_t_layers num_t_layers]
                  -m model_name [-M M] [--max_length MAX_LENGTH] [--mp_iterations mp_iterations] [--NF N]
                  [-e num_epochs] [-O optimizer] [-p pdata_nane] [-P POSITION_EMBEDDING] [-T] [-t TOKENIZER]
                  [--trainable_encoder] [-U num_dense_units] [-v] [-w] [-b batch_size] -d DATA_NAME

optional arguments:
  -h, --help                                  show this help message and exit
  -A num_heads, --num_heads num_heads         number of attention heads; default=4
  -B num_ln_layers, --num_ln_layers num_ln_layers
                                              number of LayerNormalization layers; max = 2; default=2
  -D dropout_rate, --dropout_rate dropout_rate
                                              dropout_rate; default = 0.1
  --data_path data_path                       path to data file (overrides the option -d
  --debug DEBUG                               use a small subset of data for training
  --embed_dim EMBED_DIM                       the dimension of the embedding space; default: 1024
  --frac_valid FRAC_VALID                     fraction of data to be used for validation; default=0.05
  --frac_test FRAC_TEST                       fraction of data to be used for testing; default=0.15
  -l learning_rate, --lr learning_rate        learning rate; default=1.e-4
  -i input_checkpoint_file, --input_checkpoint_file input_checkpoint_file
                                              input checkpoint file; fault = None
  -I encoder_checkpoint_file, --input_encoder checkpoint_file encoder_checkpoint_file
                                              input encoder checkpoint file; fault = None
  --LF L                                      num of link/edge features in synthetic data; default=2
  -E num_encoder_layers, --num_enc_layers num_encoder_layers
                                              number of layers in BERT Encoder; default=6
  --num_ft_layers num_ft_layers               number of dense layers in finetuner: 1, 2 or 3; default=1
  --num_t_layers num_t_layers                 number of transformer layers in the san model; default=1
  -M M                                        num of nodes/atoms in synthetic data; default=5
  --max_length MAX_LENGTH                     max # tokens of input SMILES to be used for training; default=64
  --mp_iterations mp_iterations               # of MP iterations; default=4
  --NF N                                      num of node/atom features in synthetic data; default=3
  -e num_epochs, --num_epochs num_epochs      # of epochs; default=40 if data_name=="chembl", =3000 otherwise
  -O optimizer, --optimizer optimizer         optimizer: adam | rmsprop
  -p pdata_nane, --pretraining_data_name pdata_nane
                                              name of the data trhat were used at pretraining step
  -P POSITION_EMBEDDING, --position_embedding POSITION_EMBEDDING
                                              0=none(default);1=regular,2=sine;3=regular_concat;4=sine_concat
  -T, --transpose                             transpose/permute data tensor after embedding
  -t TOKENIZER, --tokenizer TOKENIZER         tokenizer used by san-based model: smiles_pe or atom_symbols; default=smiles_pe
  --trainable_encoder                         Encoder is trainable even if not data=chembl; default=False
  -U num_dense_units, --num_dense_units num_dense_units
                                              number of dense units; default=1024
  -v, --verbose                               increase the verbosity level of output
  -w, --load_weights                          read weights from a checkpoint file
  -b batch_size, --bs batch_size              batch size; default=256

required arguments:
  -m model_name, --model_name model_name      model name: mpn | san | san_bert; default = mpn

required arguments:
  -d DATA_NAME, --data_name DATA_NAME         Data type: jak2(a) | (b)logp(a) | bbbp(a) | chembl | zinc
For example,
- the command to perform predictions with the trained mpn model on the jak2 testing data, using the weights stored in the checkpoint file checkpoints/mpn.jak2.weights.h5:
[user@cn2893 ~]$ predict.py -m mpn -d jak2 -i checkpoints/mpn.jak2.weights.h5
...
The results will be output to the file results/mpn.jak2.results.tsv:
[user@cn2893 ~]$ cat results/mpn.jak2.results.tsv
# pred_jak2   corr_jak2
5.705802      7.080000
7.524415      6.460000
8.072280      7.050000
7.216453      8.220000
8.825060      9.260000
6.521371      7.600000
6.853940      6.820000
8.652765      8.050000
6.496820      7.460000
...
7.646416      7.600000
7.397765      7.820000
7.131140      7.270000
6.656110      8.150000
7.219415      7.600000
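Apart from visualize.R (below), such a results file can be summarized numerically with a few lines of Python; the snippet assumes the two whitespace-separated columns shown in the listing above:

```python
import pandas as pd

# Read the predicted and ground-truth property values produced by predict.py.
df = pd.read_csv("results/mpn.jak2.results.tsv", sep=r"\s+",
                 skiprows=1, names=["pred", "true"])

rmse = ((df["pred"] - df["true"]) ** 2).mean() ** 0.5   # root-mean-square error
r = df["pred"].corr(df["true"])                         # Pearson correlation
print(f"RMSE = {rmse:.3f}   Pearson r = {r:.3f}")
```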
Finally, in order to visualize the summary provided in the results file, run the executable visualize.R. To display the usage message of this executable, type its name:
[user@cn2893 ~]$ module load R
[user@cn2893 ~]$ visualize.R
Usage: visualize.R <results_file> [ <other_results_file(s)> ]
Examples of usage of this executable:
[user@cn2893 ~]$ visualize.R results/san.jak2.results.tsv results/mpn.jak2.results.tsv
[user@cn2893 ~]$ visualize.R results/san.jak2a.results.tsv results/mpn.jak2a.results.tsv


Since SMILES enumeration / data augmentation does not modify the underlying molecular graph (and hence its adjacency matrix), for the MPN model it only increases the effective size of the training dataset, thus preventing the model from overfitting and improving the accuracy of PV predictions. On the other hand, the SAN model, which handles a SMILES string as a sequence and computes attention scores between different positions in the input, may be confused by the reshuffling of token/atom positions produced by SMILES enumeration. This prevents the SAN model from taking full advantage of the increased data size.
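That the molecular graph itself is invariant under SMILES enumeration can be checked directly with RDKit: re-canonicalizing any enumerated spelling recovers the same canonical SMILES, i.e. the same molecule. This is a standalone check, not part of the MPSAN_MP code, and the aspirin SMILES is again just an example:

```python
from rdkit import Chem

canonical = Chem.MolToSmiles(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
mol = Chem.MolFromSmiles(canonical)

for _ in range(5):
    random_smiles = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
    # The string (token order) changes, but the molecule -- and hence the graph
    # seen by the MPN model -- stays the same; only the SAN model "sees" new inputs.
    assert Chem.MolToSmiles(Chem.MolFromSmiles(random_smiles)) == canonical
    print(random_smiles)
```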
Results for logp (original) and logpa (augmented) data can be visualized by running similar commands:
[user@cn2893 ~]$ visualize.R results/san.logp.results.tsv results/mpn.logp.results.tsv
[user@cn2893 ~]$ visualize.R results/san.logpa.results.tsv results/mpn.logpa.results.tsv


[user@cn2893 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$