DanQ: a hybrid convolutional and recurrent deep
neural network for quantifying the function of DNA sequences
DanQ is a hybrid convolutional and bi-directional long short-term
memory recurrent neural network framework for predicting
non-coding function de novo from sequence.
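The network pairs a convolution stage, which scans the one-hot encoded sequence for local motifs, with a bi-directional LSTM stage that captures longer-range dependencies between them. Below is a minimal Keras sketch of that architecture, with layer sizes taken from the DanQ paper; the train.py shipped with this module may differ in details.

# Sketch of the DanQ architecture as described in Quang & Xie (2016);
# the train.py shipped with this module may differ in details.
from tensorflow.keras import layers, models

def build_danq(seq_len=1000, n_targets=919, n_filters=320, kernel_size=26):
    model = models.Sequential([
        layers.Input(shape=(seq_len, 4)),                  # one-hot DNA: A, C, G, T
        layers.Conv1D(n_filters, kernel_size, activation='relu'),  # motif scanner
        layers.MaxPooling1D(pool_size=13, strides=13),
        layers.Dropout(0.2),
        layers.Bidirectional(layers.LSTM(n_filters, return_sequences=True)),
        layers.Dropout(0.5),
        layers.Flatten(),
        layers.Dense(925, activation='relu'),
        layers.Dense(n_targets, activation='sigmoid'),     # one probability per target
    ])
    model.compile(optimizer='rmsprop', loss='binary_crossentropy')
    return model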
This application is being used as a biological example in class #2 of the course
"Deep Learning by Example on Biowulf".
References:
- Jian Zhou & Olga G Troyanskaya,
Predicting effects of noncoding variants with deep learning–based sequence model
Nature Methods, volume 12, pages 931–934 (2015)
- Daniel Quang and Xiaohui Xie,
DanQ: a hybrid convolutional and recurrent deep neural network
for quantifying the function of DNA sequences
Nucleic Acids Research, 2016, Vol. 44, No. 11 e107; doi: 10.1093/nar/gkw226
Documentation
Important Notes
- Module Name: DanQ (see the modules page for more information)
- Unusual environment variables set
- DANQ_HOME DanQ installation directory
- DANQ_BIN DanQ executable directory
- DANQ_SRC DanQ source code
- DANQ_DATA DanQ data folder
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=96g --gres=gpu:v100x,lscratch:10 --cpus-per-task=14
[user@cn4466 ~]$ module load DanQ
[+] Loading cuDNN/8.0.3/CUDA-11.0 libraries...
[+] Loading CUDA Toolkit 11.0.3 ...
[+] Unloading meme 5.1.0 on cn2358
[+] Loading openmpi 2.1.1 for GCC 4.8.5
[+] Loading weblogo 3.6
[+] Loading DanQ 20210407 ...
(A lower-case name, danq, for the module will also work.)
The code comprises three executables: train.py, predict.py and visualize.py. To display a usage message for a given executable, type its name followed by the option "-h". For example:
[user@cn4466 ~]$ train.py -h
usage: train.py [-h] [-b batch_size] -d data_folder [-e num_epochs]
                [-f start_filters] [-g num_gpus] [-k kernel_size]
                [-l learning_rate] [-m model_name] [-M] [-n debug_size]
                [-o test_results] [-O optimizer] [-s] [-v] [-w]

optional arguments:
  -h, --help            show this help message and exit
  -b batch_size, --bs batch_size
                        batch size; default=500
  -e num_epochs, --num_epochs num_epochs
                        number of epochs; default=60
  -f start_filters, --start_filters start_filters
                        number of filters used in the (1st) convolution layer; default=320
  -g num_gpus, --num_gpus num_gpus
                        number of gpus to use; default=1
  -k kernel_size, --kernel_size kernel_size
                        conv. kernel size; default=26 for DanQ and =8 for DeepSEA model
  -l learning_rate, --lr learning_rate
                        learning rate; default=1.e-4
  -m model_name, --model_name model_name
                        model name: DanQ | DeepSEA
  -M, --motif_sequences
                        predict motif sequences, rather than target labels
  -n debug_size, --num_debug_data debug_size
                        number of training examples to use when debugging
  -o test_results, --output test_results
                        output file with test results; default='test_results.h5'
  -O optimizer, --optimizer optimizer
                        optimizer: adam | rmsprop | sgd
  -s, --schedule        vary the learning rate according to a schedule
  -v, --verbose         increase the verbosity level of output
  -w, --load_weights    read weights from a checkpoint file

required arguments:
  -d data_folder, --data_folder data_folder
                        path to the data folder
This code comprises two network models: DanQ and DeepSEA. To train either model, first download the available sample data:
[user@cn4466 ~]$ cp -r $DANQ_DATA/* .
This command will copy to your current directory:
- folder "data" that contains sample input data for DanQ,
- folder "checkpoints" that stores the pre-computed checkpoint files, and
- folder "predictions" with sample predicted results.
The following command will train the (default) DanQ model on the (default) data stored in the MAT file data/train.mat:
[user@cn4466 ~]$ train.py -d data
...
Epoch 1/60
4400000/4400000 [==============================] - 3248s 738us/step - loss: 0.0759 - val_loss: 0.0629
Epoch 2/60
4400000/4400000 [==============================] - 3312s 753us/step - loss: 0.0677 - val_loss: 0.0574
Epoch 3/60
4400000/4400000 [==============================] - 3253s 739us/step - loss: 0.0652 - val_loss: 0.0560
Epoch 4/60
4400000/4400000 [==============================] - 3142s 714us/step - loss: 0.0635 - val_loss: 0.0547
Epoch 5/60
4400000/4400000 [==============================] - 3134s 712us/step - loss: 0.0624 - val_loss: 0.0542
Epoch 6/60
4400000/4400000 [==============================] - 3136s 713us/step - loss: 0.0616 - val_loss: 0.0545
...
The training takes approximately 50 minutes per epoch when a single V100 GPU is used. The result of the training (i.e. a checkpoint file) will be stored in the folder "checkpoints", in HDF5 format. The name of the checkpoint file will be
checkpoints/<out_prefix>.<model_name>.h5, or, in this particular case, checkpoints/danq.DanQ.h5.
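The sample MAT files are in MATLAB v7.3 format, i.e. HDF5 containers, so they can be inspected with h5py before committing to a long training run. A minimal sketch; the dataset names trainxdata and traindata follow the original DeepSEA/DanQ data release and are an assumption about this copy:

# Peek at the training data; MAT v7.3 files are HDF5 under the hood.
# The dataset names below follow the original DeepSEA/DanQ release and
# may differ here -- print f.keys() first if unsure.
import h5py

with h5py.File('data/train.mat', 'r') as f:
    print(list(f.keys()))          # e.g. ['traindata', 'trainxdata']
    x = f['trainxdata']            # one-hot encoded sequences
    y = f['traindata']             # binary labels for the 919 targets
    print('sequences:', x.shape, 'labels:', y.shape)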
To train another model on the same data, specify the model name with the command line option -m. For example:
[user@cn4466 ~]$ train.py -d data -m DeepSEA
...
The command line options for the prediction code are:
[user@cn4466 ~]$ predict.py -h
Using TensorFlow backend.
usage: predict.py [-h] -d data_folder [-f start_filters] [-k kernel_size]
                  [-m model_name] [-M] [-o test_results]

optional arguments:
  -h, --help            show this help message and exit
  -f start_filters, --start_filters start_filters
                        number of filters used in the (1st) convolution layer; default=320
  -k kernel_size, --kernel_size kernel_size
                        conv. kernel size; default=26 for DanQ and =8 for DeepSEA model
  -m model_name, --model_name model_name
                        model name: DanQ | DeepSEA
  -M, --motif_sequences
                        predict motif sequences, rather than target labels
  -o test_results, --output test_results
                        output file with test results; default='test_results.h5'

required arguments:
  -d data_folder, --data_folder data_folder
                        path to the data folder
To make predictions of the target labels with the DanQ model on the testing data stored in the MAT file data/test.mat, using a pre-trained checkpoint file, type:
[user@cn4466 ~]$ predict.py -d data
...
455024/455024 [==============================] - 982s 2ms/step
With this command, predict.py will output the predicted results in the file test_results.h5.
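Since test_results.h5 is an HDF5 file, the predictions can be post-processed directly, for instance to compute per-target AUCs. A hedged sketch; the dataset names inside test_results.h5 and data/test.mat are assumptions, and arrays written from MATLAB may load transposed:

# Hypothetical post-processing of the predict.py output; dataset names
# are assumptions -- print the keys first. MAT v7.3 arrays may load
# transposed via h5py, hence the orientation check.
import h5py
import numpy as np
from sklearn.metrics import roc_auc_score

with h5py.File('test_results.h5', 'r') as f:
    print(list(f.keys()))
    pred = np.asarray(f[list(f.keys())[0]])   # predicted probabilities
with h5py.File('data/test.mat', 'r') as f:
    true = np.asarray(f['testdata'])          # true binary labels
if true.shape[0] != pred.shape[0]:            # align orientation if transposed
    true = true.T

print('target 1 AUC:', roc_auc_score(true[:, 0], pred[:, 0]))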
In order to visualize the predicted ROC curve for any one of the 919 available targets, run the executable visualize.py,
[user@cn4466 ~]$ visualize.py -h
Using TensorFlow backend.
usage: visualize.py [-h] [-f start_filters] [-M] -t target_id

optional arguments:
  -h, --help            show this help message and exit
  -f start_filters, --start_filters start_filters
                        number of filters used in the (1st) convolution layer; default=320
  -M, --motif_sequence  visualize a motif sequence, rather than a ROC curve

required arguments:
  -t target_id, --target target_id
                        integer in the interval [1,919] for visualizing a ROC curve
                        and in [1,start_filters] for visualizing a motif sequence
together with the option -t to specify a particular target. For example:
[user@cn4466 ~]$ visualize.py -t 1

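For reference, plotting such a ROC curve by hand takes only a few lines: the curve for target t compares column t of the predictions against the same column of the true test labels. A sketch along the lines of the snippet above, with the same hedged dataset-name assumptions; this is not necessarily how visualize.py implements it:

# Roughly what a ROC plot for one target involves; dataset names are
# assumptions, as above.
import h5py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

with h5py.File('test_results.h5', 'r') as f:
    pred = np.asarray(f[list(f.keys())[0]])
with h5py.File('data/test.mat', 'r') as f:
    true = np.asarray(f['testdata'])
if true.shape[0] != pred.shape[0]:
    true = true.T

target = 1                                    # 1-based, as with the -t option
fpr, tpr, _ = roc_curve(true[:, target - 1], pred[:, target - 1])
plt.plot(fpr, tpr, label='target %d (AUC = %.3f)' % (target, auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], 'k--')               # chance diagonal
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.savefig('roc_target_%d.png' % target)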
To make predictions of motif sequences, first run predict.py with the option -M:
[user@cn4466 ~]$ predict.py -M -d data
This command will produce a file motifs.txt. Then visualize a particular motif by providing its id with the -t option. For example:
[user@cn4466 ~]$ visualize.py -M -t 43

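Visualizing a motif amounts to turning a set of aligned subsequences into a position frequency matrix, which a logo tool such as the bundled weblogo can then render. The sketch below illustrates that core step only; the actual layout of motifs.txt may differ from the one-sequence-per-line assumption made here:

# Illustration of the core step behind motif logos: a position
# frequency matrix from equal-length aligned sequences. The real
# motifs.txt layout may differ from this assumption.
import numpy as np

BASES = 'ACGT'

def position_frequency_matrix(seqs):
    """Column-wise base frequencies across equal-length sequences."""
    counts = np.zeros((len(seqs[0]), len(BASES)))
    for s in seqs:
        for i, b in enumerate(s):
            if b in BASES:
                counts[i, BASES.index(b)] += 1
    return counts / counts.sum(axis=1, keepdims=True)

print(position_frequency_matrix(['ACGT', 'ACGA', 'ACGT']))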
In order to train the DanQ code using multiple GPUs,
- allocate a session with an appropriate number of GPUs (you are allowed to use up to 4 GPUs per session),
- specify through the command line option -g how many GPUs you want to use, and
- specify a batch size that is a multiple of the number of GPUs you will be using (the sketch after the example below shows why).
For example:
[user@cn4471 ~]$ exit
[user@biowulf ~]$ sinteractive --mem=64g --gres=gpu:v100:4,lscratch:100 --cpus-per-task=14
[user@cn4471 ~]$ module load danq
[user@cn4471 ~]$ cp -r $DANQ_DATA/* .
[user@cn4471 ~]$ train.py -d data -g 4 -b 2000
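The batch-size rule exists because data-parallel training splits each global batch evenly across the GPUs, so each replica gets an equal share. A sketch of the idea using tf.distribute.MirroredStrategy; whether train.py uses this exact mechanism is an assumption, and the tiny model here is for illustration only:

# Why -b should be a multiple of -g: each replica (GPU) receives an
# equal slice of the global batch. MirroredStrategy is used here for
# illustration; train.py may implement multi-GPU support differently.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()       # one replica per visible GPU
per_gpu_batch = 500
global_batch = per_gpu_batch * strategy.num_replicas_in_sync  # e.g. 2000 for 4 GPUs

with strategy.scope():                            # variables mirrored on every GPU
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1000, 4)),
        tf.keras.layers.Conv1D(320, 26, activation='relu'),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(919, activation='sigmoid'),
    ])
    model.compile(optimizer='rmsprop', loss='binary_crossentropy')
# model.fit(x, y, batch_size=global_batch, ...) then trains on all GPUs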
End the interactive session:
[user@cn4466 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. danq.sh). For example:
#!/bin/bash
module load DanQ
cp -r $DANQ_DATA/* .
train.py -d data
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] danq.sh