Biowulf High Performance Computing at the NIH
DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences

DanQ is a hybrid convolutional and bi-directional long short-term memory recurrent neural network framework for predicting non-coding function de novo from sequence.
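For orientation, the lines below sketch the DanQ architecture in Keras (the framework loaded by this module), following the description in the original DanQ publication: a convolutional layer that scans the one-hot encoded sequence, followed by a bi-directional LSTM and two dense layers. The layer sizes here are taken from the paper and are assumptions; they may differ from the defaults built into danq_train.py.

# Minimal Keras sketch of the DanQ architecture; layer sizes follow the
# original publication and may differ from the defaults in danq_train.py.
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Dropout, Bidirectional, LSTM, Flatten, Dense

model = Sequential([
    # one-hot encoded 1000-bp input sequence: (length, 4 channels for A/C/G/T)
    Conv1D(320, kernel_size=26, activation='relu', input_shape=(1000, 4)),
    MaxPooling1D(pool_size=13, strides=13),
    Dropout(0.2),
    Bidirectional(LSTM(320, return_sequences=True)),  # recurrent layer scans the pooled feature map
    Dropout(0.5),
    Flatten(),
    Dense(925, activation='relu'),
    Dense(919, activation='sigmoid'),                 # one output per chromatin feature (target label)
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
model.summary()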

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=45g --gres=gpu:v100,lscratch:10 --cpus-per-task=14
[user@cn4466 ~]$ module load DanQ
[+] Loading python 3.6  ... 
[+] Loading cuDNN 7.0  libraries... 
[+] Loading CUDA Toolkit  9.0.176  ... 
[+] Loading meme  5.0.1  on cn4466 
[+] Loading openmpi 2.1.1  for GCC 4.8.5 
[+] Loading DanQ 20190507  ... 
(a lower-case name, danq, for the module will also work).

The code comprises three executables: danq_train.py, danq_predict.py and danq_visualize.py. To display a usage message for a given executable, type its name followed by the option "-h". For example:
[user@cn4466 ~]$ danq_train.py -h
Using TensorFlow backend.
usage: danq_train.py [-h] [-b batch_size] -d data_folder [-e num_epochs]
                     [-g num_gpus] [-l learning_rate] [-m model_name]
                     [-n debug_size] [-o optimizer] [-p out_prefix] [-s] [-v]
                     [-w in_prefix]

optional arguments:
  -h, --help            show this help message and exit
  -b batch_size, --bs batch_size
                        batch size; default=500
  -e num_epochs, --num_epochs num_epochs
                        number of epochs; default=60
  -g num_gpus, --num_gpus num_gpus
                        number of gpus to use; default=1
  -l learning_rate, --lr learning_rate
                        learning rate; default=1.e-4
  -m model_name, --model_name model_name
                        model name: DanQ | DeepSEA
  -n debug_size, --num_debug_data debug_size
                        number of training examples to use when debugging
  -o optimizer, --optimizer optimizer
                        optimizer: adam | rmsprop | sgd
  -p out_prefix, --checkpoint_prefix out_prefix
                        prefix of the output checkpoint file
                        ..h5
  -s, --schedule        vary the learning rate according to a schedule
  -v, --verbose         increase the verbosity level of output
  -w in_prefix, --pre-trained_weights in_prefix
                        prefix of the input checkpoint file
                        ..h5

required arguments:
  -d data_folder, --data_folder data_folder
                        path to the data folder
The code implements two network models: DanQ and DeepSEA. In order to train either model, first copy the available sample data:
[user@cn4466 ~]$ cp -r $DANQ_DATA/*  .
This command will copy the following to your current directory:
- folder "data" that contains sample input data for DanQ,
- folder "checkpoints" that stores the pre-computed checkpoint files, and
- folder "predictions" with sample predicted results.

The following command will train the (default) DanQ model on the (default) data stored in the MAT file data/train.mat:
[user@cn4466 ~]$ danq_train.py -d data
...
Epoch 1/60
4400000/4400000 [==============================] - 3248s 738us/step - loss: 0.0759 - val_loss: 0.0629
Epoch 2/60
4400000/4400000 [==============================] - 3312s 753us/step - loss: 0.0677 - val_loss: 0.0574
Epoch 3/60
4400000/4400000 [==============================] - 3253s 739us/step - loss: 0.0652 - val_loss: 0.0560
Epoch 4/60
4400000/4400000 [==============================] - 3142s 714us/step - loss: 0.0635 - val_loss: 0.0547
Epoch 5/60
4400000/4400000 [==============================] - 3134s 712us/step - loss: 0.0624 - val_loss: 0.0542
Epoch 6/60
4400000/4400000 [==============================] - 3136s 713us/step - loss: 0.0616 - val_loss: 0.0545
...
The training takes approximately 50 minutes per epoch when a single V100 GPU is used. The result of the training (i.e. a checkpoint file) will be stored in the folder "checkpoints", in HDF5 format. The name of the checkpoint file will be
checkpoints/<out_prefix>.<model_name>.h5
or, in this particular case, danq.DanQ.h5.
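If you want to reuse a trained network outside of the DanQ scripts, the checkpoint can typically be opened directly with Keras. Whether the HDF5 file stores the full serialized model or only its weights depends on how danq_train.py saves it, so treat the following as a sketch rather than a guaranteed recipe:

# Load the checkpoint written by danq_train.py; this assumes the HDF5 file
# contains a full serialized model (architecture + weights). If it holds
# weights only, rebuild the architecture first and use model.load_weights().
from keras.models import load_model

model = load_model('checkpoints/danq.DanQ.h5')
model.summary()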
To train another model on the same data, specify the model name with command line option -m. For example:
[user@cn4466 ~]$ danq_train.py -d data -m DeepSEA
...
The command line options for the prediction code are:
[user@cn4466 ~]$ danq_predict.py -h
Using TensorFlow backend.
usage: danq_predict.py [-h] -d data_folder [-m model_name] [-M]
                       [-o test_results] [-p in_prefix] [-s]

optional arguments:
  -h, --help            show this help message and exit
  -m model_name, --model_name model_name
                        model name: DanQ | DeepSEA
  -M, --motif           predict_motifs, rather than target labels
  -o test_results, --output test_results
                        output file with test results;
                        default='test_results.h5'
  -p in_prefix, --checkpoint_prefix in_prefix
                        prefix of the input checkpoint file
                        ..h5
  -s, --motif_seqs      predict motif sequences, rather than target labels

required arguments:
  -d data_folder, --data_folder data_folder
                        path to the data folder
To make predictions of the target labels with the DanQ model on the testing data stored in the MAT file data/test.mat, using the "best" pre-trained checkpoint file checkpoints/best.DanQ.h5, type:
[user@cn4466 ~]$ danq_predict.py -d data -p best 
...
455024/455024 [==============================] - 982s 2ms/step
With this command, danq_predict.py will write the predicted results to the file test_results.h5.
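The layout of test_results.h5 is determined by danq_predict.py. Assuming it holds a matrix of per-sequence prediction probabilities, and that data/test.mat holds the matching target labels, a per-target ROC AUC can be computed along the following lines; the dataset names used here ('predictions' and 'testdata') are placeholders, so check the real ones with h5py first:

# Sketch: compute the ROC AUC for one target from the prediction file.
# The dataset names below are assumptions; print list(f.keys()) to see
# what danq_predict.py actually wrote.
import h5py
from sklearn.metrics import roc_auc_score

target = 0                                    # first of the 919 targets

with h5py.File('test_results.h5', 'r') as f:
    preds = f['predictions'][:]               # assumed shape (num_sequences, 919)

with h5py.File('data/test.mat', 'r') as f:
    labels = f['testdata'][:]                 # labels matching the test sequences

if labels.shape[0] != preds.shape[0]:
    labels = labels.T                         # MATLAB arrays may be stored transposed

print('AUC for target %d: %.4f' % (target + 1,
      roc_auc_score(labels[:, target], preds[:, target])))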
To visualize the predicted ROC curve for any one of the 919 available targets, run the executable danq_visualize.py,
[user@cn4466 ~]$ danq_visualize.py -h 
Using TensorFlow backend.
usage: danq_visualize.py [-h] [-m model_name] [-s] -t target_id

optional arguments:
  -h, --help            show this help message and exit
  -m model_name, --model_name model_name
                        model name: DanQ | DeepSEA
  -s, --motif_sequence  visualize a motif sequence

required arguments:
  -t target_id, --target target_id
                        integer in [1,919] for ROC curve and [1,320] for motif
together with the option -t to specify a particular target. For example:
[user@cn4466 ~]$ danq_visualize.py -t 1 

To make predictions of motif sequences, first run danq_predict.py with the option -s:
[user@cn4466 ~]$ danq_predict.py -s -d data
This command will produce a file motifs.txt. Then visualize a particular motif by providing its ID with the -t option. For example:
[user@cn4466 ~]$ danq_visualize.py -s -t 43 

In order to train the DanQ code using multiple GPUs,
- allocate a session with an appropriate number of GPUs (up to 4 GPUs per session are allowed),
- specify how many GPUs to use with the command-line option -g, and
- specify a batch size that is a multiple of the number of GPUs you will be using (see the sketch after the example below).
For example:
[user@cn4466 ~]$ exit
[user@biowulf ~]$ sinteractive --mem=64g --gres=gpu:v100:4,lscratch:100 --cpus-per-task=14
[user@cn4471 ~]$ module load danq 
[user@cn4471 ~]$ cp -r $DANQ_DATA/* .
[user@cn4471 ~]$ danq_train.py -d data -g 4 -b 2000 
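In this version of Keras, multi-GPU training is typically implemented by replicating the model across devices with keras.utils.multi_gpu_model, which is presumably what the -g option drives; the snippet below only illustrates that mechanism and is not taken from the DanQ scripts themselves:

# Illustration of data-parallel training with keras.utils.multi_gpu_model
# (available in the Keras 2.x loaded by this module); not DanQ's actual code.
from keras.utils import multi_gpu_model

parallel_model = multi_gpu_model(model, gpus=4)   # 'model' built as in the architecture sketch above
parallel_model.compile(optimizer='rmsprop', loss='binary_crossentropy')
# The batch is split evenly across the 4 GPUs, which is why the batch size
# passed via -b should be a multiple of the number of GPUs passed via -g.
# parallel_model.fit(x_train, y_train, batch_size=2000, epochs=60)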
End the interactive session:
[user@cn4471 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. danq.sh). For example:

#!/bin/bash
module load DanQ
cp -r $DANQ_DATA/* .        # copy the sample data, checkpoints and predictions folders
danq_train.py -d data       # train the default DanQ model on data/train.mat

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] --partition=gpu --gres=gpu:v100:1 danq.sh