Biowulf High Performance Computing at the NIH
DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences

DanQ is a hybrid convolutional and bi-directional long short-term memory recurrent neural network framework for predicting non-coding function de novo from sequence.
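
As a rough illustration of what "from sequence" means for a convolutional model, the sketch below one-hot encodes a DNA string in plain Python. The helper name `one_hot` and the A, C, G, T column order are assumptions made for this example; the column order used by the DanQ data files may differ, so check it before reusing the sketch.

```python
# Minimal sketch: one-hot encode a DNA sequence (illustrative only).
# ASSUMPTION: column order A, C, G, T; the DanQ data files may use another order.
ALPHABET = "ACGT"


def one_hot(sequence):
    """Return one row per base; unknown bases (e.g. N) become all zeros."""
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows


for row in one_hot("ACGN"):
    print(row)
```

Each row has at most one non-zero entry, so a sequence of length L becomes an L x 4 matrix that a convolutional layer can scan.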


Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=45g --gres=gpu:v100,lscratch:10 -c14
[user@cn3107 ~]$ module load DanQ python/3.5 cuDNN/7.0/CUDA-9.0 CUDA/9.0
Copy the DanQ python scripts and sample data to the current folder:
[user@cn3107 ~]$ cp $DANQ_SRC/* .
[user@cn3107 ~]$ cp -r $DANQ_DATA .
To train the updated DanQ model, which uses TensorFlow as the Keras backend and Keras's built-in support for bi-directional Long Short-Term Memory (LSTM) networks, run the following command:
[user@cn3107 ~]$ python -e 2
A model file with the default name "MODEL.hdf5" will be produced. For simplicity, this command was run with only 2 epochs, which is not sufficient for proper training: according to the original DanQ publication, at least 60 epochs may be needed.

To view all the available training command line options, type:
[user@cn3107 ~]$ python -h
    python [options (-h to list)]

  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -b batch_size, --batch_size=batch_size
                        batch size, default=500
  -e num_epochs, --num_epochs=num_epochs
                        number of epochs, default=60
  -g num_gpus, --num_gpus=num_gpus
                        number of gpus to use, default=1
  -m model_name, --model_name=model_name
                        name of the file in which the trained model will be
                        stored, default='MODEL.hdf5'
  -n debug_size, --num_debug_data=debug_size
                        number of training examples to use for debugging
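
The help text above follows the layout that Python's standard `optparse` module produces. As a hedged sketch, and not the actual DanQ source, the options could be declared roughly as follows. The function name `build_parser` and the version string are placeholders invented for this example; the option names and defaults are copied from the listing above.

```python
# Sketch of an optparse declaration matching the help listing above.
# ASSUMPTION: this is not the real DanQ script; build_parser and the
# version string are hypothetical.
from optparse import OptionParser


def build_parser():
    parser = OptionParser(usage="%prog [options (-h to list)]",
                          version="%prog (placeholder version)")
    parser.add_option("-b", "--batch_size", type="int", dest="batch_size",
                      metavar="batch_size", default=500,
                      help="batch size, default=500")
    parser.add_option("-e", "--num_epochs", type="int", dest="num_epochs",
                      metavar="num_epochs", default=60,
                      help="number of epochs, default=60")
    parser.add_option("-g", "--num_gpus", type="int", dest="num_gpus",
                      metavar="num_gpus", default=1,
                      help="number of gpus to use, default=1")
    parser.add_option("-m", "--model_name", type="string", dest="model_name",
                      metavar="model_name", default="MODEL.hdf5",
                      help="name of the file in which the trained model "
                           "will be stored, default='MODEL.hdf5'")
    parser.add_option("-n", "--num_debug_data", type="int", dest="debug_size",
                      metavar="debug_size", default=None,
                      help="number of training examples to use for debugging")
    return parser


# Parse the same arguments as the 2-epoch training example.
options, args = build_parser().parse_args(["-e", "2"])
print(options.num_epochs, options.batch_size, options.model_name)
```

Options that are not given on the command line keep their defaults, which is why the 2-epoch example still writes to "MODEL.hdf5".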

To test the trained model, run the command:
[user@cn3107 ~]$ python
saving to predict.hdf5              
The original version of the DanQ code used Theano as the backend for Keras and the seya code to model the bi-directional LSTM.

To run the original code, prepend your commands with "danq". In this case, a Singularity container environment with pre-installed versions of Theano and seya will be used. For example, for training:
[user@cn3107 ~]$ danq python
Using GPU implementation
loading data
building model
compiling model
running at most 60 epochs
Train on 4400000 samples, validate on 8000 samples
Epoch 1/60
    100/4400000 [..............................] - ETA: 1174786s - loss: 0.6940     
    200/4400000 [..............................] - ETA: 1161186s - loss: 0.6787     
    300/4400000 [..............................] - ETA: 1153634s - loss: 0.6611     
    400/4400000 [..............................] - ETA: 1152997s - loss: 0.6379
The trained model will be saved in file 'DanQ_bestmodel.hdf5'.
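
The progress log above gives a feel for the cost of one epoch. The short calculation below is a sketch using only numbers shown on this page: 4400000 training samples, the default batch size of 500, and the first ETA value from the log. The batch size of 100 is an inference from the progress counter advancing in steps of 100, not a documented setting, and an ETA printed after the first few batches is only a rough early estimate.

```python
# Back-of-the-envelope arithmetic from the training log above.
import math

TRAIN_SAMPLES = 4400000  # "Train on 4400000 samples" in the log


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every training sample once."""
    return int(math.ceil(num_samples / float(batch_size)))


def seconds_to_days(seconds):
    return seconds / 86400.0


# ASSUMPTION: batch size 100 is inferred from the progress counter increments.
print(steps_per_epoch(TRAIN_SAMPLES, 100))
# Default batch size of the updated training script (-b option).
print(steps_per_epoch(TRAIN_SAMPLES, 500))
# First ETA reported in the log, converted to days.
print(round(seconds_to_days(1174786), 1))
```

Early ETA values typically include start-up overhead, so treat the converted figure as an upper-end guess rather than a measurement.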

For testing:
[user@cn3107 ~]$ danq python data/test.mat predict.hdf5
Using GPU implementation
building model
compiling model

End the interactive session:
[user@cn3107 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file. For example:

#!/bin/bash
module load DanQ
cp $DANQ_SRC/* .
cp -r $DANQ_DATA .

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#]
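
As a sketch of how the placeholders above might be filled in, the helper below assembles an sbatch command line that requests the same resources as the interactive session on this page (14 CPUs, 45 GB of memory, one V100 GPU and 10 GB of local scratch). The batch file name `danq.sh`, the function name `sbatch_command`, and the `--partition=gpu` setting are assumptions made for illustration; adjust them to your own job and cluster configuration.

```python
# Sketch: build an sbatch command line mirroring the interactive resources.
# ASSUMPTIONS: danq.sh is a hypothetical batch file name, and the GPU
# partition is assumed to be called "gpu".
import shlex


def sbatch_command(script, cpus=14, mem="45g",
                   gres="gpu:v100,lscratch:10", partition="gpu"):
    parts = [
        "sbatch",
        "--partition={}".format(partition),
        "--cpus-per-task={}".format(cpus),
        "--mem={}".format(mem),
        "--gres={}".format(gres),
        script,
    ]
    # Quote each argument so the string is safe to paste into a shell.
    return " ".join(shlex.quote(part) for part in parts)


print(sbatch_command("danq.sh"))
```

Printing the command instead of executing it makes it easy to review the requested resources before submitting.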