DeepSea on Biowulf

DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants.

References:

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program.
Sample session (user input in bold):

[user@biowulf ~]$ sinteractive --constraint=x2650 --cpus-per-task=32 --mem=10g --ntasks=1 --exclusive --gres=lscratch:10
salloc.exe: Pending job allocation 47562568
salloc.exe: job 47562568 queued and waiting for resources
salloc.exe: job 47562568 has been allocated resources
salloc.exe: Granted job allocation 47562568
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn0236 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn0236 ~]$ export TMPDIR=/lscratch/${SLURM_JOB_ID}

[user@cn0236 ~]$ mkdir /lscratch/${SLURM_JOB_ID}/test

[user@cn0236 ~]$ cd /lscratch/${SLURM_JOB_ID}/test

[user@cn0236 test]$ cp /fdb/deepsea/0.94c/example* .

[user@cn0236 test]$ module load deepsea
[+] Loading deepsea  0.94c  on cn0236
[+] Loading singularity  3.5.2  on cn0236

[user@cn0236 test]$ rundeepsea.py example.vcf out1
Successfully copied input to working directory /tmp/tmpIPQUab
Loading required package: BSgenome.Hsapiens.UCSC.hg19
Loading required package: BSgenome
Loading required package: methods
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, as.vector, cbind, colnames, do.call, duplicated,
    eval, evalq, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
    pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
    tapply, union, unique, unlist, unsplit

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
RangedData with 4 rows and 3 value columns across 3 spaces
     space                 ranges |         ori         mut
  <factor>              <IRanges> | <character> <character>
1     chr1 [109817041, 109818140] |           G           T
2    chr10 [ 23507814,  23508913] |           A           G
3    chr16 [   209160,    210259] |           T           C
4    chr16 [ 52598639,  52599738] |           C           T
                            name
                     <character>
1  [known_CEBP_binding_increase]
2 [known_FOXA2_binding_decrease]
3 [known_GATA1_binding_increase]
4 [known_FOXA1_binding_increase]
Number of valid variants:
4
Number of input variants:
4
Successfully converted to input format
Processing options
Switch to float
1
Processing options
Switch to float
1
Finished running DeepSEA. Now prepare output files...
[W::hts_idx_load2] The index file is older than the data file: ./resources/phastCons/primates_nohuman.tsv.gz.tbi
(4, 4)
Finished creating output file. Now clean up...
Everything done.

[user@cn0236 test]$ ls out1/
infile.vcf.out.alt     infile.vcf.out.logfoldchange  infile.vcf.wt1100.fasta.ref.vcf.evoall
infile.vcf.out.diff    infile.vcf.out.ref            infile.vcf.wt1100.fasta.ref.vcf.evo.evalues
infile.vcf.out.evalue  infile.vcf.out.snpclass
infile.vcf.out.funsig  infile.vcf.out.summary

[user@cn0236 test]$ rundeepsea-insilicomut.py example.vcf out2 469 #this will try to use all of the CPUs on a node
Successfully copied input to working directory /tmp/tmpCLZ7CN
Loading required package: BSgenome.Hsapiens.UCSC.hg19
Loading required package: BSgenome
Loading required package: methods
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, as.vector, cbind, colnames, do.call, duplicated,
    eval, evalq, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
    pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
    tapply, union, unique, unlist, unsplit

Loading required package: S4Vectors
Loading required package: stats4
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
RangedData with 4 rows and 3 value columns across 3 spaces
     space                 ranges |         ori         mut
  <factor>              <IRanges> | <character> <character>
1     chr1 [109817041, 109818140] |           G           T
2    chr10 [ 23507814,  23508913] |           A           G
3    chr16 [   209160,    210259] |           T           C
4    chr16 [ 52598639,  52599738] |           C           T
                            name
                     <character>
1  [known_CEBP_binding_increase]
2 [known_FOXA2_binding_decrease]
3 [known_GATA1_binding_increase]
4 [known_FOXA1_binding_increase]
Number of valid variants:
4
Number of input variants:
4
Successfully converted to input format
Processing options
Switch to float
1
Processing options
Switch to float
1
1025
2049
3073
4097
5121
6145
7169
8193
9217
10241
11265
12289
13313
14337
15361
16385
17409
18433
19457
20481
21505
22529
23553
Finished running DeepSEA. Now prepare output files...
/opt/conda/lib/python2.7/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):
Finished creating output file. Now clean up...
Everything done.

[user@cn0236 test]$ ls out2/
colorbar.png  log2foldchange_profile.csv  preview.png  vis.png

[user@cn0236 test]$ cp -r out* /data/${USER}

[user@cn0236 test]$ exit
exit
salloc.exe: Relinquishing job allocation 47562568

[user@biowulf ~]$ 

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. deepsea.sh). For example:

#!/bin/bash
set -e
module load deepsea
export TMPDIR=/lscratch/$SLURM_JOB_ID
rundeepsea-insilicomut.py example.vcf out

Submit this job using the Slurm sbatch command. As in the example above, be sure to allocate a node exclusively. For instance, you could submit this job using the following command

sbatch --constraint=x2650 --cpus-per-task=32 --mem=10g --ntasks=1 --exclusive --gres=lscratch:10 deepsea.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. deepsea.swarm). For example:

export TMPDIR=/lscratch/$SLURM_JOB_ID; rundeepsea-insilicomut.py example1.vcf out1
export TMPDIR=/lscratch/$SLURM_JOB_ID; rundeepsea-insilicomut.py example2.vcf out2
export TMPDIR=/lscratch/$SLURM_JOB_ID; rundeepsea-insilicomut.py example3.vcf out3
export TMPDIR=/lscratch/$SLURM_JOB_ID; rundeepsea-insilicomut.py example4.vcf out4

Submit this job using the swarm command. If you are using the rundeepsea-insilicomut.py script, be sure to allocate the nodes exclusively. For example:

swarm -f deepsea.swarm -g 10 -t 32 --module deepsea --sbatch "--constraint=x2650 --ntasks=1 --exclusive --gres=lscratch:10"