High-Performance Computing at the NIH
pyDNase on Biowulf & Helix


pyDNase is a set of Python programs for analyzing DNase-Seq data.

There may be multiple versions of pyDNase available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail pyDNase 

To select a module use

module load pyDNase/[version]

where [version] is the version of choice.

pyDNase is a multithreaded/multiprocessing application. Make sure to match the number of threads (for example, the -p option of wellington_footprints.py) to the number of CPUs requested for the job. If no thread count is provided, pyDNase will attempt to use all CPUs.
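
One way to keep the thread count tied to the allocation is to derive it from the Slurm environment rather than typing a number by hand. The following is a minimal sketch; the variable name `threads` and the fallback value of 2 are assumptions made for illustration, not pyDNase or Biowulf defaults.

```shell
# Use the Slurm allocation if the variable is set; otherwise fall back to a
# small fixed value (the fallback of 2 is an arbitrary illustrative choice).
threads="${SLURM_CPUS_PER_TASK:-2}"
echo "using ${threads} threads"

# The value would then be passed on explicitly, for example:
# wellington_footprints.py -p "${threads}" peaks.bed aln.bam outdir
```

Passing the count explicitly avoids the case where the tool sees every CPU on a node while the job has been allocated only a few of them.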

Environment variables set



Interactive job on Biowulf

Allocate an interactive session with sinteractive and analyze a test data set from ENCODE (chromosome 19 of replicate 1 of experiment ENCSR000CMN):

biowulf$ sinteractive -c4 --mem=10g
node$ module load pyDNase
node$ cp /usr/local/apps/pyDNase/TEST_DATA/* .
# ENCFF001OJZ_chr19.bam: mouse splenic B cell DNase-Seq, chr19 [mm9]
# ENCFF001OGK_chr19.bed: DNase-Seq peaks
# ctcf_chr19.bed:        CTCF motif hits in chr19
node$ ls -lh
total 372M
-rw-rw-r-- 1 user group  90K Jun 13 13:07 ctcf_chr19.bed
-rw-rw-r-- 1 user group 106K Jun 13 13:01 ENCFF001OGK_chr19.bed
-rw-rw-r-- 1 user group 370M Jun 13 13:01 ENCFF001OJZ_chr19.bam
-rw-rw-r-- 1 user group  88K Jun 13 13:01 ENCFF001OJZ_chr19.bam.bai

Call footprints with default settings:

node$ wellington_footprints.py -p $SLURM_CPUS_PER_TASK \
  ENCFF001OGK_chr19.bed ENCFF001OJZ_chr19.bam out
node$ less out/ENCFF001OJZ_chr19.bam.ENCFF001OGK_chr19_c123.bed.WellingtonFootprints.FDR.0.01.bed
chr19   3208466 3208491 Unnamed4598     -188.166915894  +
chr19   3292183 3292208 Unnamed4606     -141.639083862  +
chr19   3321015 3321040 Unnamed4616     -217.284454346  +
chr19   3283036 3283057 Unnamed4627     -160.54145813   +

node$ wc -l out/ENCFF001OJZ_chr19.bam.ENCFF001OGK_chr19_c123.bed.WellingtonFootprints.FDR.0.01.bed
1486 out/ENCFF001OJZ_chr19.bam.ENCFF001OGK_chr19_c123.bed.WellingtonFootprints.FDR.0.01.bed

This run yields 1486 footprints on chr19 at an FDR of 0.01. Next, create wiggle tracks of DNase cuts for the peak areas only:

node$ dnase_wig_tracks.py ENCFF001OGK_chr19_c123.bed ENCFF001OJZ_chr19.bam \
    ENCFF001OJZ_chr19_fwd.wig ENCFF001OJZ_chr19_rev.wig 

Here is an example of the created wiggle tracks and footprint scores around a CTCF site on chr19:

CTCF footprint example
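
To choose footprints worth inspecting alongside the wiggle tracks, the footprint BED file can be ranked by its score column. The sketch below uses the four rows shown in the output above as sample input and assumes that more negative scores in column 5 indicate stronger footprints; the variable names are illustrative.

```shell
# Sample rows copied from the footprint listing above; in practice, point
# `footprints` at the WellingtonFootprints.FDR.0.01.bed file instead.
footprints=$(mktemp)
printf 'chr19\t3208466\t3208491\tUnnamed4598\t-188.166915894\t+\n' >  "$footprints"
printf 'chr19\t3292183\t3292208\tUnnamed4606\t-141.639083862\t+\n' >> "$footprints"
printf 'chr19\t3321015\t3321040\tUnnamed4616\t-217.284454346\t+\n' >> "$footprints"
printf 'chr19\t3283036\t3283057\tUnnamed4627\t-160.54145813\t+\n'  >> "$footprints"

# Sort numerically on column 5 in ascending order, so the most negative
# scores come first, and keep the top two rows.
tab=$(printf '\t')
top=$(LC_ALL=C sort -t "$tab" -k5,5n "$footprints" | head -n 2)
printf '%s\n' "$top"
rm -f "$footprints"
```

Setting `LC_ALL=C` keeps the decimal point interpretation independent of the session locale.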

Next, determine which footprints overlap CTCF motifs and create a summary graph for those footprints:

node$ bedtools intersect -wa \
    -a out/ENCFF001OJZ_chr19.bam.ENCFF001OGK_chr19_c123.bed.WellingtonFootprints.FDR.0.01.bed \
    -b ctcf_chr19.bed > ctcf_chr19_footprints.bed
# the output file name below was chosen for illustration
node$ dnase_average_profile.py \
    ctcf_chr19_footprints.bed ENCFF001OJZ_chr19.bam \
    ctcf_chr19_footprints.png
node$ exit

Which generates the following figure:

CTCF footprint average
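
A note on the intersect step used for this profile: with -wa, bedtools intersect writes a footprint once for every motif it overlaps, so a footprint that overlaps two CTCF hits is listed twice and would be counted twice in the average. The sketch below shows one way to drop such repeats with awk; the rows are stand-ins taken from the footprint listing above, not actual CTCF overlaps.

```shell
# Stand-in for ctcf_chr19_footprints.bed: the first footprint is listed twice,
# as it would be if it overlapped two motif hits (illustrative rows only).
hits=$(mktemp)
printf 'chr19\t3208466\t3208491\tUnnamed4598\t-188.166915894\t+\n' >  "$hits"
printf 'chr19\t3208466\t3208491\tUnnamed4598\t-188.166915894\t+\n' >> "$hits"
printf 'chr19\t3292183\t3292208\tUnnamed4606\t-141.639083862\t+\n' >> "$hits"

# Keep only the first occurrence of each identical line, preserving order.
unique_hits=$(awk '!seen[$0]++' "$hits")
printf '%s\n' "$unique_hits"
rm -f "$hits"
```

An alternative is to pass -u to bedtools intersect instead of -wa, which reports each entry of the -a file at most once.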

Batch job on Biowulf

Create a batch script similar to the following example:

#!/bin/bash
# this file is pydnase.sh

module load pyDNase || exit 1
wellington_footprints.py -p ${SLURM_CPUS_PER_TASK} peaks.bed aln.bam outdir

Submit to the queue with sbatch:

biowulf$ sbatch --cpus-per-task=6 --mem=10g pydnase.sh
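
Because a batch job runs unattended, it can help to fail early when an input is missing rather than partway through the run. The following sketch adds such a check; `check_inputs` is a hypothetical helper written for this example, and creating the output directory up front is a precaution based on the assumption that the tool may not create it.

```shell
# Hypothetical helper: succeed only if every named file exists and is non-empty.
check_inputs() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            return 1
        fi
    done
}

# Possible use inside pydnase.sh (file names as in the batch script above;
# the .bai index is expected next to the BAM file, as in the test data):
# check_inputs peaks.bed aln.bam aln.bam.bai || exit 1
# mkdir -p outdir
# wellington_footprints.py -p "${SLURM_CPUS_PER_TASK}" peaks.bed aln.bam outdir
```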

Swarm of jobs on Biowulf

Create a swarm command file similar to the following example:

# this file is pydnase.swarm
wellington_footprints.py -p ${SLURM_CPUS_PER_TASK} peaks1.bed aln1.bam outdir1
wellington_footprints.py -p ${SLURM_CPUS_PER_TASK} peaks2.bed aln2.bam outdir2
wellington_footprints.py -p ${SLURM_CPUS_PER_TASK} peaks3.bed aln3.bam outdir3

And submit to the queue with swarm:

biowulf$ swarm -f pydnase.swarm -t 6 -g 10 --module pyDNase
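
For more than a handful of samples, the swarm command file can be generated with a loop instead of being written by hand. This is a minimal sketch that assumes the numbered file naming pattern of the example above (peaksN.bed, alnN.bam, outdirN); adjust the list and names to the actual data.

```shell
# Write one command line per sample into the swarm file.
swarmfile="pydnase.swarm"
for i in 1 2 3; do
    # The single-quoted format keeps ${SLURM_CPUS_PER_TASK} literal, so it is
    # expanded later, when each swarm subjob runs, not while writing the file.
    printf 'wellington_footprints.py -p ${SLURM_CPUS_PER_TASK} peaks%s.bed aln%s.bam outdir%s\n' \
        "$i" "$i" "$i"
done > "$swarmfile"

cat "$swarmfile"
```

The resulting file can be submitted with the same swarm command shown above.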