pyDNase on Biowulf

pyDNase is a set of Python programs for analyzing DNase-Seq data.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --cpus-per-task=4 --mem=10g --gres=lscratch:20
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job
[user@cn3144]$ module load pyDNase
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ cp $PYDNASE_TEST_DATA/* .
# ENCFF001OJZ_chr19.bam: mouse splenic B cell DNase-Seq, chr19 [mm9]
# ENCFF001OGK_chr19.bed: DNase-Seq peaks
# ctcf_chr19.bed:        CTCF motif hits in chr19
[user@cn3144]$ ls -lh
total 372M
-rw-rw-r-- 1 user group  90K Jun 13 13:07 ctcf_chr19.bed
-rw-rw-r-- 1 user group 106K Jun 13 13:01 ENCFF001OGK_chr19.bed
-rw-rw-r-- 1 user group 370M Jun 13 13:01 ENCFF001OJZ_chr19.bam
-rw-rw-r-- 1 user group  88K Jun 13 13:01 ENCFF001OJZ_chr19.bam.bai
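
pyDNase reads alignments from a coordinate-sorted, indexed BAM file; the test data already includes the .bai index. For your own data, an index can be created with samtools (a minimal sketch, assuming the samtools module; aln.bam is a placeholder name for a sorted BAM):

[user@cn3144]$ module load samtools
[user@cn3144]$ samtools index aln.bam   # writes aln.bam.bai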

Call footprints with default settings

[user@cn3144]$ mkdir out
[user@cn3144]$ wellington_footprints.py -p $SLURM_CPUS_PER_TASK \
  ENCFF001OGK_chr19.bed ENCFF001OJZ_chr19.bam out
Reading BED File...
[################################] 4593/4593 - 00:00:00
Calculating footprints...
[################################] 4593/4593 - 00:03:44
Waiting for the last 40 jobs to finish...

[user@cn3144]$ head -n4 out/ENCFF001OJZ_chr19.bam.ENCFF001OGK_chr19.bed.WellingtonFootprints.FDR.0.01.bed
chr19   3208466 3208491 Unnamed4598     -188.1669158935547      +
chr19   3283036 3283057 Unnamed4606     -160.5414581298828      +
chr19   3292183 3292208 Unnamed4615     -141.6390838623047      +
chr19   3321015 3321040 Unnamed4623     -217.28445434570312     +

[user@cn3144]$ wc -l out/ENCFF001OJZ_chr19.bam.ENCFF001OGK_chr19.bed.WellingtonFootprints.FDR.0.01.bed
1482 out/ENCFF001OJZ_chr19.bam.ENCFF001OGK_chr19.bed.WellingtonFootprints.FDR.0.01.bed
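
The fifth BED column holds the Wellington footprint score, which is negative (more negative means stronger footprint evidence). As an illustrative sketch, a stricter subset could be extracted with awk; the -150 cutoff here is arbitrary:

[user@cn3144]$ awk '$5 <= -150' out/ENCFF001OJZ_chr19.bam.ENCFF001OGK_chr19.bed.WellingtonFootprints.FDR.0.01.bed > strong_footprints.bed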

So 1482 footprints were called on chr19 at an FDR of 0.01. Next, create wiggle tracks of DNase cuts for the peak regions only.

[user@cn3144]$ dnase_wig_tracks.py ENCFF001OGK_chr19.bed ENCFF001OJZ_chr19.bam \
    ENCFF001OJZ_chr19_fwd.wig ENCFF001OJZ_chr19_rev.wig 
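
To view the tracks in a genome browser, the wiggle files can optionally be converted to bigWig. A sketch assuming the UCSC tools are available via the ucsc module (the test data are aligned to mm9):

[user@cn3144]$ module load ucsc
[user@cn3144]$ fetchChromSizes mm9 > mm9.chrom.sizes
[user@cn3144]$ wigToBigWig ENCFF001OJZ_chr19_fwd.wig mm9.chrom.sizes ENCFF001OJZ_chr19_fwd.bw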

Here is an example of the created wiggle tracks and footprint scores around a CTCF site on chr19:

[Figure: CTCF footprint example]

Next, determine which footprints overlap CTCF motifs and plot the average cut profile for those footprints:

[user@cn3144]$ module load bedtools
[user@cn3144]$ bedtools intersect -wa \
    -a out/ENCFF001OJZ_chr19.bam.ENCFF001OGK_chr19.bed.WellingtonFootprints.FDR.0.01.bed \
    -b ctcf_chr19.bed > ctcf_chr19_footprints.bed
[user@cn3144]$ dnase_average_profile.py \
    ctcf_chr19_footprints.bed ENCFF001OJZ_chr19.bam \
    ctcf_chr19_footprints_average.png
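
pyDNase also ships dnase_to_javatreeview.py, which writes a per-region cut matrix as a CSV for heatmap visualization in JavaTreeView. A sketch with the same inputs; check dnase_to_javatreeview.py --help for the exact arguments:

[user@cn3144]$ dnase_to_javatreeview.py \
    ctcf_chr19_footprints.bed ENCFF001OJZ_chr19.bam \
    ctcf_chr19_footprints.csv
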
[user@cn3144]$ exit
[user@biowulf]$

This generates the following figure:

[Figure: CTCF footprint average profile]

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. pyDNase.sh). For example:

#! /bin/bash

module load pyDNase/0.2.6 || exit 1
mkdir -p outdir
wellington_footprints.py -p ${SLURM_CPUS_PER_TASK} peaks.bed aln.bam outdir

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=6 --mem=10g pyDNase.sh
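
For large BAM files it can help to stage inputs in local scratch, as in the interactive session above. A sketch of such a script, assuming lscratch is requested at submission time and that /data/$USER/project (a placeholder path) holds the inputs:

#! /bin/bash

module load pyDNase/0.2.6 || exit 1
cd /lscratch/$SLURM_JOB_ID || exit 1
cp /data/$USER/project/peaks.bed /data/$USER/project/aln.bam* .
mkdir -p outdir
wellington_footprints.py -p ${SLURM_CPUS_PER_TASK} peaks.bed aln.bam outdir
cp -r outdir /data/$USER/project/

Submit with matching resource requests:

sbatch --cpus-per-task=6 --mem=10g --gres=lscratch:20 pyDNase.sh
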
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. pyDNase.swarm). For example:

mkdir -p outdir1 && wellington_footprints.py -p ${SLURM_CPUS_PER_TASK} peaks1.bed aln1.bam outdir1
mkdir -p outdir2 && wellington_footprints.py -p ${SLURM_CPUS_PER_TASK} peaks2.bed aln2.bam outdir2
mkdir -p outdir3 && wellington_footprints.py -p ${SLURM_CPUS_PER_TASK} peaks3.bed aln3.bam outdir3
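
For a larger number of samples, the swarmfile can be generated with a small loop rather than written by hand. A sketch assuming inputs named peaksN.bed/alnN.bam (the backslash keeps $SLURM_CPUS_PER_TASK from being expanded until each subjob runs):

for i in 1 2 3; do
    echo "mkdir -p outdir${i} && wellington_footprints.py -p \$SLURM_CPUS_PER_TASK peaks${i}.bed aln${i}.bam outdir${i}"
done > pyDNase.swarm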

Submit this job using the swarm command.

swarm -f pyDNase.swarm -g 10 -t 6 --module pyDNase
where
  -g #              Number of Gigabytes of memory required for each process (1 line in the swarm command file)
  -t #              Number of threads/CPUs required for each process (1 line in the swarm command file)
  --module pyDNase  Loads the pyDNase module for each subjob in the swarm