Biowulf High Performance Computing at the NIH
drompa on Biowulf

From the DROMPA manual:

DROMPA (DRaw and Observe Multiple enrichment Profiles and Annotation) is a program for user-friendly and flexible ChIP-seq pipelining. DROMPA can be used for quality checking, PCR-bias filtering, normalization, peak calling, visualization, and other analyses of ChIP-seq data. DROMPA is designed to be easy to use, even for users without a strong bioinformatics background.

References:

  • R. Nakato and K. Shirahige. Statistical Analysis and Quality Assessment of ChIP-seq Data with DROMPA. In: Muzi-Falconi M., Brown G. (eds) Genome Instability. Methods in Molecular Biology, vol 1672. Humana Press, New York, NY. PubMed
Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=5g --gres=lscratch:10
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ module load drompa
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ cp -L ${DROMPA_TEST_DATA:-none}/* .
[user@cn3144]$ ls -lh
total 858M
-rw-r--r-- 1 user group 387M Sep 24 15:22 ENCFF001NGB.bam
-rw-r--r-- 1 user group 458M Sep 24 15:22 ENCFF001NHS.bam
-rw-r--r-- 1 user group   72 Sep 24 15:22 README
-rw-r----- 1 user group  10M Sep 24 15:22 refFlat.txt

Create a tab-separated file with chromosome sizes. drompa comes with a script (makegenometable.pl) to generate this from the fasta file of a genome, but that would be inefficient in this case; we can extract the sizes directly from the BAM header:

[user@cn3144]$ samtools view -H ENCFF001NGB.bam \
                  | grep '@SQ' \
                  | tr ':' ' ' \
                  | awk '{printf("%s\t%s\n", $3, $5)}' \
                  > genome.tab
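
The pipeline above pulls the `SN:` (sequence name) and `LN:` (length) fields from each `@SQ` header line. As a quick sanity check, the same pipeline can be run on a mock header (the chromosome sizes below are illustrative, not taken from the test data):

```shell
# Mock @SQ lines standing in for `samtools view -H`; after tr replaces
# ':' with ' ', awk sees the SN value in field 3 and the LN value in field 5.
printf '@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:chr2\tLN:242193529\n' \
    | grep '@SQ' \
    | tr ':' ' ' \
    | awk '{printf("%s\t%s\n", $3, $5)}'
# chr1    248956422
# chr2    242193529
```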

Parse the CTCF ChIP-seq alignments (ENCFF001NGB.bam) and IgG control (ENCFF001NHS.bam) into binned data:

[user@cn3144]$ parse2wig -i ENCFF001NGB.bam -f BAM -o ENCFF001NGB -gt genome.tab
======================================
parse2wig version 3.5.0

Input file: ENCFF001NGB.bam
        Format: BAM, SINGLE_END
Output file: parse2wigdir/ENCFF001NGB
        Format: BINARY
Genome-table file: genome.tab
Binsize: 100 bp
Fragment length: 150
PCR bias filtering: ON
Read number for library complexity: 10.0 M

Total read normalization: NONE
======================================
Parsing ENCFF001NGB.bam...
done.

Checking redundant reads: redundancy threshold >1
[...snip...]
[user@cn3144]$ parse2wig -i ENCFF001NHS.bam -f BAM -o ENCFF001NHS -gt genome.tab
[...snip...]
[user@cn3144]$ ls -lh
total 858M
-rw-r--r-- 1 user group 387M Sep 24 15:22 ENCFF001NGB.bam
-rw-r--r-- 1 user group 458M Sep 24 15:22 ENCFF001NHS.bam
-rw-r--r-- 1 user group  317 Sep 24 15:32 genome.tab
drwxr-xr-x 2 user group 4.0K Sep 24 15:33 parse2wigdir
-rw-r--r-- 1 user group   72 Sep 24 15:22 README
-rw-r----- 1 user group  10M Sep 24 15:22 refFlat.txt

The binned data files are stored in parse2wigdir by default, though that can be changed. Next, call peaks:

[user@cn3144]$ drompa_peakcall PC_SHARP \
                      -i parse2wigdir/ENCFF001NGB,parse2wigdir/ENCFF001NHS \
                      -p ChIPpeak -gt genome.tab -norm 2
======== PC_SHARP ========
drompa_peakcall version 3.5.0
   output prefix: ChIPpeak
   genometable: genome.tab
   ChIP: parse2wigdir/ENCFF001NGB
   Input: parse2wigdir/ENCFF001NHS
   Input format: BINARY
   binsize: 100
   smoothing width: 500 bp
   Peak intensity threshold: 0.00
   ChIP/Input normalization: NCIS
   Enrichment threshold: 0.00
   p-value threshold (internal, -log10): 4.00e+00
   p-value threshold (enrichment, -log10): 4.00e+00
   q-value threshold: 1.00e-03
======================================
chr=1
sample1: genome: ChIP read=16117944, Input read=17440328, ChIP/Input = 0.66
[...snip...]

[user@cn3144]$ wc -l ChIPpeak.bed
43117 ChIPpeak.bed
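
Beyond a total count, the called peaks can be summarized quickly with standard tools. A minimal sketch, assuming ChIPpeak.bed follows standard BED conventions (column 1 is the chromosome):

```shell
# Count called peaks per chromosome and list the most peak-rich
# chromosomes first; ChIPpeak.bed is the drompa_peakcall output above.
cut -f1 ChIPpeak.bed | sort | uniq -c | sort -rn | head
```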

Then, drompa_draw can be used to generate a number of different plots based on the data.

[user@cn3144]$ module load R
[user@cn3144]$ echo -e "chr6\t122410087\t122610086" > region
[user@cn3144]$ drompa_draw PC_SHARP -p ChIPseq -gt genome.tab \
                      -i parse2wigdir/ENCFF001NGB,parse2wigdir/ENCFF001NHS,CTCF \
                      -gene refFlat.txt -show_itag 2 -ls 200 -lpp 2 -scale_tag 30 -r region
[user@cn3144]$ drompa_draw PROFILE -p peaks -gt genome.tab \
                      -i parse2wigdir/ENCFF001NGB,parse2wigdir/ENCFF001NHS,CTCF \
                      -ptype 4 -bed ChIPpeak.bed -stype 1
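
The region file passed to -r above holds one tab-separated region per line (chromosome, start, end). Assuming the format extends to multiple lines, several regions could be listed in one file; the chr20 coordinates here are purely hypothetical, for illustration:

```shell
# One region per line: chrom<TAB>start<TAB>end.
# The chr6 region matches the example above; chr20 is a made-up second region.
printf 'chr6\t122410087\t122610086\nchr20\t1000000\t1200000\n' > region
```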

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$

These commands create PDF files for the PC_SHARP and PROFILE plots, respectively.

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. drompa.sh). For example:

#! /bin/bash
module load drompa/3.5.0 samtools || exit 1
wd=$PWD
cd /lscratch/$SLURM_JOB_ID

cp -L ${DROMPA_TEST_DATA:-none}/* .
samtools view -H ENCFF001NGB.bam \
    | grep '@SQ' \
    | tr ':' ' ' \
    | awk '{printf("%s\t%s\n", $3, $5)}' \
    > genome.tab
parse2wig -i ENCFF001NGB.bam -f BAM -o ENCFF001NGB -gt genome.tab
parse2wig -i ENCFF001NHS.bam -f BAM -o ENCFF001NHS -gt genome.tab
drompa_peakcall PC_SHARP \
    -i parse2wigdir/ENCFF001NGB,parse2wigdir/ENCFF001NHS \
    -p ChIPpeak -gt genome.tab -norm 2

mv ChIPpeak.* $wd

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=2 --mem=5g --gres=lscratch:10 drompa.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. drompa.swarm). For example:

parse2wig -i sample1.bam -f BAM -o sample1 -gt genome.tab
parse2wig -i sample2.bam -f BAM -o sample2 -gt genome.tab
parse2wig -i sample3.bam -f BAM -o sample3 -gt genome.tab
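
Rather than writing the swarmfile by hand, it can be generated from the BAM files on disk. A sketch, assuming your files match the placeholder pattern sample*.bam:

```shell
# Emit one parse2wig command per BAM file in the current directory;
# ${b%.bam} strips the .bam suffix to form the output prefix.
for b in sample*.bam; do
    echo "parse2wig -i $b -f BAM -o ${b%.bam} -gt genome.tab"
done > drompa.swarm
```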

Submit this job using the swarm command.

swarm -f drompa.swarm -g 5 -t 1 --module drompa/3.5.0
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module drompa/3.5.0 Loads the drompa module for each subjob in the swarm