msCentipede on Biowulf

From the msCentipede page:

msCentipede is an algorithm for accurately inferring transcription factor binding sites using chromatin accessibility data (DNase-seq, ATAC-seq). The hierarchical multiscale model underlying msCentipede identifies factor-bound genomic sites by using patterns in DNA cleavage resulting from the action of nucleases in open chromatin regions (regions typically bound by transcription factors).

References:

  • A. Raj, H. Shim, Y. Gilad, J. K. Pritchard and M. Stephens. msCentipede: Modeling Heterogeneity across Genomic Sites and Replicates Improves Accuracy in the Inference of Transcription Factor Binding. PLoS ONE 10: e0138030.
Documentation

  • msCentipede on GitHub: https://github.com/rajanil/msCentipede

Important Notes

  • Module Name: mscentipede
  • Example files in $MSCENTIPEDE_TEST_DATA

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --gres=lscratch:10 --mem=8g
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load mscentipede
[user@cn3144]$ cp -Lr ${MSCENTIPEDE_TEST_DATA:-none}/* .
[user@cn3144]$ ls -lh
total 2.1G
-rw-r--r-- 1 user group 6.7K Jun 15 08:04 Ctcf_chr10_motifs.txt.gz
drwxr-xr-x 2 user group 4.0K Jun 15 08:04 expected
-rw-r--r-- 1 user group 1.1G Jun 15 08:04 Gm12878_Rep1.bam
-rw-r--r-- 1 user group 2.9M Jun 15 08:04 Gm12878_Rep1.bam.bai
-rw-r--r-- 1 user group 960M Jun 15 08:04 Gm12878_Rep2.bam
-rw-r--r-- 1 user group 2.9M Jun 15 08:04 Gm12878_Rep2.bam.bai
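
The ${MSCENTIPEDE_TEST_DATA:-none} expansion in the copy command above is a small safety idiom: if the module is not loaded and the variable is unset, the expression expands to the literal string 'none' and cp fails with a harmless "cannot stat" error, instead of the glob matching against the filesystem root. A quick illustration (run in a subshell so the live session keeps the variable):

[user@cn3144]$ (unset MSCENTIPEDE_TEST_DATA; echo ${MSCENTIPEDE_TEST_DATA:-none}/*)
none/*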

Learn the model parameters. This step creates two files, Ctcf_chr10_motifs_msCentipede_log.txt and Ctcf_chr10_motifs_msCentipede_model_parameters.pkl:

[user@cn3144]$ call_binding.py --task learn Ctcf_chr10_motifs.txt.gz \
  Gm12878_Rep1.bam Gm12878_Rep2.bam
loading motifs ...
num of motif sites = 471
loading read counts ...
transforming data into multiscale representation ...
starting model estimation (restart 1)
initial log likelihood = -1.17e+02
iteration 1: log likelihood = -7.26e+01, change in log likelihood = 4.46e+01, iteration time = 70.141 secs
iteration 2: log likelihood = -7.18e+01, change in log likelihood = 7.79e-01, iteration time = 64.089 secs
iteration 3: log likelihood = -7.16e+01, change in log likelihood = 2.08e-01, iteration time = 65.901 secs
iteration 4: log likelihood = -7.16e+01, change in log likelihood = 3.24e-03, iteration time = 67.967 secs
iteration 5: log likelihood = -7.16e+01, change in log likelihood = -2.91e-04, iteration time = 67.133 secs
iteration 6: log likelihood = -7.16e+01, change in log likelihood = 2.64e-03, iteration time = 68.707 secs
iteration 7: log likelihood = -7.16e+01, change in log likelihood = -1.30e-02, iteration time = 67.599 secs
iteration 8: log likelihood = -7.16e+01, change in log likelihood = 1.47e-02, iteration time = 70.582 secs
iteration 9: log likelihood = -7.16e+01, change in log likelihood = 2.17e-04, iteration time = 70.483 secs
iteration 10: log likelihood = -7.16e+01, change in log likelihood = 3.06e-04, iteration time = 69.323 secs
iteration 11: log likelihood = -7.16e+01, change in log likelihood = -3.05e-05, iteration time = 70.625 secs
iteration 12: log likelihood = -7.16e+01, change in log likelihood = 4.41e-05, iteration time = 69.663 secs
iteration 13: log likelihood = -7.16e+01, change in log likelihood = 2.02e-07, iteration time = 69.483 secs
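
The run above uses default settings. The upstream msCentipede documentation also describes options such as --model (to choose among the msCentipede model variants, e.g. msCentipede_flexbg) and --restarts (number of random restarts for the optimization). These flag names are taken from the msCentipede README; confirm the exact set available on Biowulf with call_binding.py --help. A hedged example:

[user@cn3144]$ call_binding.py --task learn --model msCentipede_flexbg --restarts 2 \
  Ctcf_chr10_motifs.txt.gz Gm12878_Rep1.bam Gm12878_Rep2.bam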

Infer factor binding and plot a factor-specific cleavage profile:

[user@cn3144]$ call_binding.py --task infer Ctcf_chr10_motifs.txt.gz \
  Gm12878_Rep1.bam Gm12878_Rep2.bam

[user@cn3144]$ plot_accessibility_profile.py Ctcf_chr10_motifs.txt.gz

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$
[Figure: Accessibility profile of CTCF sites]

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. mscentipede.sh). For example:

#!/bin/bash
module load mscentipede || exit 1
cd /lscratch/$SLURM_JOB_ID || exit 1
cp -Lr ${MSCENTIPEDE_TEST_DATA:-none}/* .

call_binding.py --task learn Ctcf_chr10_motifs.txt.gz \
  Gm12878_Rep1.bam Gm12878_Rep2.bam
call_binding.py --task infer Ctcf_chr10_motifs.txt.gz \
  Gm12878_Rep1.bam Gm12878_Rep2.bam
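
# Suggested addition, not part of the original example: /lscratch is deleted when
# the job ends, so copy results back to permanent storage first. The destination
# below is a placeholder; adjust it to your own data directory.
cp Ctcf_chr10_motifs_msCentipede* /data/$USER/mscentipede/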

Submit this job using the Slurm sbatch command.

sbatch --gres=lscratch:10 --mem=6g mscentipede.sh
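
For the test data, the learn step alone takes about 15 minutes (13 iterations at roughly 70 seconds each in the session above). For larger motif sets, consider requesting walltime explicitly with the standard sbatch --time flag, for example:

sbatch --gres=lscratch:10 --mem=6g --time=4:00:00 mscentipede.sh
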
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. mscentipede.swarm). For example:

call_binding.py --task infer Ctcf_chr10_motifs.txt.gz \
  sample1_Rep1.bam sample1_Rep2.bam
call_binding.py --task infer Ctcf_chr10_motifs.txt.gz \
  sample2_Rep1.bam sample2_Rep2.bam

Submit this job using the swarm command.

swarm -f mscentipede.swarm [-g #] [-t #] --module mscentipede
where
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file)
--module mscentipede Loads the mscentipede module for each subjob in the swarm
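
For example, a concrete invocation for the two-line swarm file above might be (the memory value is a guess based on the interactive allocation earlier on this page; msCentipede is not known to be multithreaded, so -t is left at its default):

swarm -f mscentipede.swarm -g 8 --module mscentipede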