High-Performance Computing at the NIH
atac_dnase_pipelines on Biowulf

This pipeline performs automated end-to-end quality control and processing of ATAC-seq or DNase-seq data. On our system the pipeline is installed in a mode that does not support cluster execution: parallelization happens only within a single job, and the pipeline cannot submit subjobs to Slurm.

This pipeline has been deprecated and is being replaced by the WDL-based ENCODE ATAC-seq pipeline.


Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --mem=30g --cpus-per-task=16 --gres=lscratch:100
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load atac_dnase_pipelines
[user@cn3144 ~]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144 ~]$ cp -L ${ATAC_DNASE_PIPELINES_TEST_DATA}/*.fastq.gz .
[user@cn3144 ~]$ tree
.
|-- ENCFF141NYG.fastq.gz
`-- ENCFF566LXM.fastq.gz

These files (ENCFF566LXM and ENCFF141NYG) are paired-end 100 nt ENCODE ATAC-seq reads from human thoracic aorta. After setting up the test data, run the pipeline with

[user@cn3144 ~]$ atac_dnase_pipelines -species hg19 -nth $SLURM_CPUS_PER_TASK \
    -type atac-seq \
    -fastq1_1 ENCFF566LXM.fastq.gz \
    -fastq1_2 ENCFF141NYG.fastq.gz \
    -auto_detect_adapter \
    -ENCODE3
[...snip...]
[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

This run will take several hours.
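Note that /lscratch is node-local and purged when the allocation ends, so copy any results you need to permanent storage before exiting. A minimal sketch, assuming the pipeline wrote its results to a directory named `out` (check your run's options); `DEST` defaults to a directory under `$HOME` as a portable example, but on Biowulf `/data/$USER` is the usual location for large result sets:

```shell
# /lscratch is deleted when the job ends -- save results first.
# "out" is an assumed output directory name; adjust for your run.
# DEST is an example destination (on Biowulf, prefer /data/$USER).
DEST="${DEST:-$HOME/atac_results}"
mkdir -p "$DEST"
if [ -d out ]; then
    cp -r out "$DEST/"
fi
echo "results directory: $DEST"
```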

Batch job
Most jobs should be run as batch jobs.

Create a batch script (e.g. atac_test.sh). For example:

#!/bin/bash
# this file is atac_test.sh
module load atac_dnase_pipelines || exit 1

cd /lscratch/$SLURM_JOB_ID
cp -L ${ATAC_DNASE_PIPELINES_TEST_DATA}/*.fastq.gz .

echo "START: $(date)"
atac_dnase_pipelines -species hg19 -nth $SLURM_CPUS_PER_TASK \
    -type atac-seq \
    -fastq1_1 ENCFF566LXM.fastq.gz \
    -fastq1_2 ENCFF141NYG.fastq.gz \
    -auto_detect_adapter \
    -ENCODE3
echo "END: $(date)"

Submit this job using the Slurm sbatch command.

sbatch --cpus-per-task=16 --mem=40g --time=24:00:00 --gres=lscratch:50 atac_test.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. atac_dnase_pipelines.swarm). For example:

atac_dnase_pipelines -species hg19 -nth $SLURM_CPUS_PER_TASK \
    -type atac-seq \
    -fastq1_1 sample1_R1.fastq.gz \
    -fastq1_2 sample1_R2.fastq.gz \
    -auto_detect_adapter \
    -ENCODE3
atac_dnase_pipelines -species hg19 -nth $SLURM_CPUS_PER_TASK \
    -type atac-seq \
    -fastq1_1 sample2_R1.fastq.gz \
    -fastq1_2 sample2_R2.fastq.gz \
    -auto_detect_adapter \
    -ENCODE3
atac_dnase_pipelines -species hg19 -nth $SLURM_CPUS_PER_TASK \
    -type atac-seq \
    -fastq1_1 sample3_R1.fastq.gz \
    -fastq1_2 sample3_R2.fastq.gz \
    -auto_detect_adapter \
    -ENCODE3

Submit this job using the swarm command.

swarm -f atac_dnase_pipelines.swarm -g 40 -t 16 --gres=lscratch:50 --time=24:00:00 --module atac_dnase_pipelines
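For many samples, the swarmfile can be generated with a small loop rather than written by hand. A sketch, assuming paired files named like sampleN_R1.fastq.gz / sampleN_R2.fastq.gz (the `touch` lines only create empty placeholders so the loop has input; with real data, skip them and run the loop in your data directory):

```shell
# Demo placeholders only -- remove these lines for real data.
touch sample1_R1.fastq.gz sample1_R2.fastq.gz
touch sample2_R1.fastq.gz sample2_R2.fastq.gz

shopt -s nullglob   # emit nothing if no files match the glob
for r1 in *_R1.fastq.gz; do
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"   # matching mate file
    # Escape $SLURM_CPUS_PER_TASK so it is expanded at run time,
    # inside each swarm subjob, not when the swarmfile is written.
    echo "atac_dnase_pipelines -species hg19 -nth \$SLURM_CPUS_PER_TASK -type atac-seq -fastq1_1 $r1 -fastq1_2 $r2 -auto_detect_adapter -ENCODE3"
done > atac_dnase_pipelines.swarm
```

Adjust the glob and mate-file suffix to your naming scheme before submitting with the swarm command shown above.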