encode-atac-seq-pipeline on Biowulf

From the ENCODE documentation:

This pipeline is designed for automated end-to-end quality control and processing of ATAC-seq or DNase-seq data. The pipeline can be run on compute clusters with job submission engines or on stand-alone machines. It inherently makes use of parallelized/distributed computing.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:


[user@biowulf]$ sinteractive --cpus-per-task=12 --mem=16g --gres=lscratch:20
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ wd=$PWD  # so we can copy results back later
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load encode-atac-seq-pipeline
[user@cn3144]$ cp -Lr $EASP_TEST_DATA/* .
[user@cn3144]$ tree
.
|-- [user    738]  ENCSR889WQX.json
`-- [user   4.0K]  input
    |-- [user   4.0K]  rep1
    |   |-- [user   213M]  ENCFF439VSY_5M.fastq.gz
    |   `-- [user   213M]  ENCFF683IQS_5M.fastq.gz
    `-- [user   4.0K]  rep2
        |-- [user   210M]  ENCFF463QCX_5M.fastq.gz
        `-- [user   211M]  ENCFF992TSA_5M.fastq.gz

WDL-based workflows need a JSON file to define the inputs and settings for a workflow run. In this example we will use the 50 nt data from ENCODE sample ENCSR889WQX (mouse frontal cortex), which includes two fastq files for each of two replicates.

[user@cn3144]$ cat ENCSR889WQX.json
{
    "atac.pipeline_type" : "atac",
    "atac.genome_tsv" : "/fdb/encode-atac-seq-pipeline/mm10/mm10.tsv",
    "atac.fastqs" : [
        [
            ["input/rep1/ENCFF683IQS_5M.fastq.gz"],
            ["input/rep1/ENCFF439VSY_5M.fastq.gz"]
        ],
        [
            ["input/rep2/ENCFF992TSA_5M.fastq.gz"],
            ["input/rep2/ENCFF463QCX_5M.fastq.gz"]
        ]
    ],
    "atac.paired_end" : false,
    "atac.multimapping" : 4,
    "atac.trim_adapter.auto_detect_adapter" : true,
    "atac.smooth_win" : 73,
    "atac.enable_idr" : true,
    "atac.idr_thresh" : 0.05,
    "atac.qc_report.name" : "ENCSR889WQX (subsampled to 5M reads)",
    "atac.qc_report.desc" : "ATAC-seq on Mus musculus C57BL/6 frontal cortex adult"
}
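
Cromwell will fail early with a parse error if this file is not valid JSON. As a quick sanity check before launching the workflow (a minimal sketch, assuming a system python is available on the node; json.tool ships with the standard library):

[user@cn3144]$ python -m json.tool ENCSR889WQX.json > /dev/null && echo "JSON OK"
JSON OK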

In this example the pipeline is run locally only, i.e. it will not submit its tasks as Slurm jobs:

[user@cn3144]$ java -Dconfig.file=$EASP_BACKEND_CONF \
                      -Dbackend.default=Local \
                      -jar $CROMWELL_JAR run -i ENCSR889WQX.json $EASP_WDL
[...much output...]
[user@cn3144]$ ls -lh
drwxrwxr-x 3 user group 4.0K Sep 18 17:46 cromwell-executions
drwxrwxrwx 2 user group 4.0K Sep 18 19:39 cromwell-workflow-logs
-rw-r--r-- 1 user group  738 Sep 18 15:47 ENCSR889WQX.json 
drwxr-xr-x 4 user group 4.0K Sep 18 15:45 input

The pipeline outputs can be found under cromwell-executions, which uses an idiosyncratic naming scheme. This directory contains a lot of hard links, so links have to be preserved when copying the results back to /data; otherwise the size of the folder will increase substantially.
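
To get a sense of how many hard links are involved, you can count the multiply-linked files with standard GNU find (shown purely as an illustration):

[user@cn3144]$ find cromwell-executions -type f -links +1 | wc -l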


[user@cn3144]$ cp -ra cromwell-executions $wd

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$

Batch job
Most jobs should be run as batch jobs.

Create a batch script (e.g. encode-atac-seq-pipeline.sh) along the lines of the following example:

#! /bin/bash

wd=$PWD
module load encode-atac-seq-pipeline || exit 1

cd /lscratch/$SLURM_JOB_ID || exit 1

# copy the test data (input fastqs and JSON) from the module's test data
# directory; -L dereferences symlinks so we get real copies in lscratch
cp -rL $EASP_TEST_DATA/* .

java -Dconfig.file=$EASP_BACKEND_CONF \
     -Dbackend.default=Local \
     -jar $CROMWELL_JAR run -i ENCSR889WQX.json $EASP_WDL
rc=$?

# need the -a since there are a lot of hard links in the executions
# directory
cp -ar cromwell-executions $wd/results

exit $rc

Submit this job using the Slurm sbatch command.

sbatch --time=4:00:00 --cpus-per-task=12 --mem=16g --gres=lscratch:50 encode-atac-seq-pipeline.sh
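
After submission, the job can be followed with standard Slurm commands (the job ID below is illustrative):

[user@biowulf]$ squeue -u $USER
[user@biowulf]$ tail -f slurm-46116226.out   # Cromwell log; slurm-<jobid>.out is the default output file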