encode-atac-seq-pipeline on Biowulf

From the ENCODE documentation:

This pipeline is designed for automated end-to-end quality control and processing of ATAC-seq or DNase-seq data. The pipeline can be run on compute clusters with job submission engines or standalone machines. It inherently makes use of parallelized/distributed computing.
Important Notes

Interactive job
Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --cpus-per-task=12 --mem=16g --gres=lscratch:20
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ wd=$PWD  # so we can copy results back later
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load encode-atac-seq-pipeline
[user@cn3144]$ cp -Lr $EASP_TEST_DATA/* .
[user@cn3144]$ tree
|-- [user    738]  ENCSR889WQX.json
`-- [user   4.0K]  input
    |-- [user   4.0K]  rep1
    |   |-- [user   213M]  ENCFF439VSY_5M.fastq.gz
    |   `-- [user   213M]  ENCFF683IQS_5M.fastq.gz
    `-- [user   4.0K]  rep2
        |-- [user   210M]  ENCFF463QCX_5M.fastq.gz
        `-- [user   211M]  ENCFF992TSA_5M.fastq.gz

WDL-based workflows need a JSON file that defines the inputs and settings for a workflow run. This example uses the 50 nt data from ENCODE sample ENCSR889WQX (mouse frontal cortex), which includes 2 fastq files for each of 2 replicates.
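Because the run is driven entirely by this JSON file, it can be worth confirming that the paths it names exist before starting a long job. The following is a minimal sketch, not part of the pipeline: `json_string_value` is a hypothetical helper that pulls a quoted string value for a key such as `atac.genome_tsv` out of a file with one key per line. It is a line-based `sed` match, not a real JSON parser, and assumes the key and its value sit on the same line.

```shell
#!/bin/bash
# Hypothetical helper (not provided by the pipeline module):
# print the quoted string value of KEY from a one-key-per-line JSON file.
json_string_value() {
    local file=$1 key=$2
    # Escape the dots in the key so sed treats them literally
    local key_re="${key//./\\.}"
    sed -n "s/.*\"${key_re}\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$file"
}

# Only runs when the example input file is present in the current directory
if [ -f ENCSR889WQX.json ]; then
    tsv=$(json_string_value ENCSR889WQX.json atac.genome_tsv)
    if [ -f "$tsv" ]; then
        echo "genome TSV found: $tsv"
    else
        echo "genome TSV missing: $tsv" >&2
    fi
fi
```

Values that are not quoted strings (for example `true` or a number) are deliberately not matched, so the helper prints nothing for them.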

[user@cn3144]$ cat ENCSR889WQX.json
    "atac.pipeline_type" : "atac",
    "atac.genome_tsv" : "/fdb/encode-atac-seq-pipeline/mm10/mm10.tsv",
    "atac.fastqs" : [
    "atac.paired_end" : false,
    "atac.multimapping" : 4,
    "atac.trim_adapter.auto_detect_adapter" : true,
    "atac.smooth_win" : 73,
    "atac.enable_idr" : true,
    "atac.idr_thresh" : 0.05,
    "" : "ENCSR889WQX (subsampled to 5M reads)",
    "atac.qc_report.desc" : "ATAC-seq on Mus musculus C57BL/6 frontal cortex adult"

In this example the pipeline runs locally only, i.e. it does not submit tasks as Slurm jobs.

[user@cn3144]$ java -Dconfig.file=$EASP_BACKEND_CONF \
                      -Dbackend.default=Local \
                      -jar $CROMWELL_JAR run -i ENCSR889WQX.json $EASP_WDL
[...much output...]
[user@cn3144]$ ls -lh
drwxrwxr-x 3 user group 4.0K Sep 18 17:46 cromwell-executions
drwxrwxrwx 2 user group 4.0K Sep 18 19:39 cromwell-workflow-logs
-rw-r--r-- 1 user group  738 Sep 18 15:47 ENCSR889WQX.json 
drwxr-xr-x 4 user group 4.0K Sep 18 15:45 input

The pipeline outputs are written to cromwell-executions under an idiosyncratic naming scheme. This directory contains many hard links, which must be preserved when copying back to /data; otherwise the size of the folder increases substantially.
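To get oriented in the output tree, the per-task directories can be listed with `find`. This sketch assumes the usual Cromwell layout of `<root>/<workflow name>/<run id>/call-<task>/`; `list_call_dirs` is a hypothetical helper name, and the depth of 3 follows from that assumed layout rather than from anything specific to this installation.

```shell
#!/bin/bash
# Hypothetical helper: list the per-task "call-*" directories below a
# Cromwell executions root.
# Assumed layout: <root>/<workflow name>/<run id>/call-<task>/...
list_call_dirs() {
    local root=${1:-cromwell-executions}
    find "$root" -mindepth 3 -maxdepth 3 -type d -name 'call-*' | sort
}

# Only runs when an executions directory exists in the current directory
if [ -d cromwell-executions ]; then
    list_call_dirs cromwell-executions
fi
```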

[user@cn3144]$ cp -ra cromwell-executions $wd

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
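The hard-link caveat for cromwell-executions can be checked on throwaway files. The sketch below creates a small directory containing one hard-linked file, copies it once with `cp -a` and once with plain `cp -r`, and compares the link counts. It assumes GNU coreutils (`stat -c %h`), as found on Linux; all file names are made up for the demonstration and nothing here touches pipeline output.

```shell
#!/bin/bash
# Demonstration on temporary files only
demo=$(mktemp -d)
mkdir "$demo/src"
head -c 1048576 /dev/zero > "$demo/src/a.bin"   # 1 MiB file
ln "$demo/src/a.bin" "$demo/src/b.bin"          # second name for the same inode

cp -a "$demo/src" "$demo/archive_copy"   # -a preserves hard links
cp -r "$demo/src" "$demo/plain_copy"     # -r alone copies each name separately

# Link count per file: 2 means still hard-linked, 1 means duplicated
links_archive=$(stat -c %h "$demo/archive_copy/a.bin")
links_plain=$(stat -c %h "$demo/plain_copy/a.bin")
echo "cp -a link count: $links_archive"
echo "cp -r link count: $links_plain"

# Disk usage of the two copies, for comparison
du -sh "$demo/archive_copy" "$demo/plain_copy"
rm -rf "$demo"
```

With plain `cp -r`, every hard-linked name becomes an independent file, which is why the copied folder grows.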

Batch job
Create a batch input file (a shell script) that runs the pipeline on a JSON input file. For example:

#! /bin/bash

module load encode-atac-seq-pipeline || exit 1

wd=$PWD  # remember the submission directory so results can be copied back
cd /lscratch/$SLURM_JOB_ID

mkdir input
cp -rL $EASP_TEST_DATA/* .

java -Dconfig.file=$EASP_BACKEND_CONF \
     -Dbackend.default=Local \
     -jar $CROMWELL_JAR run -i ENCSR889WQX.json $EASP_WDL
rc=$?  # save the pipeline's exit status for the final exit

# need the -a since there are a lot of hard links in the executions
# directory
cp -ar cromwell-executions $wd/results

exit $rc

Submit this job using the Slurm sbatch command, giving the batch input file as the last argument after the options shown here.

sbatch --time=4:00:00 --cpus-per-task=12 --mem=16g --gres=lscratch:50