encode-atac-seq-pipeline on Biowulf

From the ENCODE documentation:

This pipeline is designed for automated end-to-end quality control and processing of ATAC-seq or DNase-seq data. The pipeline can be run on compute clusters with job submission engines or on standalone machines. It inherently makes use of parallelized/distributed computing.
Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

A note about resource allocation:

WDL-based workflows need a JSON file to define the inputs and settings for a workflow run. In this example, we will use the 76nt data from ENCODE sample ENCSR356KRQ (keratinocyte). The sample has 2 replicates, with 2 and 6 pairs of fastq files (R1 and R2), respectively.

This version continues to use the v3 annotation. However, caper appears to have changed significantly, so you should back up your old caper configuration and create fresh config files for this version.

[user@biowulf]$ sinteractive --cpus-per-task=8 --mem=20g --gres=lscratch:30
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ wd=$PWD  # so we can copy results back later
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load encode-atac-seq-pipeline/2.2.0
[user@cn3144]$ cp -Lr ${EASP_TEST_DATA:-none}/* .
[user@cn3144]$ tree
.
├── ENCSR356KRQ_subsampled.json.2.2.0
└── input
    └── ENCSR356KRQ
        ├── ENCFF007USV.subsampled.400.fastq.gz
        ├── ENCFF031ARQ.subsampled.400.fastq.gz
        ├── ENCFF106QGY.subsampled.400.fastq.gz
        ├── ENCFF193RRC.subsampled.400.fastq.gz
        ├── ENCFF248EJF.subsampled.400.fastq.gz
        ├── ENCFF341MYG.subsampled.400.fastq.gz
        ├── ENCFF366DFI.subsampled.400.fastq.gz
        ├── ENCFF368TYI.subsampled.400.fastq.gz
        ├── ENCFF573UXK.subsampled.400.fastq.gz
        ├── ENCFF590SYZ.subsampled.400.fastq.gz
        ├── ENCFF641SFZ.subsampled.400.fastq.gz
        ├── ENCFF734PEQ.subsampled.400.fastq.gz
        ├── ENCFF751XTV.subsampled.400.fastq.gz
        ├── ENCFF859BDM.subsampled.400.fastq.gz
        ├── ENCFF886FSC.subsampled.400.fastq.gz
        ├── ENCFF927LSG.subsampled.400.fastq.gz
        └── hg38.tsv

[user@cn3144]$ cat ENCSR356KRQ_subsampled.json.2.2.0
{
    "atac.pipeline_type" : "atac",
    "atac.genome_tsv" : "/fdb/encode-atac-seq-pipeline/v3/hg38/hg38.tsv",
    "atac.fastqs_rep1_R1" : [
        "input/ENCSR356KRQ/ENCFF341MYG.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF106QGY.subsampled.400.fastq.gz"
    ],
    "atac.fastqs_rep1_R2" : [
        "input/ENCSR356KRQ/ENCFF248EJF.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF368TYI.subsampled.400.fastq.gz"
    ],
    "atac.fastqs_rep2_R1" : [
        "input/ENCSR356KRQ/ENCFF641SFZ.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF751XTV.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF927LSG.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF859BDM.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF193RRC.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF366DFI.subsampled.400.fastq.gz"
    ],
    "atac.fastqs_rep2_R2" : [
         "input/ENCSR356KRQ/ENCFF031ARQ.subsampled.400.fastq.gz",
         "input/ENCSR356KRQ/ENCFF590SYZ.subsampled.400.fastq.gz",
         "input/ENCSR356KRQ/ENCFF734PEQ.subsampled.400.fastq.gz",
         "input/ENCSR356KRQ/ENCFF007USV.subsampled.400.fastq.gz",
         "input/ENCSR356KRQ/ENCFF886FSC.subsampled.400.fastq.gz",
         "input/ENCSR356KRQ/ENCFF573UXK.subsampled.400.fastq.gz"
    ],
    "atac.paired_end" : true,
    "atac.auto_detect_adapter" : true,
    "atac.enable_xcor" : true,
    "atac.title" : "ENCSR356KRQ (subsampled 1/400)",
    "atac.description" : "ATAC-seq on primary keratinocytes in day 0.0 of differentiation"
}
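
Since all fastq paths come from the JSON file, it can be convenient to confirm that every listed file exists before starting a run. The following is a minimal sketch of such a check and is not part of the pipeline or caper. `check_fastqs` is a hypothetical helper name. It assumes that paths contain no whitespace or glob characters and that relative paths are resolved against the current directory, as in the listing above.

```shell
# Hypothetical helper (not provided by the pipeline module):
# print the number of fastq files named in an input JSON that do not exist,
# list them on stderr, and return non-zero if any are missing.
check_fastqs() (
    json=$1
    missing=0
    # Extract every quoted string ending in .fastq.gz and strip the quotes.
    # Word splitting is acceptable here because paths are assumed to have no whitespace.
    for path in $(grep -o '"[^"]*\.fastq\.gz"' "$json" | tr -d '"'); do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            missing=$((missing + 1))
        fi
    done
    echo "$missing"
    [ "$missing" -eq 0 ]
)
```

For example, `check_fastqs ENCSR356KRQ_subsampled.json.2.2.0 && echo "inputs ok"` could be run from the directory that holds the `input` folder before calling caper.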
        

In this example, the pipeline will only be run locally, i.e. it will not submit tasks as Slurm jobs. Follow the caper docs to set up a config file for Slurm submission. The config file has to be set up only once.

[user@cn3144]$ [[ -d ~/.caper ]] && mv ~/.caper ~/caper.$(date +%F).bak # back up old caper config
[user@cn3144]$ mkdir -p ~/.caper && caper init local
[user@cn3144]$ # note the need for --singularity in this version
[user@cn3144]$ caper run $EASP_WDL -i ENCSR356KRQ_subsampled.json.2.2.0 --singularity
[...much output...]
This workflow ran successfully. There is nothing to troubleshoot
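
The backup step in the session above uses a date-stamped name, so repeating it on the same day would move `~/.caper` inside the earlier backup directory instead of creating a second backup. The following is a minimal sketch of one way to avoid that. `backup_caper_conf` and the numbered `.N.bak` suffix are hypothetical choices for illustration, not something caper provides.

```shell
# Hypothetical helper: move an existing caper config directory aside under
# a dated name that is not already taken, then print the backup path.
backup_caper_conf() (
    conf=${1:-$HOME/.caper}
    [ -d "$conf" ] || exit 0            # nothing to back up
    base=$(dirname "$conf")/caper.$(date +%F)
    bak=$base.bak
    n=1
    while [ -e "$bak" ]; do             # pick a name that does not exist yet
        bak=$base.$n.bak
        n=$((n + 1))
    done
    mv "$conf" "$bak" && echo "$bak"
)
```

Calling `backup_caper_conf` without arguments acts on `~/.caper`; afterwards `mkdir -p ~/.caper && caper init local` can be run as shown in the session.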

This version of the pipeline comes with a tool, croo, to copy and organize pipeline output.

[user@cn3144]$ ls atac
a0fb9f58-ede3-4c02-9bcc-26d21ab5ccbb
[user@cn3144]$ croo --method copy --out-dir=${wd}/ENCSR356KRQ \
    atac/a0fb9f58-ede3-4c02-9bcc-26d21ab5ccbb/metadata.json
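
The workflow ID in the path above differs for every run. When several runs share the same `atac` directory, a glob such as `atac/*/metadata.json` matches more than one file. The following is a minimal sketch of a helper that picks the most recently modified workflow directory. `latest_metadata` is a hypothetical name, and the sketch assumes directory names without whitespace, which holds for the UUID-style names shown above.

```shell
# Hypothetical helper: print the metadata.json of the most recently modified
# workflow directory under a caper output directory (default: atac).
latest_metadata() (
    outdir=${1:-atac}
    # ls -t sorts newest first; -d lists the directories themselves
    newest=$(ls -td "$outdir"/*/ 2>/dev/null | head -n 1)
    [ -n "$newest" ] || { echo "no workflow directories in $outdir" >&2; exit 1; }
    meta=${newest%/}/metadata.json
    [ -f "$meta" ] || { echo "no metadata.json in $newest" >&2; exit 1; }
    echo "$meta"
)
```

The result can be passed to croo in place of the explicit path, e.g. `croo --method copy --out-dir=${wd}/ENCSR356KRQ "$(latest_metadata atac)"`.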

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. encode-atac-seq-pipeline.sh). For example, the following batch script runs the pipeline locally (assuming the caper config file is set up correctly):

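#!/bin/bash

wd=$PWD  # submission directory; results are copied back here
module load encode-atac-seq-pipeline/2.2.0 || exit 1

# work in node-local scratch space
cd /lscratch/$SLURM_JOB_ID || exit 1

# copy the test data and input JSON (-L follows symlinks)
cp -rL $EASP_TEST_DATA/* . || exit 1
# run locally; this version needs --singularity
caper run $EASP_WDL -i ENCSR356KRQ_subsampled.json.2.2.0 --singularity
rc=$?
# organize and copy the output back, then report caper's exit status
croo --method copy --out-dir=${wd}/ENCSR356KRQ \
    atac/*/metadata.json
exit $rc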

Submit this job using the Slurm sbatch command.

sbatch --time=4:00:00 --cpus-per-task=8 --mem=20g --gres=lscratch:50 encode-atac-seq-pipeline.sh