encode-atac-seq-pipeline on Biowulf

From the ENCODE documentation:

This pipeline is designed for automated end-to-end quality control and processing of ATAC-seq or DNase-seq data. The pipeline can be run on compute clusters with job submission engines or on stand-alone machines. It inherently makes use of parallelized/distributed computing.

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:


[user@biowulf]$ sinteractive --cpus-per-task=12 --mem=16g --gres=lscratch:20
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ wd=$PWD  # so we can copy results back later
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load encode-atac-seq-pipeline
[user@cn3144]$ cp -Lr $EASP_TEST_DATA/* .
[user@cn3144]$ tree
.
|-- [user    738]  ENCSR889WQX.json
`-- [user   4.0K]  input
    |-- [user   4.0K]  rep1
    |   |-- [user   213M]  ENCFF439VSY_5M.fastq.gz
    |   `-- [user   213M]  ENCFF683IQS_5M.fastq.gz
    `-- [user   4.0K]  rep2
        |-- [user   210M]  ENCFF463QCX_5M.fastq.gz
        `-- [user   211M]  ENCFF992TSA_5M.fastq.gz

WDL-based workflows need a JSON file to define the inputs and settings for a workflow run. In this example we will use the 50 nt data from ENCODE sample ENCSR889WQX (mouse frontal cortex), which includes two fastq files for each of two replicates.

[user@cn3144]$ cat ENCSR889WQX.json
{
    "atac.pipeline_type" : "atac",
    "atac.genome_tsv" : "/fdb/encode-atac-seq-pipeline/mm10/mm10.tsv",
    "atac.fastqs" : [
        [
            ["input/rep1/ENCFF683IQS_5M.fastq.gz"],
            ["input/rep1/ENCFF439VSY_5M.fastq.gz"]
        ],
        [
            ["input/rep2/ENCFF992TSA_5M.fastq.gz"],
            ["input/rep2/ENCFF463QCX_5M.fastq.gz"]
        ]
    ],
    "atac.paired_end" : false,
    "atac.multimapping" : 4,
    "atac.trim_adapter.auto_detect_adapter" : true,
    "atac.smooth_win" : 73,
    "atac.enable_idr" : true,
    "atac.idr_thresh" : 0.05,
    "atac.qc_report.name" : "ENCSR889WQX (subsampled to 5M reads)",
    "atac.qc_report.desc" : "ATAC-seq on Mus musculus C57BL/6 frontal cortex adult"
}
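
Cromwell will fail early with a parse error if this file is not valid JSON. As a quick sanity check before launching the workflow (a minimal sketch, assuming a system python is available on the node; json.tool ships with the standard library):

[user@cn3144]$ python -m json.tool ENCSR889WQX.json > /dev/null && echo "JSON OK"
JSON OK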

In this example the pipeline is run locally only, i.e. it will not submit its tasks as Slurm jobs:

[user@cn3144]$ java -Dconfig.file=$EASP_BACKEND_CONF \
                      -Dbackend.default=Local \
                      -jar $CROMWELL_JAR run -i ENCSR889WQX.json $EASP_WDL
[...much output...]
[user@cn3144]$ ls -lh
drwxrwxr-x 3 user group 4.0K Sep 18 17:46 cromwell-executions
drwxrwxrwx 2 user group 4.0K Sep 18 19:39 cromwell-workflow-logs
-rw-r--r-- 1 user group  738 Sep 18 15:47 ENCSR889WQX.json 
drwxr-xr-x 4 user group 4.0K Sep 18 15:45 input

The pipeline outputs can be found under cromwell-executions, which uses an idiosyncratic naming scheme. This directory contains a lot of hard links, so links have to be preserved when copying the results back to /data; otherwise the size of the folder will increase substantially.
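
To get a sense of how many hard links are involved, you can count the multiply-linked files with standard GNU find (shown purely as an illustration):

[user@cn3144]$ find cromwell-executions -type f -links +1 | wc -l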


[user@cn3144]$ cp -ra cromwell-executions $wd

[user@cn3144]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf]$

Batch job
Most jobs should be run as batch jobs.

Create a batch script (e.g. encode-atac-seq-pipeline.sh) along the lines of the following example:

#! /bin/bash

wd=$PWD
module load encode-atac-seq-pipeline || exit 1

cd /lscratch/$SLURM_JOB_ID || exit 1

# copy the test data (input fastqs and JSON) from the module's test data
# directory; -L dereferences symlinks so we get real copies in lscratch
cp -rL $EASP_TEST_DATA/* .

java -Dconfig.file=$EASP_BACKEND_CONF \
     -Dbackend.default=Local \
     -jar $CROMWELL_JAR run -i ENCSR889WQX.json $EASP_WDL
rc=$?

# need the -a since there are a lot of hard links in the executions
# directory
cp -ar cromwell-executions $wd/results

exit $rc

Submit this job using the Slurm sbatch command.

sbatch --time=4:00:00 --cpus-per-task=12 --mem=16g --gres=lscratch:50 encode-atac-seq-pipeline.sh
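
After submission, the job can be followed with standard Slurm commands (the job ID below is illustrative):

[user@biowulf]$ squeue -u $USER
[user@biowulf]$ tail -f slurm-46116226.out   # Cromwell log; slurm-<jobid>.out is the default output file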