From the ENCODE documentation:
This pipeline is designed for automated end-to-end quality control and processing of ATAC-seq or DNase-seq data. The pipeline can be run on compute clusters with job submission engines or on standalone machines. It inherently makes use of parallelized/distributed computing.
- Description on the ENCODE site
- On GitHub
- Input Json description
- Output description
- Module Name: encode-atac-seq-pipeline (see the modules page for more information)
- This pipeline has undergone frequent updates with backwards-incompatible changes to execution and configuration. If in doubt, please refer to the upstream documentation.
- For local runs the CPU and memory consumption varies over time, and its magnitude depends on the number of replicates and the size of the input data. For the example below (2 fastq files per replicate, 2 replicates), 10-12 CPUs and 16GB of memory were sufficient.
- As of version 2.0, we no longer provide a Slurm backend configuration. We recommend use of this pipeline in 'local' backend mode. If you believe you have a use for Slurm backend execution, please get in touch with HPC staff.
- Environment variables set (not all of them are set in each version due to significant changes in the pipeline; a quick way to list them is shown after this list):
  - $EASP_BACKEND_CONF: configuration for the local backend
  - $EASP_WFOPTS: singularity backend options (versions > 1.0 only)
  - $EASP_WDL: WDL file defining the workflow
  - $EASP_TEST_DATA: input data for the example below
- Reference data in /fdb/encode-atac-seq-pipeline/<version>
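Since the set of variables differs between versions, it can be worth checking what a given module version actually exports. A minimal sketch (module version as used in the example below; the output will vary by version):

[user@biowulf]$ module load encode-atac-seq-pipeline/2.2.0
[user@biowulf]$ printenv | grep '^EASP_'    # list all EASP_* variables this version sets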
Allocate an interactive session and run the program. Sample session:
A note about resource allocation:
- Set the number of concurrent tasks (NUM_CONCURRENT_TASKS) to the number of replicates
- Set "atac.bowtie2_cpu" in the input json to the number of CPUs you want bowtie2 to use (usually 8 or so)
- Allocate NUM_CONCURRENT_TASKS * atac.bowtie2_cpu CPUs
- Allocate 20GB * NUM_CONCURRENT_TASKS of memory for big samples and 10GB * NUM_CONCURRENT_TASKS for small samples
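As a worked example, the sizing for the two-replicate run below comes out as follows (a sketch; the variable names are just for illustration, and the heavily subsampled test data gets by with less):

# Illustrative sizing: 2 replicates, bowtie2 with 8 CPUs each
NUM_CONCURRENT_TASKS=2        # one concurrent task per replicate
BOWTIE2_CPU=8                 # value of "atac.bowtie2_cpu" in the input json
echo "CPUs:           $(( NUM_CONCURRENT_TASKS * BOWTIE2_CPU ))"   # -> 16
echo "Memory (big):   $(( NUM_CONCURRENT_TASKS * 20 ))GB"          # -> 40GB
echo "Memory (small): $(( NUM_CONCURRENT_TASKS * 10 ))GB"          # -> 20GB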
WDL-based workflows need a json file to define the input and settings for a workflow run. In this example we will use the 76nt data from ENCODE sample ENCSR356KRQ (keratinocyte), which includes 2 and 6 fastq file pairs for replicates 1 and 2, respectively.
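If you edit the json by hand, a quick syntax check before launching can save a failed run. A minimal sketch using Python's stock json.tool (the filename is the example file used in the session below):

# Fails with a parse error message if the json is malformed
python -m json.tool ENCSR356KRQ_subsampled.json.2.2.0 > /dev/null && echo "json OK"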
- 2.2.0: Continues to use the v3 annotation. However, caper changed significantly, so you should back up your old caper configuration and create fresh config files for this version.
[user@biowulf]$ sinteractive --cpus-per-task=8 --mem=20g --gres=lscratch:30
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144]$ wd=$PWD  # so we can copy results back later
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load encode-atac-seq-pipeline/2.2.0
[user@cn3144]$ cp -Lr ${EASP_TEST_DATA:-none}/* .
[user@cn3144]$ tree
.
├── ENCSR356KRQ_subsampled.json.2.2.0
└── input
    └── ENCSR356KRQ
        ├── ENCFF007USV.subsampled.400.fastq.gz
        ├── ENCFF031ARQ.subsampled.400.fastq.gz
        ├── ENCFF106QGY.subsampled.400.fastq.gz
        ├── ENCFF193RRC.subsampled.400.fastq.gz
        ├── ENCFF248EJF.subsampled.400.fastq.gz
        ├── ENCFF341MYG.subsampled.400.fastq.gz
        ├── ENCFF366DFI.subsampled.400.fastq.gz
        ├── ENCFF368TYI.subsampled.400.fastq.gz
        ├── ENCFF573UXK.subsampled.400.fastq.gz
        ├── ENCFF590SYZ.subsampled.400.fastq.gz
        ├── ENCFF641SFZ.subsampled.400.fastq.gz
        ├── ENCFF734PEQ.subsampled.400.fastq.gz
        ├── ENCFF751XTV.subsampled.400.fastq.gz
        ├── ENCFF859BDM.subsampled.400.fastq.gz
        ├── ENCFF886FSC.subsampled.400.fastq.gz
        ├── ENCFF927LSG.subsampled.400.fastq.gz
        └── hg38.tsv

[user@cn3144]$ cat ENCSR356KRQ_subsampled.json.2.2.0
{
    "atac.pipeline_type" : "atac",
    "atac.genome_tsv" : "/fdb/encode-atac-seq-pipeline/v3/hg38/hg38.tsv",
    "atac.fastqs_rep1_R1" : [
        "input/ENCSR356KRQ/ENCFF341MYG.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF106QGY.subsampled.400.fastq.gz"
    ],
    "atac.fastqs_rep1_R2" : [
        "input/ENCSR356KRQ/ENCFF248EJF.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF368TYI.subsampled.400.fastq.gz"
    ],
    "atac.fastqs_rep2_R1" : [
        "input/ENCSR356KRQ/ENCFF641SFZ.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF751XTV.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF927LSG.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF859BDM.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF193RRC.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF366DFI.subsampled.400.fastq.gz"
    ],
    "atac.fastqs_rep2_R2" : [
        "input/ENCSR356KRQ/ENCFF031ARQ.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF590SYZ.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF734PEQ.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF007USV.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF886FSC.subsampled.400.fastq.gz",
        "input/ENCSR356KRQ/ENCFF573UXK.subsampled.400.fastq.gz"
    ],
    "atac.paired_end" : true,
    "atac.auto_detect_adapter" : true,
    "atac.enable_xcor" : true,
    "atac.title" : "ENCSR356KRQ (subsampled 1/400)",
    "atac.description" : "ATAC-seq on primary keratinocytes in day 0.0 of differentiation"
}
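To verify that the fastqs are assigned to the intended replicates, you can query the json, for example with jq, if it is available on your system (key names as in the file above):

# Print the R1 fastq lists for replicates 1 and 2
jq '."atac.fastqs_rep1_R1", ."atac.fastqs_rep2_R1"' ENCSR356KRQ_subsampled.json.2.2.0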
In this example the pipeline will only be run locally, i.e. it will not submit tasks as Slurm jobs. To set up a config file for Slurm submission instead, follow the caper docs; this has to be done only once.
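For reference, caper stores its configuration as simple key=value lines in ~/.caper/default.conf. A minimal sketch of what a local-backend config might contain (illustrative only; run caper init local, as below, to generate the real file for your caper version):

# ~/.caper/default.conf -- illustrative sketch, not a verbatim file
backend=local
# further keys (temp/output directories, etc.) depend on the caper version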
[user@cn3144]$ [[ -d ~/.caper ]] && mv ~/.caper ~/caper.$(date +%F).bak  # back up old caper config
[user@cn3144]$ mkdir -p ~/.caper && caper init local
[user@cn3144]$ # note the need for --singularity in this version
[user@cn3144]$ caper run $EASP_WDL -i ENCSR356KRQ_subsampled.json.2.2.0 --singularity
[...much output...]
This workflow ran successfully. There is nothing to troubleshoot
This version of the pipeline comes with a tool, croo, to copy and organize pipeline output.
[user@cn3144]$ ls atac
a0fb9f58-ede3-4c02-9bcc-26d21ab5ccbb
[user@cn3144]$ croo --method copy --out-dir=${wd}/ENCSR356KRQ \
                   atac/a0fb9f58-ede3-4c02-9bcc-26d21ab5ccbb/metadata.json
Create a batch input file (e.g. encode-atac-seq-pipeline.sh). For example, the following batch job will run the pipeline locally (assuming the caper config file is set up correctly):
#!/bin/bash
wd=$PWD  # so results can be copied back from lscratch
module load encode-atac-seq-pipeline/2.2.0 || exit 1
cd /lscratch/$SLURM_JOB_ID
mkdir input
cp -rL $EASP_TEST_DATA/* .
# --singularity is required in this version (see the interactive example above)
caper run $EASP_WDL -i ENCSR356KRQ_subsampled.json.2.2.0 --singularity
rc=$?
croo --method copy --out-dir=${wd}/ENCSR356KRQ \
    atac/*/metadata.json
exit $rc
Submit this job using the Slurm sbatch command.
sbatch --time=4:00:00 --cpus-per-task=8 --mem=20g --gres=lscratch:50 encode-atac-seq-pipeline.sh
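Once submitted, the job can be monitored with standard Slurm tools, for example:

squeue -u $USER   # check the queue status of your jobs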