PEPATAC: a modular pipeline for ATAC-seq data processing
PEPATAC is a robust pipeline for Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) built on a loosely coupled modular framework. It may be easily applied to ATAC-seq projects of any size, from one-off experiments to large-scale sequencing projects. It is optimized on unique features of ATAC-seq data to be fast and accurate and provides several unique analytical approaches.
References:
- Jason P. Smith, M. Ryan Corces, Jin Xu, Vincent P. Reuter, Howard Y. Chang, and Nathan C. Sheffield,
PEPATAC: An optimized pipeline for ATAC-seq data analysis with serial alignments
bioRxiv preprint doi: https://doi.org/10.1101/2020.10.21.3470.
Documentation
Important Notes
- Module Name: PEPATAC (see the modules page for more information)
- Unusual environment variables set
- PEPATAC_HOME installation directory
- PEPATAC_BIN executable directory
- PEPATAC_SRC source code directory
- PEPATAC_DATA sample data directory
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --cpus-per-task=8 --mem=8g --gres=lscratch:20 --time=36:00:00 [user@cn3200 ~]$ module load PEPATAC/0.12.2 [+] Loading pepatac 0.12.2 [+] Loading singularity 3.8.5-1 on cn3089 [user@cn3200 ~]$ pepatac -h usage: pepatac.py [-h] [-R] [-N] [-D] [-F] [-T] [--silent] [--verbosity V] [--logdev] [-C CONFIG_FILE] [-O PARENT_OUTPUT_FOLDER] [-M MEMORY_LIMIT] [-P NUMBER_OF_CORES] [-S SAMPLE_NAME] -I INPUT_FILES [INPUT_FILES ...] [-I2 [INPUT_FILES2 [INPUT_FILES2 ...]]] -G GENOME_ASSEMBLY [-Q SINGLE_OR_PAIRED] [--trimmer {trimmomatic,pyadapt,skewer}] [--aligner {bowtie2,bwa}] [--deduplicator {picard,samblaster,samtools}] [--peak-caller {fseq,fseq2,genrich,hmmratac,homer,macs2}] [-gs GENOME_SIZE] [--peak-type {fixed,variable}] [--extend EXTEND] [--frip-ref-peaks FRIP_REF_PEAKS] [--motif] [--sob] [--no-scale] [--prioritize] [--keep] [--noFIFO] [--lite] [--skipqc] [--prealignment-names PREALIGNMENT_NAMES [PREALIGNMENT_NAMES ...]] [--prealignment-index PREALIGNMENT_INDEX [PREALIGNMENT_INDEX ...]] --genome-index GENOME_INDEX --chrom-sizes CHROM_SIZES [--TSS-name TSS_NAME] [--blacklist BLACKLIST] [--anno-name ANNO_NAME] [--search-file SEARCH_FILE] [-V] PEPATAC version 0.12.2 optional arguments: -h, --help show this help message and exit -R, --recover Overwrite locks to recover from previous failed run -N, --new-start Overwrite all results to start a fresh run -D, --dirty Don't auto-delete intermediate files -F, --force-follow Always run 'follow' commands -T, --testmode Only print commands, don't run --silent Silence logging. Overrides verbosity. --verbosity V Set logging level (1-5 or logging module level name) --logdev Expand content of logging message format. -C CONFIG_FILE, --config CONFIG_FILE Pipeline configuration file (YAML). Relative paths are with respect to the pipeline script. -O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER Parent output directory of project -M MEMORY_LIMIT, --mem MEMORY_LIMIT Memory limit for processes accepting such. Default units are megabytes unless specified using the suffix [K|M|G|T]. -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES Number of cores for parallelized processes -S SAMPLE_NAME, --sample-name SAMPLE_NAME Name for sample to run -I2 [INPUT_FILES2 [INPUT_FILES2 ...]], --input2 [INPUT_FILES2 [INPUT_FILES2 ...]] Secondary input files, such as read2 -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED Single- or paired-end sequencing protocol --trimmer {trimmomatic,pyadapt,skewer} Name of read trimming program. --aligner {bowtie2,bwa} Name of read aligner. --deduplicator {picard,samblaster,samtools} Name of deduplicator program. --peak-caller {fseq,fseq2,genrich,hmmratac,homer,macs2} Name of peak caller. -gs GENOME_SIZE, --genome-size GENOME_SIZE Effective genome size. It can be 1.0e+9 or 1000000000: e.g. human (2.7e9), mouse (1.87e9), C. elegans (9e7), fruitfly (1.2e8). Default:2.7e9 --peak-type {fixed,variable} Call variable or fixed width peaks. Fixed width requires MACS2. --extend EXTEND How far to extend fixed width peaks up and downstream. --frip-ref-peaks FRIP_REF_PEAKS Path to reference peak set (BED format) for calculating FRiP. --motif Perform motif enrichment analysis. --sob Use seqOutBias to produce signal tracks, incorporate mappability information, and account for Tn5 bias. --no-scale Do not scale signal tracks: Default is to scale by read count. If using seqOutBias, scales by the expected/observed cut frequency. --prioritize Plot cFRiF/FRiF using mutually exclusive priority ranked features based on the order of feature appearance in the feature annotation asset. --keep Enable this flag to keep prealignment BAM files. --noFIFO Do NOT use named pipes during prealignments. --lite Only keep minimal, essential output to conserve disk space. --skipqc Skip FastQC. Useful for bugs in FastQC that appear with some sequence read files. --prealignment-names PREALIGNMENT_NAMES [PREALIGNMENT_NAMES ...] Space-delimited list of prealignment genome names to align to before primary alignment. --prealignment-index PREALIGNMENT_INDEX [PREALIGNMENT_INDEX ...] Space-delimited list of prealignment genome name and index files delimited by an equals sign to align to before primary alignment. e.g. rCRSd=/path/to/bowtie2_index/. --genome-index GENOME_INDEX Path to primary genome index file. Either a bowtie2 or bwa index. --chrom-sizes CHROM_SIZES Path to primary genome chromosome sizes file. --TSS-name TSS_NAME Path to TSS annotation file. --blacklist BLACKLIST Path to genomic region blacklist file. --anno-name ANNO_NAME Path to reference annotation file (BED format) for calculating FRiF. --search-file SEARCH_FILE Required for seqOutBias (--sob). Path to tallymer index search file built with the same read length as the input. -V, --version show program's version number and exit required named arguments: -I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...] One or more primary input files -G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY Identifier for genome assemblyDescribed below are the steps to run PEPATAC on the tutorial example.
1. Set up folders:
[user@cn3200 ~]$ mkdir pepatac_tutorial [user@cn3200 ~]$ export TUTORIAL=$PWD/pepatac_tutorial [user@cn3200 ~]$ cd $TUTORIAL [user@cn3200 ~]$ mkdir data genomes processed templates tools [user@cn3200 ~]$ cd $TUTORIAL/tools [user@cn3200 ~]$ git clone https://github.com/databio/pepatac.git [user@cn3200 ~]$ cd $TUTORIAL/tools/pepatac [user@cn3200 ~]$ git checkout tags/v0.12.2 Previous HEAD position was 7616783... Merge pull request #199 from databio/dev HEAD is now at 1348557... Merge pull request #181 from databio/dev2. Initialize refgenie and download assets:
[user@cn3200 ~]$ cd $TUTORIAL/genomes [user@cn3200 ~]$ export REFGENIE=$TUTORIAL/genomes/genome_config.yaml [user@cn3200 ~]$ refgenie init -c $REFGENIE [user@cn3200 ~]$ refgenie pull hg38/fasta hg38/bowtie2_index hg38/refgene_anno hg38/ensembl_gtf hg38/ensembl_rb [user@cn3200 ~]$ refgenie build hg38/feat_annotation [user@cn3200 ~]$ refgenie pull rCRSd/fasta [user@cn3200 ~]$ refgenie pull rCRSd/bowtie2_indexAdd the export REFGENIE line to your .bashrc or .profile to ensure it persists.
3. Download tutorial read files:
[user@cn3200 ~]$ wget http://big.databio.org/pepatac/tutorial1_r1.fastq.gz [user@cn3200 ~]$ wget http://big.databio.org/pepatac/tutorial1_r2.fastq.gz [user@cn3200 ~]$ wget http://big.databio.org/pepatac/tutorial2_r1.fastq.gz [user@cn3200 ~]$ wget http://big.databio.org/pepatac/tutorial2_r2.fastq.gz [user@cn3200 ~]$ mv *.fastq.gz $TUTORIAL/tools/pepatac/examples/data/4. Configure project files:
[user@cn3200 ~]$ cd $TUTORIAL [user@cn3200 ~]$ export DIVCFG=$TUTORIAL/compute_config.yaml [user@cn3200 ~]$ divvy init --config $DIVCFGAdd the export DIVCFG line to your ~/.bashrc to ensure it persists.
[user@cn3200 ~]$ cd $TUTORIAL/templates [user@cn3200 ~]$ cp $PEPATAC_CONFIG/localhost_template.sub .The PEPATAC pipeline is divided into two major parts:
1) first, it processes each sample individually at the sample level;
2) once sample processing is complete, the project-level part aggregates, analyzes, and summarizes the results across samples.
5. Run the PEPATAC pipeline at the sample level using looper:
[user@cn3200 ~]$ cd $TUTORIAL/processed [user@cn3200 ~]$ looper run -c $PEPATAC_EXAMPLES/tutorial/.looper_tutorial_refgenie.yaml Looper version: 2.0.0 Command: run Pipeline name mismatch detected. Pipeline interface: PEPATAC Output schema: None Defaulting to pipeline_interface value. ## [1 of 2] sample: tutorial1; pipeline: PEPATAC Calling pre-submit function: refgenconf.looper_refgenie_populate Refgenie asset (hg38/bowtie2_index) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (hg38/ensembl_gtf) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (hg38/ensembl_rb) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (hg38/fasta) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (hg38/feat_annotation) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (hg38/refgene_anno) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (rCRSd/bowtie2_index) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Refgenie asset (rCRSd/fasta) tag not specified in `refgenie.tag_overrides` section. Using the default tag: default Writing script to /data/denisovga/PEPATAC/pepatac_tutorial/processed/submission/PEPATAC_tutorial1.sub Job script (n=1; 0.03Gb): /data/denisovga/PEPATAC/pepatac_tutorial/processed/submission/PEPATAC_tutorial1.sub Compute node: cn0811 Start time: 2025-03-06 12:40:04 No pipestat output schema was supplied to PipestatManager. Initializing results file '/data/denisovga/PEPATAC/pepatac_tutorial/processed/results_pipeline/tutorial1/stats.yaml' ... ... Looper finished Samples valid for job generation: 2 of 2The previous command produces two sbatch scripts.
Finally, submit these scripts to the cluster:
[user@cn3200 ~]$ sbatch $TUTORIAL/processed/submission/PEPATAC_tutorial1.sub [user@cn3200 ~]$ sbatch $TUTORIAL/processed/submission/PEPATAC_tutorial2.sub6. When the two submitted jobs are completed, run the PEPATAC pipeline at the project level using looper:
[user@cn3200 ~]$ looper runp -c $PEPATAC_EXAMPLES/tutorial/.looper_tutorial_refgenie.yaml Looper version: 2.0.0 Command: runp ## [1 of 1] project: PEPATAC_tutorial; pipeline: PEPATAC Writing script to /data/denisovga/PEPATAC/pepatac_tutorial/processed/submission/PEPATAC_PEPATAC_tutorial.sub Job script (n=1; 0.00Gb): /data/denisovga/PEPATAC/pepatac_tutorial/processed/submission/PEPATAC_PEPATAC_tutorial.sub Compute node: cn0811 Start time: 2025-03-09 10:28:44 No pipestat output schema was supplied to PipestatManager. ### Pipeline run code and environment: * Command: `/data/nammid2/pepatac_tutorial/tools/pepatac/pipelines/pepatac_collator.py --config /usr/local/apps/PEPATAC/0.12.2/pepatac-0.12.2/examples/tutorial/tutorial_refgenie_pep_config.yaml -O /data/nammid2/pepatac_tutorial/processed/results_pipeline -P 1 -M 16000 -n PEPATAC_tutorial -r /data/nammid2/pepatac_tutorial/processed/results_pipeline` * Compute host: `cn0006` * Working dir: `/vf/users/nammid2/pepatac_tutorial/processed` * Outfolder: `/data/nammid2/pepatac_tutorial/processed/results_pipeline/summary/` * Log file: `/data/nammid2/pepatac_tutorial/processed/results_pipeline/summary/PEPATAC_log.md` * Start time: (03-09 10:28:45) elapsed: 0.0 _TIME_ ### Version log: * Python version: `3.10.14` * Pypiper dir: `/opt/conda/envs/pepatac/lib/python3.10/site-packages/pypiper` * Pypiper version: `0.14.4` * Pipeline dir: `/vf/users/nammid2/pepatac_tutorial/tools/pepatac/pipelines` * Pipeline version: `0.0.7` * Pipeline hash: `9d4912c9e7d46b733e9475ee9bfde308cf68092f` * Pipeline branch: `* (HEAD detached at v0.12.2)` * Pipeline date: `2025-02-13 07:46:14 -0500` * Pipeline diff: `2 files changed, 0 insertions(+), 0 deletions(-)` ... ............................................................ ............................................................ ............................................................ ............................................................ ............................................................ ............................................................ ............................................................ ............................................................ ............................................................ ...End the interactive session:
[user@cn3200 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$