Biowulf High Performance Computing at the NIH
PEPATAC: a modular pipeline for ATAC-seq data processing

PEPATAC is a robust pipeline for Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) built on a loosely coupled modular framework. It may be easily applied to ATAC-seq projects of any size, from one-off experiments to large-scale sequencing projects. It is optimized on unique features of ATAC-seq data to be fast and accurate and provides several unique analytical approaches.

Documentation
Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive --cpus-per-task=16 --mem=32g --gres=lscratch:10
[user@cn3200 ~]$module load PEPATAC 
[+] Loading bedtools  2.27.1 
[+] Loading bowtie  2-2.3.4.1 
[+] Loading gcc  7.2.0  ... 
[+] Loading GSL 2.4 for GCC 7.2.0 ... 
[+] Loading openmpi 3.0.0  for GCC 7.2.0 
[+] Loading R 3.5.0_build2 
[-] Unloading gcc  7.2.0  ... 
[-] Unloading GSL 2.4 for GCC 7.2.0 ... 
[-] Unloading openmpi 3.0.0  for GCC 7.2.0 
[-] Unloading R 3.5.0_build2 
[+] Loading gcc  7.2.0  ... 
[+] Loading GSL 2.4 for GCC 7.2.0 ... 
[+] Loading openmpi 3.0.0  for GCC 7.2.0 
[+] Loading R 3.5.0_build2 
[+] Loading picard  2.17.11 
[+] Loading fastqc  0.11.6 
[+] Loading samtools 1.9  ... 
[+] Loading python 2.7  ... 
[+] Loading PEPATAC  0.8.3 

[user@cn3200 ~]$ git clone https://github.com/databio/pepatac
[user@cn3200 ~]$ cd pepatac
[user@cn3200 ~]$ pepatac.py  --help
usage: pepatac.py [-h] [-R] [-N] [-D] [-F] [-C CONFIG_FILE] -O
                  PARENT_OUTPUT_FOLDER [-M MEMORY_LIMIT] [-P NUMBER_OF_CORES]
                  -S SAMPLE_NAME -I INPUT_FILES [INPUT_FILES ...]
                  [-I2 [INPUT_FILES2 [INPUT_FILES2 ...]]] -G GENOME_ASSEMBLY
                  [-Q SINGLE_OR_PAIRED] [-gs GENOME_SIZE]
                  [--frip-ref-peaks FRIP_REF_PEAKS] [--TSS-name TSS_NAME]
                  [--anno-name ANNO_NAME] [--keep] [--noFIFO]
                  [--peak-caller {fseq,macs2}]
                  [--trimmer {trimmomatic,pyadapt,skewer}]
                  [--prealignments PREALIGNMENTS [PREALIGNMENTS ...]] [-V]

PEPATAC version 0.8.3

optional arguments:
  -h, --help            show this help message and exit
  -R, --recover         Overwrite locks to recover from previous failed run
  -N, --new-start       Overwrite all results to start a fresh run
  -D, --dirty           Don't auto-delete intermediate files
  -F, --force-follow    Always run 'follow' commands
  -C CONFIG_FILE, --config CONFIG_FILE
                        Pipeline configuration file (YAML). Relative paths are
                        with respect to the pipeline script.
  -M MEMORY_LIMIT, --mem MEMORY_LIMIT
                        Memory limit (in Mb) for processes accepting such
  -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES
                        Number of cores for parallelized processes
  -I2 [INPUT_FILES2 [INPUT_FILES2 ...]], --input2 [INPUT_FILES2 [INPUT_FILES2 ...]]
                        Secondary input files, such as read2
  -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED
                        Single- or paired-end sequencing protocol
  -gs GENOME_SIZE, --genome-size GENOME_SIZE
                        genome size for MACS2
  --frip-ref-peaks FRIP_REF_PEAKS
                        Reference peak set for calculating FRiP
  --TSS-name TSS_NAME   Name of TSS annotation file
  --anno-name ANNO_NAME
                        Name of reference bed file for calculating FRiF
  --keep                Keep prealignment BAM files
  --noFIFO              Do NOT use named pipes during prealignments
  --peak-caller {fseq,macs2}
                        Name of peak caller
  --trimmer {trimmomatic,pyadapt,skewer}
                        Name of read trimming program
  --prealignments PREALIGNMENTS [PREALIGNMENTS ...]
                        Space-delimited list of reference genomes to align to
                        before primary alignment.
  -V, --version         show program's version number and exit

required named arguments:
  -O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER
                        Parent output directory of project
  -S SAMPLE_NAME, --sample-name SAMPLE_NAME
                        Name for sample to run
  -I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...]
                        One or more primary input files
  -G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY
                        Identifier for genome assembly
Reference data are stored in /fdb/PEPATAC, in refgenie format.
Here is the command to run the pipeline using the human reference data:
[user@cn3200 ~]$ pipelines/pepatac.py --single-or-paired paired     --prealignments rCRSd human_repeats     --genome hg38     --sample-name test_hg --input $PEPATAC_DATA/test1_r1.fastq.gz --input2 $PEPATAC_DATA/test1_r2.fastq.gz  --genome-size hs     -O . -P 16
...
Processing with 5 cores...
[Name: chr13; Size: 114364328]
[Name: chr12; Size: 133275309]
[Name: chr11; Size: 135086622]
[Name: chr10; Size: 133797422]
[Name: chr17; Size: 83257441]
[Name: chr16; Size: 90338345]
[Name: chr15; Size: 101991189]
[Name: chr19; Size: 58617616]
[Name: chr18; Size: 80373285]
[Name: chr17_KI270729v1_random; Size: 280839]
[Name: chr16_KI270728v1_random; Size: 1872759]
[Name: chrUn_GL000214v1; Size: 137718]
[Name: chrUn_KI270744v1; Size: 168472]
[Name: chrUn_GL000216v2; Size: 176608]
[Name: chrUn_GL000195v1; Size: 182896]
[Name: chr9_KI270719v1_random; Size: 176845]
[Name: chr14; Size: 107043718]
[Name: chr17_GL000205v2_random; Size: 185591]
[Name: chr1_KI270709v1_random; Size: 66860]
[Name: chr14_KI270722v1_random; Size: 194050]
[Name: chrUn_KI270519v1; Size: 138126]
[Name: chr5; Size: 181538259]
[Name: chrUn_KI270438v1; Size: 112505]
[Name: chr14_GL000225v1_random; Size: 211173]
[Name: chrX; Size: 156040895]
[Name: chrUn_KI270467v1; Size: 3920]
[Name: chr22; Size: 50818468]
[Name: chr20; Size: 64444167]
[Name: chr21; Size: 46709983]
[Name: chrUn_KI270468v1; Size: 4055]
[Name: chr7; Size: 159345973]
[Name: chr6; Size: 170805979]
[Name: chr4; Size: 190214555]
[Name: chr3; Size: 198295559]
[Name: chr2; Size: 242193529]
[Name: chr1; Size: 248956422]
[Name: chrUn_KI270466v1; Size: 1233]
[Name: chr9; Size: 138394717]
[Name: chr8; Size: 145138636]
Discarding 156 chunk(s) of reads: ['chrUn_KI270748v1', 'chrUn_KI270337v1', 'chrUn_KI270749v1', 'chr1_KI270713v1_random', 'chrUn_KI270418v1', 'chrUn_KI270305v1', ...
...
Merging 39 files into output file: '/scratch/denisovga/PEPATAC/pepatac_test/test_hg/aligned_hg38_exact/test_hg_exact.bw'
Merging 39 files into output file: '/scratch/denisovga/PEPATAC/pepatac_test/test_hg/aligned_hg38/test_hg_smooth.bw'

...
INFO  @ Wed, 31 Oct 2018 14:44:22:
# Command line: callpeak -t /scratch/denisovga/PEPATAC/pepatac_test/test_hg/aligned_hg38_exact/test_hg_shift.bed -f BED -g hs --outdir /scratch/denisovga/PEPATAC/pepatac_test/test_hg/peak_calling_hg38 -n test_hg -q 0.01 --shift 0 --nomodel
# ARGUMENTS LIST:
# name = test_hg
# format = BED
# ChIP-seq file = ['/scratch/denisovga/PEPATAC/pepatac_test/test_hg/aligned_hg38_exact/test_hg_shift.bed']
# control file = None
# effective genome size = 2.70e+09
# band width = 300
# model fold = [5, 50]
# qvalue cutoff = 1.00e-02
# Larger dataset will be scaled towards smaller dataset.
# Range for calculating regional lambda is: 10000 bps
# Broad region calling is off
# Paired-End mode is off
...
> `Time`        0:08:37 PEPATAC _RES_
> `Success`     10-31-14:45:41  PEPATAC _RES_

##### [Epilogue:]
*   Total elapsed time:  0:08:37
*     Peak memory used:  0.04 GB
* Pipeline completed at:  (10-31 14:45:41) elapsed: 6.0 _TIME_

The following command works with mouse reference data:
[user@cn3200 ~]$  pipelines/pepatac.py --single-or-paired paired   --genome mm10    --sample-name test_mm --input $PEPATAC_DATA/test1_r1.fastq.gz --input2 $PEPATAC_DATA/test1_r2.fastq.gz  --genome-size hs     -O . -P 16 
...
Processing with 5 cores...
[Name: chrX; Size: 171031299]
[Name: chr13; Size: 120421639]
[Name: chr12; Size: 120129022]
[Name: chr10; Size: 130694993]
[Name: chr17; Size: 94987271]
[Name: chr19; Size: 61431566]
[Name: chrM; Size: 16299]
[Name: chr4; Size: 156508116]
[Name: chr3; Size: 160039680]
[Name: chr2; Size: 182113224]
[Name: chr8; Size: 129401213]

Discarding 54 chunk(s) of reads: ['chrUn_GL456379', 'chrUn_GL456387', 'chrUn_GL456382', 'chr4_JH584292_random', 'chrUn_GL456383', 'chr4_JH584293_random', 'chrY', 'chr11', 'chr16', 'chr15', 'chr18', 'chr4_JH584294_random', 'chr4_GL456350_random', 'chrY_JH584302_random', 'chrUn_GL456389', 'chrX_GL456233_random', 'chrUn_JH584304', 'chrUn_GL456366', 'chrUn_GL456367', 'chr1_GL456212_random', 'chr5_JH584299_random', 'chrUn_GL456360', 'chr5_JH584296_random', 'chrUn_GL456394', 'chr1_GL456210_random', 'chrUn_GL456372', 'chrUn_GL456368', 'chrUn_GL456385', 'chr5_JH584298_random', 'chrUn_GL456393', 'chrUn_GL456392', 'chrY_JH584300_random', 'chrUn_GL456396', 'chrUn_GL456378', 'chr4_GL456216_random', 'chr1_GL456213_random', 'chr7', 'chr6', 'chr5', 'chr1', 'chr4_JH584295_random', 'chrUn_GL456239', 'chrY_JH584303_random', 'chr9', 'chrUn_GL456390', 'chr5_JH584297_random', 'chr7_GL456219_random', 'chrY_JH584301_random', 'chr1_GL456221_random', 'chr5_GL456354_random', 'chr1_GL456211_random', 'chrUn_GL456370', 'chrUn_GL456359', 'chrUn_GL456381']
Keeping 12 chunk(s) of reads: ['chrX', 'chr13', 'chr12', 'chr10', 'chr17', 'chr14', 'chr19', 'chrM', 'chr4', 'chr3', 'chr2', 'chr8']
Reduce step (merge files)...
...
> `Time`        0:04:49 PEPATAC _RES_
> `Success`     11-01-07:54:24  PEPATAC _RES_

##### [Epilogue:]
*   Total elapsed time:  0:04:49
*     Peak memory used:  0.04 GB
* Pipeline completed at:  (11-01 07:54:24) elapsed: 6.0 _TIME_

Pypiper terminating spawned child process 53727...(tee)

End the interactive session:
[user@cn3200 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$