PEPATAC: a modular pipeline for ATAC-seq data processing
PEPATAC is a robust pipeline for Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) built on a loosely coupled modular framework. It may be easily applied to ATAC-seq projects of any size, from one-off experiments to large-scale sequencing projects. It is optimized on unique features of ATAC-seq data to be fast and accurate and provides several unique analytical approaches.
Documentation
Important Notes
- Module Name: PEPATAC (see the modules page for more information)
- Unusual environment variables set
- PEPATAC_HOME installation directory
- PEPATAC_BIN executable directory
- PEPATAC_SRC source code directory
- PEPATAC_DATA sample data directory
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --cpus-per-task=16 --mem=32g --gres=lscratch:10 [user@cn3200 ~]$module load PEPATAC [+] Loading bedtools 2.27.1 [+] Loading bowtie 2-2.3.4.1 [+] Loading gcc 7.2.0 ... [+] Loading GSL 2.4 for GCC 7.2.0 ... [+] Loading openmpi 3.0.0 for GCC 7.2.0 [+] Loading R 3.5.0_build2 [-] Unloading gcc 7.2.0 ... [-] Unloading GSL 2.4 for GCC 7.2.0 ... [-] Unloading openmpi 3.0.0 for GCC 7.2.0 [-] Unloading R 3.5.0_build2 [+] Loading gcc 7.2.0 ... [+] Loading GSL 2.4 for GCC 7.2.0 ... [+] Loading openmpi 3.0.0 for GCC 7.2.0 [+] Loading R 3.5.0_build2 [+] Loading picard 2.17.11 [+] Loading fastqc 0.11.6 [+] Loading samtools 1.9 ... [+] Loading python 2.7 ... [+] Loading PEPATAC 0.8.3 [user@cn3200 ~]$ git clone https://github.com/databio/pepatac [user@cn3200 ~]$ cd pepatac [user@cn3200 ~]$ pepatac.py --help usage: pepatac.py [-h] [-R] [-N] [-D] [-F] [-C CONFIG_FILE] -O PARENT_OUTPUT_FOLDER [-M MEMORY_LIMIT] [-P NUMBER_OF_CORES] -S SAMPLE_NAME -I INPUT_FILES [INPUT_FILES ...] [-I2 [INPUT_FILES2 [INPUT_FILES2 ...]]] -G GENOME_ASSEMBLY [-Q SINGLE_OR_PAIRED] [-gs GENOME_SIZE] [--frip-ref-peaks FRIP_REF_PEAKS] [--TSS-name TSS_NAME] [--anno-name ANNO_NAME] [--keep] [--noFIFO] [--peak-caller {fseq,macs2}] [--trimmer {trimmomatic,pyadapt,skewer}] [--prealignments PREALIGNMENTS [PREALIGNMENTS ...]] [-V] PEPATAC version 0.8.3 optional arguments: -h, --help show this help message and exit -R, --recover Overwrite locks to recover from previous failed run -N, --new-start Overwrite all results to start a fresh run -D, --dirty Don't auto-delete intermediate files -F, --force-follow Always run 'follow' commands -C CONFIG_FILE, --config CONFIG_FILE Pipeline configuration file (YAML). Relative paths are with respect to the pipeline script. -M MEMORY_LIMIT, --mem MEMORY_LIMIT Memory limit (in Mb) for processes accepting such -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES Number of cores for parallelized processes -I2 [INPUT_FILES2 [INPUT_FILES2 ...]], --input2 [INPUT_FILES2 [INPUT_FILES2 ...]] Secondary input files, such as read2 -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED Single- or paired-end sequencing protocol -gs GENOME_SIZE, --genome-size GENOME_SIZE genome size for MACS2 --frip-ref-peaks FRIP_REF_PEAKS Reference peak set for calculating FRiP --TSS-name TSS_NAME Name of TSS annotation file --anno-name ANNO_NAME Name of reference bed file for calculating FRiF --keep Keep prealignment BAM files --noFIFO Do NOT use named pipes during prealignments --peak-caller {fseq,macs2} Name of peak caller --trimmer {trimmomatic,pyadapt,skewer} Name of read trimming program --prealignments PREALIGNMENTS [PREALIGNMENTS ...] Space-delimited list of reference genomes to align to before primary alignment. -V, --version show program's version number and exit required named arguments: -O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER Parent output directory of project -S SAMPLE_NAME, --sample-name SAMPLE_NAME Name for sample to run -I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...] One or more primary input files -G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY Identifier for genome assemblyReference data are stored in /fdb/PEPATAC, in refgenie format.
Here is the command to run the pipeline using the human reference data:
[user@cn3200 ~]$ pipelines/pepatac.py --single-or-paired paired --prealignments rCRSd human_repeats --genome hg38 --sample-name test_hg --input $PEPATAC_DATA/test1_r1.fastq.gz --input2 $PEPATAC_DATA/test1_r2.fastq.gz --genome-size hs -O . -P 16 ... Processing with 5 cores... [Name: chr13; Size: 114364328] [Name: chr12; Size: 133275309] [Name: chr11; Size: 135086622] [Name: chr10; Size: 133797422] [Name: chr17; Size: 83257441] [Name: chr16; Size: 90338345] [Name: chr15; Size: 101991189] [Name: chr19; Size: 58617616] [Name: chr18; Size: 80373285] [Name: chr17_KI270729v1_random; Size: 280839] [Name: chr16_KI270728v1_random; Size: 1872759] [Name: chrUn_GL000214v1; Size: 137718] [Name: chrUn_KI270744v1; Size: 168472] [Name: chrUn_GL000216v2; Size: 176608] [Name: chrUn_GL000195v1; Size: 182896] [Name: chr9_KI270719v1_random; Size: 176845] [Name: chr14; Size: 107043718] [Name: chr17_GL000205v2_random; Size: 185591] [Name: chr1_KI270709v1_random; Size: 66860] [Name: chr14_KI270722v1_random; Size: 194050] [Name: chrUn_KI270519v1; Size: 138126] [Name: chr5; Size: 181538259] [Name: chrUn_KI270438v1; Size: 112505] [Name: chr14_GL000225v1_random; Size: 211173] [Name: chrX; Size: 156040895] [Name: chrUn_KI270467v1; Size: 3920] [Name: chr22; Size: 50818468] [Name: chr20; Size: 64444167] [Name: chr21; Size: 46709983] [Name: chrUn_KI270468v1; Size: 4055] [Name: chr7; Size: 159345973] [Name: chr6; Size: 170805979] [Name: chr4; Size: 190214555] [Name: chr3; Size: 198295559] [Name: chr2; Size: 242193529] [Name: chr1; Size: 248956422] [Name: chrUn_KI270466v1; Size: 1233] [Name: chr9; Size: 138394717] [Name: chr8; Size: 145138636] Discarding 156 chunk(s) of reads: ['chrUn_KI270748v1', 'chrUn_KI270337v1', 'chrUn_KI270749v1', 'chr1_KI270713v1_random', 'chrUn_KI270418v1', 'chrUn_KI270305v1', ... ... Merging 39 files into output file: '/scratch/user/PEPATAC/pepatac_test/test_hg/aligned_hg38_exact/test_hg_exact.bw' Merging 39 files into output file: '/scratch/user/PEPATAC/pepatac_test/test_hg/aligned_hg38/test_hg_smooth.bw' ... INFO @ Wed, 31 Oct 2018 14:44:22: # Command line: callpeak -t /scratch/user/PEPATAC/pepatac_test/test_hg/aligned_hg38_exact/test_hg_shift.bed -f BED -g hs --outdir /scratch/user/PEPATAC/pepatac_test/test_hg/peak_calling_hg38 -n test_hg -q 0.01 --shift 0 --nomodel # ARGUMENTS LIST: # name = test_hg # format = BED # ChIP-seq file = ['/scratch/user/PEPATAC/pepatac_test/test_hg/aligned_hg38_exact/test_hg_shift.bed'] # control file = None # effective genome size = 2.70e+09 # band width = 300 # model fold = [5, 50] # qvalue cutoff = 1.00e-02 # Larger dataset will be scaled towards smaller dataset. # Range for calculating regional lambda is: 10000 bps # Broad region calling is off # Paired-End mode is off ... > `Time` 0:08:37 PEPATAC _RES_ > `Success` 10-31-14:45:41 PEPATAC _RES_ ##### [Epilogue:] * Total elapsed time: 0:08:37 * Peak memory used: 0.04 GB * Pipeline completed at: (10-31 14:45:41) elapsed: 6.0 _TIME_The following command works with mouse reference data:
[user@cn3200 ~]$ pipelines/pepatac.py --single-or-paired paired --genome mm10 --sample-name test_mm --input $PEPATAC_DATA/test1_r1.fastq.gz --input2 $PEPATAC_DATA/test1_r2.fastq.gz --genome-size hs -O . -P 16 ... Processing with 5 cores... [Name: chrX; Size: 171031299] [Name: chr13; Size: 120421639] [Name: chr12; Size: 120129022] [Name: chr10; Size: 130694993] [Name: chr17; Size: 94987271] [Name: chr19; Size: 61431566] [Name: chrM; Size: 16299] [Name: chr4; Size: 156508116] [Name: chr3; Size: 160039680] [Name: chr2; Size: 182113224] [Name: chr8; Size: 129401213] Discarding 54 chunk(s) of reads: ['chrUn_GL456379', 'chrUn_GL456387', 'chrUn_GL456382', 'chr4_JH584292_random', 'chrUn_GL456383', 'chr4_JH584293_random', 'chrY', 'chr11', 'chr16', 'chr15', 'chr18', 'chr4_JH584294_random', 'chr4_GL456350_random', 'chrY_JH584302_random', 'chrUn_GL456389', 'chrX_GL456233_random', 'chrUn_JH584304', 'chrUn_GL456366', 'chrUn_GL456367', 'chr1_GL456212_random', 'chr5_JH584299_random', 'chrUn_GL456360', 'chr5_JH584296_random', 'chrUn_GL456394', 'chr1_GL456210_random', 'chrUn_GL456372', 'chrUn_GL456368', 'chrUn_GL456385', 'chr5_JH584298_random', 'chrUn_GL456393', 'chrUn_GL456392', 'chrY_JH584300_random', 'chrUn_GL456396', 'chrUn_GL456378', 'chr4_GL456216_random', 'chr1_GL456213_random', 'chr7', 'chr6', 'chr5', 'chr1', 'chr4_JH584295_random', 'chrUn_GL456239', 'chrY_JH584303_random', 'chr9', 'chrUn_GL456390', 'chr5_JH584297_random', 'chr7_GL456219_random', 'chrY_JH584301_random', 'chr1_GL456221_random', 'chr5_GL456354_random', 'chr1_GL456211_random', 'chrUn_GL456370', 'chrUn_GL456359', 'chrUn_GL456381'] Keeping 12 chunk(s) of reads: ['chrX', 'chr13', 'chr12', 'chr10', 'chr17', 'chr14', 'chr19', 'chrM', 'chr4', 'chr3', 'chr2', 'chr8'] Reduce step (merge files)... ... > `Time` 0:04:49 PEPATAC _RES_ > `Success` 11-01-07:54:24 PEPATAC _RES_ ##### [Epilogue:] * Total elapsed time: 0:04:49 * Peak memory used: 0.04 GB * Pipeline completed at: (11-01 07:54:24) elapsed: 6.0 _TIME_ Pypiper terminating spawned child process 53727...(tee)End the interactive session:
[user@cn3200 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$