ChIPseeqer: a comprehensive framework for the analysis of ChIP-seq data
ChIPseeqer is an integrative, comprehensive, fast and user-friendly computational framework for in-depth analysis of ChIP-seq datasets. It combines several computational tools to create easily customized workflows that can be adapted to the user’s needs and objectives.
References:
- Eugenia G Giannopoulou and Olivier Elemento. An integrated ChIP-seq analysis platform with customizable workflows. BMC Bioinformatics 2011, 12: 277
Documentation
Important Notes
- Module Name: ChIPseeqer (see the modules page for more information)
- Unusual environment variables set
  - CHIPSEEQER_HOME installation directory
  - CHIPSEEQER_BIN executable directory
  - CHIPSEEQER_DATA sample data directory
  - CHIPSEEQER_SRC source code directory
  - CHIPSEEQER_DOC documentation directory
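After loading the module, it can be useful to confirm that the variables listed above are actually defined in the current shell. The following is a minimal sketch, not part of ChIPseeqer itself: the function name check_chipseeqer_env is hypothetical, and only the variable names come from the list above.

```shell
#!/bin/bash
# Hypothetical helper: report which ChIPseeqer variables are set.
# Returns the number of variables that are missing (0 means all are set).
check_chipseeqer_env() {
    missing=0
    for name in CHIPSEEQER_HOME CHIPSEEQER_BIN CHIPSEEQER_DATA \
                CHIPSEEQER_SRC CHIPSEEQER_DOC; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            printf '%s=%s\n' "$name" "$value"
        else
            printf '%s is not set\n' "$name"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

Typical use would be to run `module load ChIPseeqer` and then `check_chipseeqer_env` inside an interactive session; a non-zero return status suggests the module did not load as expected.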
Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive --mem=4g
[user@cn3200 ~]$ module load ChIPseeqer
[+] Loading gcc 4.8.5 ...
[+] Loading boost libraries v 5.10.1 ...
[+] Loading ChIPseeqer 2.1
[user@cn3200 ~]$ ChIPseeqerAnnotate
...
ChIPseeqerAnnotate
  --peakfile=FILE     File with ChIP-seq peaks.
  --lenuP=INT         Define the length upstream of TSS. Default is 2000bp.
  --lendP=INT         Define the length downstream of TSS. Default is 2000bp.
  --lenuDW=INT        Define the length upstream of TES. Default is 2000bp.
  --lendDW=INT        Define the length downstream of TES. Default is 2000bp.
  --genome=STR        hg19 (human)
                      hg18 (human)
                      mm10 (mouse)
                      mm9 (mouse)
                      rn4 (rat)
                      dm3 (drosophila)
                      sacser (Saccharomyces cerevisiae)
                      zv9 (zebrafish)
  --db=STR            refSeq (available for hg19, hg18, mm10, mm9, rn4, dm3, zv9)
                      AceView (for hg19, hg18, mm9)
                      Ensembl (for hg19, hg18, mm10, mm9, rn4, dm3, zv9)
                      UCSCGenes (for hg19, hg18, mm10, mm9).
                      Default is refSeq.
  --mindistaway=INT   Define minimum distance away from transcripts, used to define the distal regions. Default is 2000bp.
  --maxdistal=INT     Define maximum distance away from transcripts, used to define the distal regions. Default is 50000kb.
  --verbose=INT       Verbose mode. Default is 0.
Download data to current folder:
[user@cn3200 ~]$ cp $CS_DATA/* data
Reset the environment variable to make it point to your local data folder:
[user@cn3200 ~]$ export CS_DATA=./data
The data folder contains a sample peaks file, test_peaks.txt. To run the executable ChIPseeqerAnnotate on this file, type:
[user@cn3200 ~]$ ChIPseeqerAnnotate --peakfile=./data/test_peaks.txt --genome=hg19
Annotation files=/usr/local/apps/ChIPseeqer/2.1/src/dist/DATA/hg19/refSeq.new
Extracting [2000 - TSS - 2000] promoters ... Done.
Extracting [2000 - TES - 2000] downstream extremities ... Done.
Looking for distal peaks ...
Looking for peaks that are > 2000 bp away from any refSeq genes
Extracting extended gene bodies ... Done.
Found 0 peaks within extended gene bodies (in test_peaks.txt.refSeq.GENEPEAKS), and 8 distant peaks (test_peaks.txt.refSeq.DISTPEAKS).
Looking for 2 closest genes ...
Running FindClosestGene, with minimum distance 0...... Done (test_peaks.txt.refSeq.DISTPEAKS.refSeq.CLOSEST_NM.txt created).
Created test_peaks.txt.refSeq.DISTPEAKS.refSeq.GENEWITHPEAKS.txt
Converting RefSeq NM identifiers to ORFs......Done (test_peaks.txt.refSeq.DISTPEAKS.refSeq.CLOSEST_ORF.txt created).
Determining overlap between gene parts and ChIP-seq peaks ... Done (test_peaks.txt.refSeq.GP created).
Creating stats file ... Done (test_peaks.txt.refSeq.GP.stats created).
Creating frac file test_peaks.txt.refSeq.GP.frac .. Done
Creating transcript file with number of P, E, and I peaks for each transcript ... Done (test_peaks.txt.refSeq.GP.genes created).
Done (test_peaks.txt.refSeq.GP.genes.annotated created).
Creating list of peaks in promoters ... test_peaks.txt.refSeq.GP.promoters
Creating list of peaks in downstream extremities ... test_peaks.txt.refSeq.GP.DOWNEXTR
Creating list of peaks in exons ... test_peaks.txt.refSeq.GP.exons
Creating list of peaks in introns ... test_peaks.txt.refSeq.GP.introns
Creating list of peaks in introns 1 ... test_peaks.txt.refSeq.GP.introns1
Creating list of peaks in introns 2 ... test_peaks.txt.refSeq.GP.introns2
Creating list of peaks in distal regions (>2000 and <50000) ... test_peaks.txt.refSeq.GP.distal
Creating list of peaks in intergenic regions (>50000) ... test_peaks.txt.refSeq.GP.intergenic
Number of peaks: 8
Number of peaks that overlap with [-2000;2000] PROMOTERS: 0 (%0.0)
Number of peaks that overlap with [-2000;2000] DOWNSTREAM EXTREMITIES: 0 (%0.0)
Number of peaks that overlap with EXONS: 0 (%0.0)
Number of peaks that overlap with INTRONS: 0 (%0.0)
Number of peaks that overlap with DISTAL (>2000 and <50000): 0 (%0.0)
Number of peaks that overlap with INTERGENIC (>50000): 0 (%0.0)
End the interactive session:
[user@cn3200 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
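The summary lines at the end of the ChIPseeqerAnnotate output ("Number of peaks that overlap with ...") are convenient to collect into a table when comparing several runs. The sketch below assumes that the program's standard output was saved to a file and that each message appears on its own line, as in the sample session; the function name summarize_overlaps and the log file name are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: turn the overlap summary of a saved ChIPseeqerAnnotate
# log into tab-separated rows of the form: category, peak count, percentage.
summarize_overlaps() {
    # $1: file holding the captured standard output of ChIPseeqerAnnotate
    awk '
    /^Number of peaks that overlap with / {
        # Drop the fixed prefix, leaving e.g. "EXONS: 0 (%0.0)".
        sub(/^Number of peaks that overlap with /, "")
        split($0, parts, ": ")          # parts[1] = category, parts[2] = "0 (%0.0)"
        split(parts[2], fields, " ")    # fields[1] = count,   fields[2] = "(%0.0)"
        pct = fields[2]
        gsub(/[(%)]/, "", pct)          # strip the parentheses and percent sign
        printf "%s\t%s\t%s\n", parts[1], fields[1], pct
    }' "$1"
}
```

For example, after running `ChIPseeqerAnnotate --peakfile=./data/test_peaks.txt --genome=hg19 > annotate.log`, calling `summarize_overlaps annotate.log` would print one row per genomic category.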
Batch job
Most jobs should be run as batch jobs.
Create a batch input file (e.g. ChIPseeqer.sh). For example:
#!/bin/bash
#SBATCH --mem=4g
module load ChIPseeqer
ChIPseeqerAnnotate --peakfile=./data/test_peaks.txt --genome=hg19
Submit this job using the Slurm sbatch command.
sbatch ChIPseeqer.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.
Create a swarmfile (e.g. ChIPseeqer.swarm). For example:
#!/bin/bash
cd /data/$USER; ChIPseeqerAnnotate --peakfile=./data/test_peaks.txt --genome=hg19
Submit this job using the swarm command.
swarm -f ChIPseeqer.swarm -g 4 --module ChIPseeqer
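When many peak files need to be annotated, the swarmfile can be generated rather than written by hand. The following sketch writes one ChIPseeqerAnnotate command per peak file, using only the --peakfile and --genome options shown in the sample session. The function name make_annotate_swarm, the *.txt naming of peak files, and the output file name are assumptions made for illustration; file names containing spaces are not handled.

```shell
#!/bin/bash
# Hypothetical helper: write a swarmfile with one annotation command per peak file.
make_annotate_swarm() {
    peak_dir=$1     # directory holding peak files named *.txt (assumed layout)
    genome=$2       # genome label accepted by --genome, e.g. hg19
    swarmfile=$3    # path of the swarmfile to create

    : > "$swarmfile"    # start from an empty file
    for peaks in "$peak_dir"/*.txt; do
        [ -e "$peaks" ] || continue    # skip the unexpanded pattern if nothing matches
        printf 'ChIPseeqerAnnotate --peakfile=%s --genome=%s\n' \
            "$peaks" "$genome" >> "$swarmfile"
    done
}
```

A call such as `make_annotate_swarm /data/$USER/peaks hg19 annotate.swarm` would produce a file that could then be submitted with `swarm -f annotate.swarm -g 4 --module ChIPseeqer`, assuming the module needs to be loaded for each subjob.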