The BEDTools utilities allow one to address common genomics tasks such finding feature overlaps and computing coverage. In addition, one can develop sophisticated pipelines that answer complicated research questions by chaining several BEDTools together.
Bedtools contains a single binary (bedtools
) that takes a
number of commands as as it's first argument.
bedtools <subcommand> [options]
Genome | |
---|---|
intersect | Find overlapping intervals in various ways. |
window | Find overlapping intervals within a window around an interval. |
closest | Find the closest, potentially non-overlapping interval. |
coverage | Compute the coverage over defined intervals. |
map | Apply a function to a column for each overlapping interval. |
genomecov | Compute the coverage over an entire genome. |
merge | Combine overlapping/nearby intervals into a single interval. |
cluster | Cluster (but don't merge) overlapping/nearby intervals. |
complement | Extract intervals _not_ represented by an interval file. |
shift | Adjust the position of intervals. |
subtract | Remove intervals based on overlaps b/w two files. |
slop | Adjust the size of intervals. |
flank | Create new intervals from the flanks of existing intervals. |
sort | Order the intervals in a file. |
random | Generate random intervals in a genome. |
shuffle | Randomly redistribute intervals in a genome. |
annotate | Annotate coverage of features from multiple files. |
Multi-way | |
multiinter | Identifies common intervals among multiple interval files. |
unionbedg | Combines coverage intervals from multiple BEDGRAPH files. |
Paired-end | |
pairtobed | Find pairs that overlap intervals in various ways. |
pairtopair | Find pairs that overlap other pairs in various ways. |
Format | |
bamtobed | Convert BAM alignments to BED (& other) formats. |
bedtobam | Convert intervals to BAM records. |
bamtofastq | Convert BAM records to FASTQ records. |
bedpetobam | Convert BEDPE intervals to BAM records. |
bed12tobed6 | Breaks BED12 intervals into discrete BED6 intervals. |
Fasta | |
getfasta | Use intervals to extract sequences from a FASTA file. |
maskfasta | Use intervals to mask sequences from a FASTA file. |
nuc | Profile the nucleotide content of intervals in a FASTA file. |
BAM | |
multicov | Counts coverage from multiple BAMs at specific intervals. |
tag | Tag BAM alignments based on overlaps with interval files. |
Statistical | |
jaccard | Calculate the Jaccard statistic b/w two sets of intervals. |
reldist | Calculate the distribution of relative distances b/w two files. |
fisher | Calculate Fisher statistic b/w two feature files. |
Miscellaneous | |
overlap | Computes the amount of overlap from two intervals. |
igv | Create an IGV snapshot batch script. |
links | Create a HTML page of links to UCSC locations. |
makewindows | Make interval "windows" across a genome. |
groupby | Group by common cols. & summarize oth. cols. (~ SQL "groupBy") |
expand | Replicate lines based on lists of values in columns. |
summary | Statistical summary of intervals in a file. |
$BEDTOOLS_TEST_DATA
Allocate an interactive session and run the program. Sample session:
[user@biowulf]$ sinteractive salloc.exe: Pending job allocation 46116226 salloc.exe: job 46116226 queued and waiting for resources salloc.exe: job 46116226 has been allocated resources salloc.exe: Granted job allocation 46116226 salloc.exe: Waiting for resource configuration salloc.exe: Nodes cn3144 are ready for job [user@cn3144 ~]$ module load bedtools [user@cn3144 ~]$ cp $BEDTOOLS_TEST_DATA/* . [user@cn3144 ~]$ ls -lh total 250M -rw-r--r-- 1 user group 115 Feb 3 09:43 chroms.txt -rw-r--r-- 1 user group 245M Feb 3 09:43 ENCFF001KOM.bam -rw-r--r-- 1 user group 2.1M Feb 3 09:43 ENCFF001XYE.bed -rw-r--r-- 1 user group 2.2M Feb 3 09:43 mm9_knownGene_promoters.bed
Merge identical intervals. Names (column 4 of bed file) of merged intervals are concatenated. All intervals are 2000nts, each centered on a promoter from the knownGenes table. Note that all example files used here are sorted identically. Note the negative distance with indicates the number of bp overlap required for merging.
[user@cn3144 ~]$ bedtools merge -d -2000 -c 4 -o distinct \ -i mm9_knownGene_promoters.bed \ > mm9_knownGene_upromoters.bed [user@cn3144 ~]$ wc -l mm9_knownGene_promoters.bed 55105 mm9_knownGene_promoters.bed [user@cn3144 ~]$ wc -l mm9_knownGene_upromoters.bed 38862 mm9_knownGene_upromoters.bed
For each promoter interval, add a new column that contains the count of H3K27ac peaks that overlap it
[user@cn3144 ~]$ bedtools intersect -c -sorted \ -a mm9_knownGene_upromoters.bed \ -b ENCFF001XYE.bed \ > mm9_knownGene_upromoters_ac.bed [user@cn3144 ~]$ head -5 mm9_knownGene_upromoters_ac.bed chr1 3204713 3206713 uc007aet.1: 0 chr1 3647985 3649985 uc007aev.1: 0 chr1 3660579 3662579 uc007aeu.1:Q5GH67 0 chr1 4349395 4351395 uc007aex.2:Q548Q8 0 chr1 4398322 4400322 uc007aew.1:NP_001182591 0
Then count the number of reads from the H3K27ac IP that overlap the same interval
[user@cn3144 ~]$ bedtools intersect -c -sorted \ -a mm9_knownGene_upromoters_ac.bed \ -b ENCFF001KOM.bam \ > mm9_knownGene_upromoters_ac_reads.bed
Show the mean and median number of reads per interval for promoters that overlap 0, 1, ... H3K27ac peaks. File needs to be sorted by the column that is used to group lines
[user@cn3144 ~]$ sort -k5,5 mm9_knownGene_upromoters_ac_reads.bed \ | bedtools groupby -g 5 -c 6,6,6 -o count,mean,median 0 24848 7.26295878943979378306 3 1 12288 95.410400390625 67 2 1675 89.6119402985074628987 74 3 48 86.4791666666666714036 62 4 3 43.6666666666666642982 49 [user@cn3144 ~]$ exit salloc.exe: Relinquishing job allocation 46116226 [user@biowulf ~]$
Create a batch input file (e.g. bedtools.sh), which uses the input file 'bedtools.in'. For example:
#!/bin/bash set -e set -o pipefail module load bedtools bedtools merge -d -2000 -c 4 -o distinct \ -i mm9_knownGene_promoters.bed\ > mm9_knownGene_upromoters.bed bedtools intersect -c -sorted \ -a mm9_knownGene_upromoters.bed \ -b ENCFF001XYE.bed \ > mm9_knownGene_upromoters_ac.bed bedtools intersect -c -sorted \ -a mm9_knownGene_upromoters_ac.bed \ -b ENCFF001KOM.bam \ > mm9_knownGene_upromoters_ac_reads.bed sort -k5,5 mm9_knownGene_upromoters_ac_reads.bed \ | bedtools groupby -g 5 -c 6,6,6 -o count,mean,median \ > median_counts.tsv
Submit this job using the Slurm sbatch command.
sbatch [--cpus-per-task=#] [--mem=#] bedtools.sh
Create a swarmfile (e.g. bedtools.swarm). For example:
bamToBed -i file1.bam > file1.bed bamToBed -i file2.bam > file2.bed bamToBed -i file3.bam > file3.bed
Submit this job using the swarm command.
swarm -f bedtools.swarm [-g #] [-t #] --module bedtoolswhere
-g # | Number of Gigabytes of memory required for each process (1 line in the swarm command file) |
-t # | Number of threads/CPUs required for each process (1 line in the swarm command file). |
--module bedtools | Loads the bedtools module for each subjob in the swarm |