High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
bedtools on Biowulf

The BEDTools utilities allow one to address common genomics tasks such finding feature overlaps and computing coverage. In addition, one can develop sophisticated pipelines that answer complicated research questions by chaining several BEDTools together.

Bedtools contains a single binary (bedtools) that takes a number of commands as as it's first argument.

bedtools <subcommand> [options]

Subcommands available in version 2.27.1

intersectFind overlapping intervals in various ways.
windowFind overlapping intervals within a window around an interval.
closestFind the closest, potentially non-overlapping interval.
coverageCompute the coverage over defined intervals.
mapApply a function to a column for each overlapping interval.
genomecovCompute the coverage over an entire genome.
mergeCombine overlapping/nearby intervals into a single interval.
clusterCluster (but don't merge) overlapping/nearby intervals.
complementExtract intervals _not_ represented by an interval file.
shiftAdjust the position of intervals.
subtractRemove intervals based on overlaps b/w two files.
slopAdjust the size of intervals.
flankCreate new intervals from the flanks of existing intervals.
sortOrder the intervals in a file.
randomGenerate random intervals in a genome.
shuffleRandomly redistrubute intervals in a genome.
sampleSample random records from file using reservoir sampling.
spacingReport the gap lengths between intervals in a file.
annotateAnnotate coverage of features from multiple files.
multiinterIdentifies common intervals among multiple interval files.
unionbedgCombines coverage intervals from multiple BEDGRAPH files.
pairtobedFind pairs that overlap intervals in various ways.
pairtopairFind pairs that overlap other pairs in various ways.
bamtobedConvert BAM alignments to BED (& other) formats.
bedtobamConvert intervals to BAM records.
bamtofastqConvert BAM records to FASTQ records.
bedpetobamConvert BEDPE intervals to BAM records.
bed12tobed6Breaks BED12 intervals into discrete BED6 intervals.
getfastaUse intervals to extract sequences from a FASTA file.
maskfastaUse intervals to mask sequences from a FASTA file.
nucProfile the nucleotide content of intervals in a FASTA file.
multicovCounts coverage from multiple BAMs at specific intervals.
tagTag BAM alignments based on overlaps with interval files.
jaccardCalculate the Jaccard statistic b/w two sets of intervals.
reldistCalculate the distribution of relative distances b/w two files.
fisherCalculate Fisher statistic b/w two feature files.
overlapComputes the amount of overlap from two intervals.
igvCreate an IGV snapshot batch script.
linksCreate a HTML page of links to UCSC locations.
makewindowsMake interval "windows" across a genome.
groupbyGroup by common cols. & summarize oth. cols. (~ SQL "groupBy")
expandReplicate lines based on lists of values in columns.
splitSplit a file into multiple files with equal records or base pairs.


Important Notes

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load bedtools
[user@cn3144 ~]$ cp $BEDTOOLS_TEST_DATA/* .
[user@cn3144 ~]$ ls -lh
total 250M
-rw-r--r-- 1 user group  115 Feb  3 09:43 chroms.txt
-rw-r--r-- 1 user group 245M Feb  3 09:43 ENCFF001KOM.bam
-rw-r--r-- 1 user group 2.1M Feb  3 09:43 ENCFF001XYE.bed
-rw-r--r-- 1 user group 2.2M Feb  3 09:43 mm9_knownGene_promoters.bed

Merge identical intervals. Names (column 4 of bed file) of merged intervals are concatenated. All intervals are 2000nts, each centered on a promoter from the knownGenes table. Note that all example files used here are sorted identically. Note the negative distance with indicates the number of bp overlap required for merging.

[user@cn3144 ~]$ bedtools merge -d -2000 -c 4 -o distinct \
    -i mm9_knownGene_promoters.bed \
  > mm9_knownGene_upromoters.bed
[user@cn3144 ~]$ wc -l mm9_knownGene_promoters.bed
55105 mm9_knownGene_promoters.bed
[user@cn3144 ~]$ wc -l mm9_knownGene_upromoters.bed
38862 mm9_knownGene_upromoters.bed

For each promoter interval, add a new column that contains the count of H3K27ac peaks that overlap it

[user@cn3144 ~]$ bedtools intersect -c -sorted \
    -a mm9_knownGene_upromoters.bed \
    -b ENCFF001XYE.bed \
    > mm9_knownGene_upromoters_ac.bed
[user@cn3144 ~]$ head -5 mm9_knownGene_upromoters_ac.bed
chr1    3204713 3206713 uc007aet.1:     0
chr1    3647985 3649985 uc007aev.1:     0
chr1    3660579 3662579 uc007aeu.1:Q5GH67       0
chr1    4349395 4351395 uc007aex.2:Q548Q8       0
chr1    4398322 4400322 uc007aew.1:NP_001182591 0

Then count the number of reads from the H3K27ac IP that overlap the same interval

[user@cn3144 ~]$ bedtools intersect -c -sorted \
    -a mm9_knownGene_upromoters_ac.bed \
    -b  ENCFF001KOM.bam \
    > mm9_knownGene_upromoters_ac_reads.bed

Show the mean and median number of reads per interval for promoters that overlap 0, 1, ... H3K27ac peaks. File needs to be sorted by the column that is used to group lines

[user@cn3144 ~]$ sort -k5,5 mm9_knownGene_upromoters_ac_reads.bed \
  | bedtools groupby -g 5 -c 6,6,6 -o count,mean,median
0       24848   7.26295878943979378306  3
1       12288   95.410400390625 67
2       1675    89.6119402985074628987  74
3       48      86.4791666666666714036  62
4       3       43.6666666666666642982  49

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. bedtools.sh), which uses the input file 'bedtools.in'. For example:

set -e
set -o pipefail

module load bedtools
bedtools merge -d -2000 -c 4 -o distinct \
    -i mm9_knownGene_promoters.bed\
  > mm9_knownGene_upromoters.bed
bedtools intersect -c -sorted \
    -a mm9_knownGene_upromoters.bed \
    -b ENCFF001XYE.bed \
    > mm9_knownGene_upromoters_ac.bed
bedtools intersect -c -sorted \
    -a mm9_knownGene_upromoters_ac.bed \
    -b  ENCFF001KOM.bam \
    > mm9_knownGene_upromoters_ac_reads.bed
sort -k5,5 mm9_knownGene_upromoters_ac_reads.bed \
  | bedtools groupby -g 5 -c 6,6,6 -o count,mean,median \
  > median_counts.tsv

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] bedtools.sh
Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. bedtools.swarm). For example:

bamToBed -i file1.bam > file1.bed
bamToBed -i file2.bam > file2.bed
bamToBed -i file3.bam > file3.bed

Submit this job using the swarm command.

swarm -f bedtools.swarm [-g #] [-t #] --module bedtools
-g # Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t # Number of threads/CPUs required for each process (1 line in the swarm command file).
--module bedtools Loads the bedtools module for each subjob in the swarm