High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
bedtools on Biowulf2 & Helix

Description

The BEDTools utilities allow one to address common genomics tasks such finding feature overlaps and computing coverage. In addition, one can develop sophisticated pipelines that answer complicated research questions by chaining several BEDTools together.

Bedtools contains a single binary (bedtools) that takes a number of commands as as it's first argument:

bedtools <subcommand> [options]

Subcommands:

Genome arithmetic
intersect Find overlapping intervals in various ways.
windowFind overlapping intervals within a window around an interval.
closestFind the closest, potentially non-overlapping interval.
coverageCompute the coverage over defined intervals.
mapApply a function to a column for each overlapping interval.
genomecovCompute the coverage over an entire genome.
mergeCombine overlapping/nearby intervals into a single interval.
clusterCluster (but don't merge) overlapping/nearby intervals.
complementExtract intervals _not_ represented by an interval file.
subtractRemove intervals based on overlaps b/w two files.
slopAdjust the size of intervals.
flankCreate new intervals from the flanks of existing intervals.
sortOrder the intervals in a file.
randomGenerate random intervals in a genome.
shuffleRandomly redistrubute intervals in a genome.
sampleSample random records from file using reservoir sampling.
spacingReport the gap lengths between intervals in a file.
annotateAnnotate coverage of features from multiple files.
Multi-way file comparisons
multiinterIdentifies common intervals among multiple interval files.
unionbedgCombines coverage intervals from multiple BEDGRAPH files.
Paired-end manipulation
pairtobedFind pairs that overlap intervals in various ways.
pairtopairFind pairs that overlap other pairs in various ways.
Format conversion
bamtobedConvert BAM alignments to BED (& other) formats.
bedtobamConvert intervals to BAM records.
bamtofastqConvert BAM records to FASTQ records.
bedpetobamConvert BEDPE intervals to BAM records.
bed12tobed6Breaks BED12 intervals into discrete BED6 intervals.
Fasta manipulation
getfastaUse intervals to extract sequences from a FASTA file.
maskfastaUse intervals to mask sequences from a FASTA file.
nucProfile the nucleotide content of intervals in a FASTA file.
BAM focused tools
multicov Counts coverage from multiple BAMs at specific intervals.
tag Tag BAM alignments based on overlaps with interval files.
Satistical relationships
jaccardCalculate the Jaccard statistic b/w two sets of intervals.
reldistCalculate the distribution of relative distances b/w two files.
fisherCalculate Fisher statistic b/w two feature files.
Miscellaneous tools
overlapComputes the amount of overlap from two intervals.
igvCreate an IGV snapshot batch script.
linksCreate a HTML page of links to UCSC locations.
makewindowsMake interval "windows" across a genome.
groupbyGroup by common cols. & summarize oth. cols. (~ SQL "groupBy")
expandReplicate lines based on lists of values in columns.
splitSplit a file into multiple files with equal records or base pairs.

In addtion, there are also wrapper scripts for each of the tools with the original names of the individual executable. For example coverageBed is a synonym for bedtools coverage

There are multiple versions of bedtools available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail bedtools 

To select a module use

module load bedtools/[version]

where [version] is the version of choice.

Environment variables set

Web sites

References

Running bedtools on Helix

Set up the environment and copy some test data to the local directory:

[helix]$ module load bedtools
[helix]$ cp /usr/local/apps/bedtools/TEST_DATA/* .

Merge identical intervals. Names (column 4 of bed file) of merged intervals are concatenated. All intervals are 2000nts, each centered on a promoter from the knownGenes table. Note that all example files used here are sorted identically.

[helix]$ bedtools merge -d -2000 -c 4 -o distinct \
    -i mm9_knownGene_promoters.bed\
  > mm9_knownGene_upromoters.bed
[helix]$ wc -l mm9_knownGene_promoters.bed
55105 mm9_knownGene_promoters.bed
[helix]$ wc -l mm9_knownGene_upromoters.bed
38862 mm9_knownGene_upromoters.bed

For each promoter interval, add a new column that contains the count of H3K27ac that overlap it

[helix]$ bedtools intersect -c -sorted \
    -a mm9_knownGene_upromoters.bed \
    -b ENCFF001XYE.bed \
    > mm9_knownGene_upromoters_ac.bed

And then count the number of reads from the H3K27ac IP that overlap the same interval

[helix]$ bedtools intersect -c -sorted \
    -a mm9_knownGene_upromoters_ac.bed \
    -b  ENCFF001KOM.bam \
    > mm9_knownGene_upromoters_ac_reads.bed

Show the mean and median number of reads per interval for promoters that overlap 0, 1, ... H3K27ac peaks. File needs to be sorted by the column that is used to group lines

[helix]$ sort -k5,5 mm9_knownGene_upromoters_ac_reads.bed \
  | bedtools groupby -g 5 -c 6,6,6 -o count,mean,median
0       24848   7.26295878943979378306  3
1       12288   95.410400390625 67
2       1675    89.6119402985074628987  74
3       48      86.4791666666666714036  62
4       3       43.6666666666666642982  49

Running a single bedtools batch job on Biowulf2

Create a batch script similar to the following example using the example data copied above:

#!/bin/bash 
set -e
set -o pipefail

module load bedtools
bedtools merge -d -2000 -c 4 -o distinct \
    -i mm9_knownGene_promoters.bed\
  > mm9_knownGene_upromoters.bed
bedtools intersect -c -sorted \
    -a mm9_knownGene_upromoters.bed \
    -b ENCFF001XYE.bed \
    > mm9_knownGene_upromoters_ac.bed
bedtools intersect -c -sorted \
    -a mm9_knownGene_upromoters_ac.bed \
    -b  ENCFF001KOM.bam \
    > mm9_knownGene_upromoters_ac_reads.bed
sort -k5,5 mm9_knownGene_upromoters_ac_reads.bed \
  | bedtools groupby -g 5 -c 6,6,6 -o count,mean,median \
  > median_counts.tsv
Submit to the queue with sbatch:
[biowulf2]$ sbatch jobscript

If more memory is needed than the default (4GB), use --mem to allocate more. For sorted data, bedtools generally does not require much memory

Running a swarm of bedtools batch jobs on Biowulf2

Create a swarm command file similar to the following example:

bamToBed -i file1.bam > file1.bed
bamToBed -i file2.bam > file2.bed
bamToBed -i file3.bam > file3.bed
[...]

And submit to the queue with swarm

swarm -f cmdfile --module bedtools

For more information regarding running swarm, see the swarm documentation.

Running an interactive job on Biowulf2

Users may need to run jobs interactively sometimes. Such jobs should not be run on the Biowulf2 login node. Instead allocate an interactive node as described below, and run the interactive job there or use helix for light workloads.

[biowulf2]$ sinteractive 10
      salloc.exe: Granted job allocation 1528

[node]$ cd /data/$USER/myruns
[node]$ module load bedtools
[...]