bedtools on Biowulf

Quick Links

The BEDTools utilities allow one to address common genomics tasks such finding feature overlaps and computing coverage. In addition, one can develop sophisticated pipelines that answer complicated research questions by chaining several BEDTools together.

Bedtools contains a single binary (bedtools) that takes a number of commands as as it's first argument.

bedtools <subcommand> [options]

Subcommands available in version 2.29.0

Genome
intersect	Find overlapping intervals in various ways.
window	Find overlapping intervals within a window around an interval.
closest	Find the closest, potentially non-overlapping interval.
coverage	Compute the coverage over defined intervals.
map	Apply a function to a column for each overlapping interval.
genomecov	Compute the coverage over an entire genome.
merge	Combine overlapping/nearby intervals into a single interval.
cluster	Cluster (but don't merge) overlapping/nearby intervals.
complement	Extract intervals _not_ represented by an interval file.
shift	Adjust the position of intervals.
subtract	Remove intervals based on overlaps b/w two files.
slop	Adjust the size of intervals.
flank	Create new intervals from the flanks of existing intervals.
sort	Order the intervals in a file.
random	Generate random intervals in a genome.
shuffle	Randomly redistribute intervals in a genome.
annotate	Annotate coverage of features from multiple files.
Multi-way
multiinter	Identifies common intervals among multiple interval files.
unionbedg	Combines coverage intervals from multiple BEDGRAPH files.
Paired-end
pairtobed	Find pairs that overlap intervals in various ways.
pairtopair	Find pairs that overlap other pairs in various ways.
Format
bamtobed	Convert BAM alignments to BED (& other) formats.
bedtobam	Convert intervals to BAM records.
bamtofastq	Convert BAM records to FASTQ records.
bedpetobam	Convert BEDPE intervals to BAM records.
bed12tobed6	Breaks BED12 intervals into discrete BED6 intervals.
Fasta
getfasta	Use intervals to extract sequences from a FASTA file.
maskfasta	Use intervals to mask sequences from a FASTA file.
nuc	Profile the nucleotide content of intervals in a FASTA file.
BAM
multicov	Counts coverage from multiple BAMs at specific intervals.
tag	Tag BAM alignments based on overlaps with interval files.
Statistical
jaccard	Calculate the Jaccard statistic b/w two sets of intervals.
reldist	Calculate the distribution of relative distances b/w two files.
fisher	Calculate Fisher statistic b/w two feature files.
Miscellaneous
overlap	Computes the amount of overlap from two intervals.
igv	Create an IGV snapshot batch script.
links	Create a HTML page of links to UCSC locations.
makewindows	Make interval "windows" across a genome.
groupby	Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
expand	Replicate lines based on lists of values in columns.
summary	Statistical summary of intervals in a file.

References:

Aaron R. Quinlan and Ira M. Hall. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26:841-842 Pubmed | PMC | Journal

Documentation

Important Notes

Module Name: bedtools (see the modules page for more information)
Example files in $BEDTOOLS_TEST_DATA

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load bedtools
[user@cn3144 ~]$ cp $BEDTOOLS_TEST_DATA/* .
[user@cn3144 ~]$ ls -lh
total 250M
-rw-r--r-- 1 user group  115 Feb  3 09:43 chroms.txt
-rw-r--r-- 1 user group 245M Feb  3 09:43 ENCFF001KOM.bam
-rw-r--r-- 1 user group 2.1M Feb  3 09:43 ENCFF001XYE.bed
-rw-r--r-- 1 user group 2.2M Feb  3 09:43 mm9_knownGene_promoters.bed

Merge identical intervals. Names (column 4 of bed file) of merged intervals are concatenated. All intervals are 2000nts, each centered on a promoter from the knownGenes table. Note that all example files used here are sorted identically. Note the negative distance with indicates the number of bp overlap required for merging.

[user@cn3144 ~]$ bedtools merge -d -2000 -c 4 -o distinct \
    -i mm9_knownGene_promoters.bed \
  > mm9_knownGene_upromoters.bed
[user@cn3144 ~]$ wc -l mm9_knownGene_promoters.bed
55105 mm9_knownGene_promoters.bed
[user@cn3144 ~]$ wc -l mm9_knownGene_upromoters.bed
38862 mm9_knownGene_upromoters.bed

For each promoter interval, add a new column that contains the count of H3K27ac peaks that overlap it

[user@cn3144 ~]$ bedtools intersect -c -sorted \
    -a mm9_knownGene_upromoters.bed \
    -b ENCFF001XYE.bed \
    > mm9_knownGene_upromoters_ac.bed
[user@cn3144 ~]$ head -5 mm9_knownGene_upromoters_ac.bed
chr1    3204713 3206713 uc007aet.1:     0
chr1    3647985 3649985 uc007aev.1:     0
chr1    3660579 3662579 uc007aeu.1:Q5GH67       0
chr1    4349395 4351395 uc007aex.2:Q548Q8       0
chr1    4398322 4400322 uc007aew.1:NP_001182591 0

Then count the number of reads from the H3K27ac IP that overlap the same interval

[user@cn3144 ~]$ bedtools intersect -c -sorted \
    -a mm9_knownGene_upromoters_ac.bed \
    -b  ENCFF001KOM.bam \
    > mm9_knownGene_upromoters_ac_reads.bed

Show the mean and median number of reads per interval for promoters that overlap 0, 1, ... H3K27ac peaks. File needs to be sorted by the column that is used to group lines

[user@cn3144 ~]$ sort -k5,5 mm9_knownGene_upromoters_ac_reads.bed \
  | bedtools groupby -g 5 -c 6,6,6 -o count,mean,median
0       24848   7.26295878943979378306  3
1       12288   95.410400390625 67
2       1675    89.6119402985074628987  74
3       48      86.4791666666666714036  62
4       3       43.6666666666666642982  49

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job

Most jobs should be run as batch jobs.

Create a batch input file (e.g. bedtools.sh), which uses the input file 'bedtools.in'. For example:

#!/bin/bash 
set -e
set -o pipefail

module load bedtools
bedtools merge -d -2000 -c 4 -o distinct \
    -i mm9_knownGene_promoters.bed\
  > mm9_knownGene_upromoters.bed
bedtools intersect -c -sorted \
    -a mm9_knownGene_upromoters.bed \
    -b ENCFF001XYE.bed \
    > mm9_knownGene_upromoters_ac.bed
bedtools intersect -c -sorted \
    -a mm9_knownGene_upromoters_ac.bed \
    -b  ENCFF001KOM.bam \
    > mm9_knownGene_upromoters_ac_reads.bed
sort -k5,5 mm9_knownGene_upromoters_ac_reads.bed \
  | bedtools groupby -g 5 -c 6,6,6 -o count,mean,median \
  > median_counts.tsv

Submit this job using the Slurm sbatch command.

sbatch [--cpus-per-task=#] [--mem=#] bedtools.sh

Swarm of Jobs

A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. bedtools.swarm). For example:

bamToBed -i file1.bam > file1.bed
bamToBed -i file2.bam > file2.bed
bamToBed -i file3.bam > file3.bed

Submit this job using the swarm command.

swarm -f bedtools.swarm [-g #] [-t #] --module bedtools

where

-g #	Number of Gigabytes of memory required for each process (1 line in the swarm command file)
-t #	Number of threads/CPUs required for each process (1 line in the swarm command file).
--module bedtools	Loads the bedtools module for each subjob in the swarm