High-Performance Computing at the NIH
bedops on Biowulf & Helix

Description

Bedops is a software package for manipulating and analyzing genomic interval data. The tools used in the examples on this page are convert2bed, unstarch, and bedmap.

Operations can be parallelized efficiently by chromosome.

Each tool reads from and writes to standard Unix input and output streams, so tools can be chained into efficient pipelines.
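As a minimal sketch of such a pipeline, the function below sends bedmap output straight into sort and head without writing an intermediate file. The function name top_tss_on_chrom and its argument order are illustrative assumptions; the bedmap options are the ones used in the examples later on this page.

```shell
#!/bin/bash
# Sketch only: the function name and argument order are hypothetical.
# Prints the n TSS records on one chromosome with the highest read counts.
top_tss_on_chrom() {
    local chrom=$1 annot=$2 reads=$3 n=${4:-5}
    # bedmap writes tab-delimited records to stdout; the count is column 7.
    bedmap --chrom "$chrom" --ec --delim '\t' --range 1000 --echo --count \
        "$annot" "$reads" \
      | sort -k7,7gr \
      | head -n "$n"
}
```

After loading the bedops module, it could be called as: top_tss_on_chrom chr1 annot/140609_refseq_tss.bed starch/a13.starch 5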

Running bedops on Helix

Load the bedops module, which sets up the required paths:

$ module load bedops

Convert a BAM file of aligned ChIP-Seq reads to starch format, then query the archive for its note, its number of elements, and its number of unique bases:

$ cd /data/$USER/test_data
$ convert2bed --input=bam --output=starch --starch-note "wt/H3ac/a13" < bam_file > starch/a13.starch
$ unstarch --note starch/a13.starch
wt/H3ac/a13
$ unstarch --elements starch/a13.starch
17538004
$ unstarch --bases-uniq starch/a13.starch
199011022

Count the number of reads that fall within 1000 nt of each annotated promoter on chr1, then list the promoters with the highest counts:

$ bedmap --chrom chr1 --ec --delim "\t" --range 1000 --echo --count \
   annot/140609_refseq_tss.bed starch/a13.starch > a13_tss_count.bed
$ sort -k7,7gr a13_tss_count.bed | head -n 5
chr1    167718811       167718812       12503|Cd247     0       +       867
chr1    58480931        58480932        12747|Clk1      0       -       804
chr1    140071702       140071703       19264|Ptprc     0       -       784
chr1    157883609       157883610       208263|Tor1aip1 0       -       759
chr1    135975731       135975732       12227|Btg2      0       -       745

Running a single bedops batch job on Biowulf

An example script file for batch submission:

#! /bin/bash
set -e

cd /data/$USER/test_data
module load bedops
bedmap --chrom chr1 --ec --delim "\t" --range 1000 --echo --count \
   annot/140609_refseq_tss.bed starch/a13.starch > a13_tss_count.bed

Save the script as runbedops.sh, submit it with sbatch, and check its resource usage with jobload:

$ sbatch runbedops.sh
$ jobload -u $USER
     JOBID      RUNTIME     NODES   CPUS    AVG CPU%            MEMORY
                                                              Used/Alloc
     17374     00:00:16     p1005      2       50.00       6.3 MB/1.5 GB

Running a swarm of bedops batch jobs on Biowulf

With bedops, swarm can be used effectively to parallelize over chromosomes. For example, the following swarm command file runs the same command on each chromosome in parallel. Note that line continuations are allowed in swarm files.

bedmap --chrom chr1  --ec --delim '\t' --range 1000 --echo --count  \
  annot/140609_refseq_tss.bed starch/a13.starch > a13_tss_count_chr1.bed
bedmap --chrom chr2  --ec --delim '\t' --range 1000 --echo --count  \
  annot/140609_refseq_tss.bed starch/a13.starch > a13_tss_count_chr2.bed
bedmap --chrom chr3  --ec --delim '\t' --range 1000 --echo --count  \
  annot/140609_refseq_tss.bed starch/a13.starch > a13_tss_count_chr3.bed
...

This file is then used by swarm to submit one job per command using the default 2 cores and 1GB of memory:

$ swarm -f bedops.swarm --module bedops
$ jobload -u $USER
     JOBID      RUNTIME     NODES   CPUS    AVG CPU%            MEMORY
                                                              Used/Alloc
   17378_2     00:00:04     p1005      2       50.00       7.6 MB/1.0 GB
   17378_1     00:00:04     p1005      2       50.00       7.7 MB/1.0 GB
   17378_0     00:00:04     p1005      2       50.00       7.6 MB/1.0 GB
...
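The swarm command file shown above does not have to be written by hand. The following sketch generates one bedmap line per chromosome; the function name write_bedops_swarm is hypothetical, and the chromosome names passed to it are an assumption that should be replaced with the chromosomes actually present in the data. Each command is written on a single line rather than with a line continuation.

```shell
#!/bin/bash
# Sketch only: write_bedops_swarm is an illustrative helper, not part of
# bedops or swarm. Usage: write_bedops_swarm <swarm file> <chrom> [<chrom> ...]
write_bedops_swarm() {
    local swarmfile=$1; shift
    local chrom
    : > "$swarmfile"                      # create or truncate the swarm file
    for chrom in "$@"; do
        # Inside double quotes, \\t yields the literal two characters \t.
        printf '%s\n' "bedmap --chrom ${chrom} --ec --delim '\\t' --range 1000 --echo --count annot/140609_refseq_tss.bed starch/a13.starch > a13_tss_count_${chrom}.bed" >> "$swarmfile"
    done
}
```

For example, write_bedops_swarm bedops.swarm chr1 chr2 chr3 produces a three-line file that can be submitted with the swarm command above.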

Running an interactive job on Biowulf

Please do not run interactive analyses on the Biowulf login node. Instead, request an interactive node with sinteractive. For example, to request an interactive job with 4 cores and 8 GB of memory, use:

$ sinteractive --mem=8g -c 4

Once an interactive node has been allocated, use bedops as described above.
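Because the work splits cleanly by chromosome, an interactive session with several cores can process several chromosomes at once. The sketch below uses xargs -P to cap the number of simultaneous bedmap processes. The function name bedmap_by_chrom is hypothetical, and the file paths are the ones used in the examples above.

```shell
#!/bin/bash
# Sketch only: bedmap_by_chrom is an illustrative helper.
# Usage: bedmap_by_chrom <max processes> <chrom> [<chrom> ...]
bedmap_by_chrom() {
    local ncpu=$1; shift
    # One chromosome name per input line; xargs starts up to $ncpu shells,
    # each receiving the chromosome name as $1.
    printf '%s\n' "$@" \
      | xargs -P "$ncpu" -I{} sh -c \
          'bedmap --chrom "$1" --ec --delim "\t" --range 1000 --echo --count annot/140609_refseq_tss.bed starch/a13.starch > "a13_tss_count_$1.bed"' _ {}
}
```

On the 4 core allocation requested above, a call such as bedmap_by_chrom 4 chr1 chr2 chr3 would run from /data/$USER/test_data after loading the bedops module.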

Documentation