High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
Bam-readcount on Biowulf and Helix

Bam-readcount generates metrics at single nucleotide positions.
There are number of metrics generated which can be useful for filtering out false positive calls.


Running on Helix

Sample session:

helix$ module load bamreadcount
helix$ cd /data/$USER/dir
helix$ bam-readcount -h
Usage: bam-readcount [OPTIONS]  [region]
Generate metrics for bam_file at single nucleotide positions.
Example: bam-readcount -f ref.fa some.bam

Available options:
  -h [ --help ]                         produce this message
  -v [ --version ]                      output the version number
  -q [ --min-mapping-quality ] arg (=0) minimum mapping quality of reads used 
                                        for counting.
  -b [ --min-base-quality ] arg (=0)    minimum base quality at a position to 
                                        use the read for counting.
  -d [ --max-count ] arg (=10000000)    max depth to avoid excessive memory 
                                        usage.
  -l [ --site-list ] arg                file containing a list of regions to 
                                        report readcounts within.
  -f [ --reference-fasta ] arg          reference sequence in the fasta format.
  -D [ --print-individual-mapq ] arg    report the mapping qualities as a comma
                                        separated list.
  -p [ --per-library ]                  report results by library.
  -w [ --max-warnings ] arg             maximum number of warnings of each type
                                        to emit. -1 gives an unlimited number.
  -i [ --insertion-centric ]            generate indel centric readcounts. 
                                        Reads containing insertions will not be
                                        included in per-base counts

Submitting a single batch job

1. Create a script file. The file will contain the lines similar to the lines below.

#! /bin/bash 
#SBATCH --mail-type=BEGIN,END,FAIL 

module load bamreadcount 
cd /data/$USER/dir 
bam-readcount -f ref.fa some.bam
....
....

2. Submit the script on Biowulf.

$ sbatch myscript

see biowulf user guide for more options such as allocate more memory and longer walltime if needed.

Submit a swarm of jobs

Using the 'swarm' utility, one can submit many jobs to the cluster to run concurrently.

Set up a swarm command file (eg /data/$USER/cmdfile). Here is a sample file:

cd /data/user/run1/; bam-readcount -f ref.fa some.bam
cd /data/user/run2/; bam-readcount -f ref.fa some.bam
cd /data/user/run3/; bam-readcount -f ref.fa some.bam
........

The -f flag is required to specify swarm file name.

Submit the swarm job:

$ swarm -f swarmfile --module bamreadcount

For more information regarding running swarm, see swarm.html

 

Running an interactive job

User may need to run jobs interactively sometimes. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.

[user@biowulf]$ sinteractive 

[user@pXXXX]$ cd /data/$USER/myruns

[user@pXXXX]$ module load bamreadcount

[user@pXXXX]$ bam-readcount -f ref.fa some.bam
[user@pXXXX]$ exit
slurm stepepilog here!
                   
[user@biowulf]$ 

Documentation

https://github.com/genome/bam-readcount