High-Performance Computing at the NIH
bamUtil on Biowulf & Helix


bamUtil is a collection of utilities for manipulating SAM/BAM files, compiled into a single executable called bam. Available tools:

Tools to rewrite SAM/BAM files
convert Convert SAM/BAM to SAM/BAM
writeRegion Write a file with the reads that fall in the specified region and/or have the specified read name
splitChromosome Split BAM by Chromosome
splitBam Split a BAM file into multiple BAM files based on ReadGroup
findCigars Output just the reads that contain any of the specified CIGAR operations.
Tools to modify SAM/BAM files
clipOverlap Clip overlapping read pairs in a SAM/BAM File already sorted by Coordinate or ReadName
filter Filter reads by clipping ends whose mismatch percentage is too high and by marking reads unmapped if the quality of mismatches is too high
revert Revert SAM/BAM replacing the specified fields with their previous values (if known) and removes specified tags
squeeze Reduce file size by dropping OQ fields, duplicates, & specified tags, using '=' when a base matches the reference, binning quality scores, and replacing readNames with unique integers
trimBam Trim the ends of reads in a SAM/BAM file changing read ends to 'N' and quality to '!'
mergeBam Merge multiple BAMs and headers, appending ReadGroupIDs if necessary
polishBam Add/update header lines & add the RG tag to each record
dedup Mark Duplicates
recab Recalibrate
Tools to extract information
validate Validate a SAM/BAM File
diff Diff 2 coordinate sorted SAM/BAM files.
stats Stats a SAM/BAM File
gapInfo Print information on the gap between read pairs in a SAM/BAM File.
dumpHeader Print SAM/BAM Header
dumpRefInfo Print SAM/BAM Reference Name Information
dumpIndex Print BAM Index File in English
readReference Print the reference string for the specified region
explainFlags Describe flags
bam2FastQ Convert the specified BAM file to fastQs.
readIndexedBam Read Indexed BAM By Reference and write it from reference id

Running bamUtil on Helix

Available versions of bamUtil can be listed with

helix$ module avail bamutil
----------------- /usr/local/lmod/modulefiles -----------------
   bamutil/1.0.12    bamutil/1.0.13 (D)

   (D):  Default Module

After loading the module, which adds bamUtil to the path, use bam tool [args] to manipulate BAM files:

helix$ module load bamutil/1.0.13
helix$ cd /data/$USER/test_data/bam
helix$ bam stats --in read1_250k.bam --qual --basic
Number of records read = 170016

TotalReads      170016.00
MappedReads     170016.00
PairedReads     0.00
ProperPair      0.00
helix$ bam explainFlags --dec 96
0x60 (96):
        mate reverse strand
        first fragment
helix$ bam dedup --in read1_500k_sorted.bam --out /scratch/$USER/temp.bam
helix$ cat /scratch/$USER/temp.bam.log
Total number of reads: 326389
Total number of paired-end reads: 0
Total number of properly paired reads: 0
Total number of unmapped reads: 0
Total number of reverse strand mapped reads: 162220
Total number of QC-failed reads: 0
Total number of secondary reads: 0
Size of singleKeyMap (must be zero): 0
Size of pairedKeyMap (must be zero): 0
Total number of missing mates: 0
Total number of reads excluded from duplicate checking: 0
Sorting the indices of 3651 duplicated records
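
The bam explainFlags result shown earlier in this session follows from the layout of the SAM FLAG field, which is a bit mask: 96 is 32 + 64, so exactly two bits are set. The following is a minimal sketch of that decoding in plain shell, for illustration only. The function name explain_flag is hypothetical, and apart from the two descriptions that appear in the transcript, the bit descriptions are informal wording based on the SAM specification, not bamUtil's exact output text.

```shell
#!/bin/sh
# explain_flag FLAG: list the bits that are set in a decimal SAM FLAG value.
# Illustrative helper only; it is not part of bamUtil.
explain_flag() (
    flag=$1
    printf '0x%x (%d):\n' "$flag" "$flag"
    # Each table row is "<bit value> <description>"; bit values are powers of two.
    while read -r bit name; do
        if [ $(( $flag & $bit )) -ne 0 ]; then
            printf '\t%s\n' "$name"
        fi
    done <<'EOF'
1 paired
2 properly paired
4 unmapped
8 mate unmapped
16 reverse strand
32 mate reverse strand
64 first fragment
128 last fragment
256 secondary alignment
512 failed quality checks
1024 duplicate
2048 supplementary alignment
EOF
)

explain_flag 96
```

For real work bam explainFlags remains the tool to use; the sketch only shows why a single integer can encode several properties of a read.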

Running a single bamUtil batch job on Biowulf

Set up a batch script to deduplicate and recalibrate qualities:

#! /bin/bash
#SBATCH --job-name=bamUtil

set -e
module load bamutil/1.0.13

# bamUtil wants to write to the directory containing the genome file, so
# work on a local copy. $genome, $dbsnp, $inbam and $obam must already be set.
mkdir -p "/lscratch/$USER"
cp "$genome" "/lscratch/$USER"
lgenome="/lscratch/$USER/$(basename "$genome")"

bam dedup --in "$inbam" --out "$obam" \
  --rmDups --verbose --recab \
  --refFile "$lgenome" \
  --dbsnp "$dbsnp" \
  --storeQualTag OQ --maxBaseQual 40
rm -rf "/lscratch/$USER"

The batch script is submitted for processing with:

sbatch --mem=4g bamUtil_batch_script.sh
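
The batch script reads $genome, $dbsnp, $inbam and $obam without defining them, so they have to reach the job some other way. Because the script runs under set -e but not set -u, an unset variable would silently expand to an empty string. The following is a sketch of a guard that could be placed near the top of the script. The function name require_vars and the file name sample.bam are hypothetical and are not part of bamUtil or Slurm.

```shell
#!/bin/sh
# require_vars NAME...: succeed only if every named variable is set and non-empty.
# Hypothetical helper for illustration; variable names are assumed to be trusted.
require_vars() (
    missing=0
    for v in "$@"; do
        # Look up the value of the variable whose name is stored in $v.
        eval "val=\${$v:-}"
        if [ -z "$val" ]; then
            echo "error: variable $v is not set or empty" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
)

# Demonstration: only inbam is given a (made-up) value here, so the check
# fails and names each variable that is still missing.
inbam=sample.bam
require_vars genome dbsnp inbam obam || echo "aborting: set the variables listed above"
```

In the batch script the last line would become require_vars genome dbsnp inbam obam || exit 1. One way to supply the values at submission time is sbatch's --export option, for example --export=ALL,genome=...,dbsnp=...,inbam=...,obam=... with paths appropriate to your data.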

Running a swarm of bamUtil batch jobs on Biowulf

Set up a swarm file with one task per line (line continuations allowed). For example, to squeeze the file size of a set of bam files:

bam squeeze --in file1.bam --out file1s.bam --refFile genome.fa 
bam squeeze --in file2.bam --out file2s.bam --refFile genome.fa 
bam squeeze --in file3.bam --out file3s.bam --refFile genome.fa 
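
For more than a handful of inputs, the lines above do not need to be typed by hand. The following sketch generates the same three-line swarm file from a list of BAM names. The function name make_squeeze_swarm is hypothetical, and the output naming (an "s" appended before the .bam extension) simply mirrors the example above.

```shell
#!/bin/sh
# make_squeeze_swarm REF BAM...: print one "bam squeeze" task per input BAM.
# Hypothetical helper for illustration; it only writes text and runs nothing.
make_squeeze_swarm() (
    ref=$1
    shift
    for f in "$@"; do
        # file1.bam -> file1s.bam, following the naming used in the example
        printf 'bam squeeze --in %s --out %s --refFile %s\n' \
            "$f" "${f%.bam}s.bam" "$ref"
    done
)

make_squeeze_swarm genome.fa file1.bam file2.bam file3.bam > swarmfile
```

Passing the inputs explicitly, rather than globbing *.bam inside the function, avoids picking up squeezed outputs from an earlier run as new inputs.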

The swarm file is then executed with default settings

biowulf$ swarm -f swarmfile --module bamutil

Running an interactive job on Biowulf

After starting an interactive session on a compute node with sinteractive, bamUtil is used as described above. For example:

biowulf$ sinteractive
salloc.exe: Granted job allocation nnnnnn
srun: error: x11: no local DISPLAY defined, skipping
cn0147$ module load bamutil
cn0147$ bam stats --in read1_250k.bam --qual --basic
cn0147$ exit

Each individual tool in the table above is linked to its corresponding section in the bamUtil manual.