Bamutil on Biowulf

Quick Links

bamUtil is a collection of utilities for manipulating bam files compiled into a single executable called bam. Available tools:

Tools to rewrite SAM/BAM files
convert	Convert SAM/BAM to SAM/BAM
writeRegion	Write a file with reads in the specified region and/or have the specified read name
splitChromosome	Split BAM by Chromosome
splitBam	Split a BAM file into multiple BAM files based on ReadGroup
findCigars	Output just the reads that contain any of the specified CIGAR operations.
Tools to modify SAM/BAM files
clipOverlap	Clip overlapping read pairs in a SAM/BAM File already sorted by Coordinate or ReadName
filter	Filter reads by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high
revert	Revert SAM/BAM replacing the specified fields with their previous values (if known) and removes specified tags
squeeze	reduces files size by dropping OQ fields, duplicates, & specified tags, using '=' when a base matches the reference, binning quality scores, and replacing readNames with unique integers
trimBam	Trim the ends of reads in a SAM/BAM file changing read ends to 'N' and quality to '!'
mergeBam	merge multiple BAMs and headers appending ReadGroupIDs if necessary
polishBam	adds/updates header lines & adds the RG tag to each record
dedup	Mark Duplicates
recab	Recalibrate
Tools to extract information
validate	Validate a SAM/BAM File
diff	Diff 2 coordinate sorted SAM/BAM files.
stats	Stats a SAM/BAM File
gapInfo	Print information on the gap between read pairs in a SAM/BAM File.
dumpHeader	Print SAM/BAM Header
dumpRefInfo	Print SAM/BAM Reference Name Information
dumpIndex	Print BAM Index File in English
readReference	Print the reference string for the specified region
explainFlags	Describe flags
bam2FastQ	Convert the specified BAM file to fastQs.
readIndexedBam	Read Indexed BAM By Reference and write it from reference id

Documentation

bamutil Main Site

Important Notes

Module Name: bamutil (see the modules page for more information)
Singlethreaded app

Interactive job

Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load bamutil
[+] Loading bamutil  1.0.14 

[user@cn3144 ~]$ cd /data/$USER/test_data/bam

[user@cn3144 ~]$ bam stats --in read1_250k.bam --qual --basic
Number of records read = 170016

TotalReads      170016.00
MappedReads     170016.00
PairedReads     0.00
ProperPair      0.00
.....

[user@cn3144 ~]$ bam explainFlags --dec 96
0x60 (96):
        mate reverse strand
        first fragment

[user@cn3144 ~]$ bam dedup --in read1_500k_sorted.bam --out /scratch/$USER/temp.bam \
--rmDups

[user@cn3144 ~]$ cat /scratch/$USER/temp.bam.log
......
--------------------------------------------------------------------------
SUMMARY STATISTICS OF THE READS
Total number of reads: 326389
Total number of paired-end reads: 0
Total number of properly paired reads: 0
Total number of unmapped reads: 0
Total number of reverse strand mapped reads: 162220
Total number of QC-failed reads: 0
Total number of secondary reads: 0
Size of singleKeyMap (must be zero): 0
Size of pairedKeyMap (must be zero): 0
Total number of missing mates: 0
Total number of reads excluded from duplicate checking: 0
--------------------------------------------------------------------------
Sorting the indices of 3651 duplicated records

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$

Batch job

Most jobs should be run as batch jobs.

Create a batch input file (e.g. bamutil.sh). For example:

#!/bin/bash
#SBATCH --job-name=bamUtil

set -e
module load bamutil
inbam=/data/$USER/test_data/bam/gcat_set_025_raw.bam
obam=/data/$USER/test_data/temp/gcat_set_025.clean.bam
dbsnp=/data/$USER/test_data/snp/snp138_pos.hg19
genome=/fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa

# bamutil wants to write to the directory containing the genome file
mkdir /lscratch/$SLURM_JOB_ID
cp $genome /lscratch/$SLURM_JOB_ID
lgenome=/lscratch/$SLURM_JOB_ID/$(basename $genome)

bam dedup --in $inbam --out $obam \
  --rmDups --verbose --recab \
  --refFile $lgenome \
  --dbsnp $dbsnp \
  --storeQualTag OQ --maxBaseQual 40
rm -rf /lscratch/$SLURM_JOB_ID

Submit this job using the Slurm sbatch command.

sbatch --mem=4g bamutil.sh

Swarm of Jobs

A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. bamutil.swarm). For example:

bam squeeze --in file1.bam --out file1s.bam --refFile genome.fa 
bam squeeze --in file2.bam --out file2s.bam --refFile genome.fa 
bam squeeze --in file3.bam --out file3s.bam --refFile genome.fa

Submit this job using the swarm command.

swarm -f bamutil.swarm --module bamutil

where

--module bamutil Loads the bamutil module for each subjob in the swarm