High-Performance Computing at the NIH
samblaster on Biowulf & Helix

Description

samblaster marks duplicates and finds discordant and split read pairs in read-id grouped paired-end SAM files. When marking duplicates, samblaster uses about 20MB of memory per 1M read pairs. In a read-id grouped SAM file, all alignments for a read-id (QNAME) are contiguous. Aligners naturally produce such files; they can also be created by sorting a SAM file by read-id.
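
To make the grouping requirement concrete, the sketch below checks whether a SAM stream is read-id grouped, i.e. whether every QNAME forms a single contiguous run of lines. The function name is_qname_grouped is hypothetical and not part of samblaster or samtools. It assumes uncompressed SAM text on standard input and holds every read name in memory, so it is only meant for spot checks on small files.

```shell
# Hypothetical helper (not part of samblaster): exit 0 if every QNAME in the
# SAM text on stdin appears as one contiguous run of lines, exit 1 otherwise.
is_qname_grouped() {
  awk -F'\t' '
    BEGIN { bad = 0 }
    /^@/ { next }                          # skip SAM header lines
    $1 != prev {                           # a new run of read names starts here
      if ($1 in seen) { bad = 1; exit }    # name already seen in an earlier run
      seen[$1] = 1
      prev = $1
    }
    END { exit bad }
  '
}
```

A possible use is `samtools view -h input.bam | is_qname_grouped && echo grouped`, where input.bam is a placeholder path.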

References

Web sites

On Helix

Set up the environment

helix$ module load samblaster
[+] Loading samblaster 0.1.22

Run samblaster on a BAM file that is sorted by read name and already has duplicates marked. Save discordant pairs to disc.sam and split reads to split.sam.

helix$ samtools view -h /usr/local/apps/samblaster/TEST_DATA/test.bam \
  | samblaster --ignoreUnmated -a -e -d disc.sam -s split.sam -o /dev/null

Batch job on Biowulf

Create a batch file similar to the following

#! /bin/bash

module load samtools samblaster || exit 1
samtools view -h /path/to/input.bam \
  | samblaster -e -d disc.sam -s split.sam -o /dev/null

Submit to the queue, requesting about 20MB of memory per 1M read pairs

b2$ sbatch --cpus-per-task=2 --mem=2GB batchscript.sh
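
The 20MB per 1M read pairs rule of thumb can be turned into a rough memory request, as in the sketch below. The function name samblaster_mem_mb and the default headroom of 512MB (for samtools and the shell) are assumptions made for illustration and do not come from the samblaster documentation.

```shell
# Hypothetical helper: rough memory request in MB for a given number of read pairs.
# $1 = number of read pairs
# $2 = extra headroom in MB (assumed default: 512)
samblaster_mem_mb() (
  pairs=$1
  headroom_mb=${2:-512}
  # Round the pair count up to whole millions, then apply 20MB per million.
  echo $(( (pairs + 999999) / 1000000 * 20 + headroom_mb ))
)
```

For example, `samblaster_mem_mb 50000000` prints 1512, which could be passed to sbatch as `--mem=1512`, since sbatch interprets a bare number as megabytes.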

Swarm of jobs on Biowulf

Create a swarm command file similar to the following (line continuations are allowed)

samtools view -h /path/to/input1.bam \
  | samblaster -e -d disc1.sam -s split1.sam -o /dev/null
samtools view -h /path/to/input2.bam \
  | samblaster -e -d disc2.sam -s split2.sam -o /dev/null
samtools view -h /path/to/input3.bam \
  | samblaster -e -d disc3.sam -s split3.sam -o /dev/null

and submit the jobs with

b2$ swarm -t2 -g 2 --module samtools,samblaster -f samblaster.cmd
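
For more than a handful of inputs, the swarm command file can be generated rather than written by hand. The sketch below prints one command per BAM file. The function name make_swarm_file and the output naming scheme (NAME.disc.sam and NAME.split.sam, derived from the input file name) are assumptions for illustration. It also assumes the paths contain no spaces or shell metacharacters.

```shell
# Hypothetical helper: print one swarm command line per BAM file given as argument.
make_swarm_file() {
  for bam in "$@"; do
    base=$(basename "$bam" .bam)   # e.g. /data/a.bam -> a
    printf 'samtools view -h %s | samblaster -e -d %s.disc.sam -s %s.split.sam -o /dev/null\n' \
      "$bam" "$base" "$base"
  done
}
```

A possible use is `make_swarm_file /path/to/*.bam > samblaster.cmd`, followed by the swarm submission shown above.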

Interactive job on Biowulf

Allocate an interactive session with sufficient memory and then use as described above

b2$ sinteractive -c2 --mem=4G
salloc.exe: Granted job allocation 5202342
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn1875 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
cn1875$ samtools view -h /usr/local/apps/samblaster/TEST_DATA/test.bam \
  | samblaster --ignoreUnmated -a -e -d disc.sam -s split.sam -o /dev/null
cn1875$ exit

Documentation