High-Performance Computing at the NIH
smalt on Biowulf & Helix

Description

SMALT is a multithreaded aligner for DNA sequencing reads. It combines a hash index of short words with banded Smith-Waterman alignment to find the best gapped alignments for next-generation sequencing reads.

On Helix

SMALT uses a hash index of the reference to align sequences from a fastq file. It is a multithreaded program that runs on a single thread unless a thread count is given with -n. Example session on Helix:

helix$ module load smalt
helix$ smalt

              SMALT - Sequence Mapping and Alignment Tool
                             (version: 0.7.6)
SYNOPSIS:
    smalt <task> [TASK_OPTIONS] [<index_name> <file_name_A> [<file_name_B>]]

Available tasks:
    smalt check   - checks FASTA/FASTQ input
    smalt help    - prints a brief summary of this software
    smalt index   - builds an index of k-mer words for the reference
    smalt map     - maps single or paired reads onto the reference
    smalt sample  - sample insert sizes for paired reads
    smalt version - prints version information

Help on individual tasks:
    smalt <task> -H
helix$ #index a small genome for alignment with short reads
helix$ smalt index -k 13 -s 3 \
  sacCer3 \
  /fdb/igenomes/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/WholeGenomeFasta/genome.fa
[...snip...]
helix$ smalt map -f bam -o test.bam -T $PWD sacCer3 \
  /usr/local/apps/smalt/TEST_DATA/SRR332229.fastq.gz
[...snip...] 
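
Since smalt map relies on an index built beforehand with smalt index, a script can check that the index is present before it starts mapping. The following is a minimal sketch, assuming that smalt index writes its output to <index_name>.sma and <index_name>.smi (verify this against the installed version); the function name index_exists is hypothetical and not part of smalt.

```shell
#!/bin/bash
# Return success only if both index files for the given index name exist.
# Assumption: "smalt index" writes <index_name>.sma and <index_name>.smi.
index_exists() {
    local name="$1"
    [ -f "${name}.sma" ] && [ -f "${name}.smi" ]
}

# Possible use (not executed here):
#   index_exists sacCer3 || echo "index sacCer3 is missing, run smalt index first" >&2
```

The check only looks at file names, so it cannot tell whether an existing index was built with the intended -k and -s settings.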

Batch job on Biowulf

Create a batch script (for example smalt_batch.sh) similar to the following:

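#!/bin/bash
set -e

module load smalt samtools || exit 100
# private temporary directory for smalt, removed again when the script exits
tmp=$(mktemp -d /scratch/XXXXX) || exit 101
trap 'rm -rf "${tmp}"' EXIT
# map with as many threads as CPUs were allocated to the job
smalt map -f bam -o output.bam -T "${tmp}" -n ${SLURM_CPUS_PER_TASK} \
  sacCer3 \
  /path/to/fastq.gz || exit 102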

and submit it to the queue with sbatch:

biowulf$ sbatch --cpus-per-task=10 smalt_batch.sh
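
The batch script passes ${SLURM_CPUS_PER_TASK} to -n. Slurm sets that variable only when --cpus-per-task is requested, so the same script run without it (or outside a job) would leave -n without a value. A minimal sketch of a fallback to one thread, which matches smalt's single-thread default; the helper name smalt_threads is hypothetical.

```shell
#!/bin/bash
# Print the thread count to pass to "smalt map -n".
# Uses SLURM_CPUS_PER_TASK when Slurm has set it, otherwise falls back to 1.
smalt_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Possible use inside the batch script (not executed here):
#   smalt map -f bam -o output.bam -T "${tmp}" -n "$(smalt_threads)" sacCer3 /path/to/fastq.gz
```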

Swarm of jobs on Biowulf

Create a swarm command file (for example smalt_swarm_file) with one command per input file, similar to the following:

tmp=$(mktemp -d /scratch/XXXX) \
  && smalt map -f bam -o out1.bam -T ${tmp} -n ${SLURM_CPUS_PER_TASK} \
  genome_index /path/to/fastq1.gz
tmp=$(mktemp -d /scratch/XXXX) \
  && smalt map -f bam -o out2.bam -T ${tmp} -n ${SLURM_CPUS_PER_TASK} \
  genome_index /path/to/fastq2.gz
[...snip...]

and submit the commands as batch jobs with swarm, requesting 4 GB of memory (-g) and 10 threads (-t) per command:

biowulf$ swarm -g4 -t10 -f smalt_swarm_file
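
For more than a handful of input files, the swarm command file can be generated rather than written by hand. The sketch below prints one line per fastq file in the same form as the example above. The function name make_swarm_lines is hypothetical, output files are named after the part of the input file name before its first dot, and paths are assumed to contain no spaces.

```shell
#!/bin/bash
# Print one swarm command line per FASTQ file.
# Usage: make_swarm_lines INDEX_NAME FASTQ...
make_swarm_lines() {
    local index="$1"
    shift
    local fq base
    for fq in "$@"; do
        # Name the output after the input, e.g. sample1.fastq.gz -> sample1.bam
        base=$(basename "${fq}")
        base="${base%%.*}"
        # The single-quoted format keeps $(mktemp ...) and ${SLURM_CPUS_PER_TASK}
        # literal, so they are evaluated when the job runs, not when the file is written.
        printf 'tmp=$(mktemp -d /scratch/XXXX) && smalt map -f bam -o %s.bam -T ${tmp} -n ${SLURM_CPUS_PER_TASK} %s %s\n' \
            "${base}" "${index}" "${fq}"
    done
}

# Possible use (not executed here):
#   make_swarm_lines genome_index /path/to/*.gz > smalt_swarm_file
```

Each command is written on a single line, so no line continuations are needed in the generated file.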

Interactive job on Biowulf

Allocate an interactive session on a compute node and then use smalt as shown for Helix:

biowulf$ sinteractive --cpus-per-task=10
[...snip...]
cn0039$ module load smalt samtools
cn0039$ smalt index [...snip...]
cn0039$ smalt map [...snip...]
cn0039$ exit
biowulf$