High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
bowtie1 on Biowulf2 & Helix

Description

Bowtie1 is a fast, multi-threaded, and memory efficient aligner for short read sequences. Bowtie uses a Burrows-Wheeler index to achieve a moderate memory footprint of 2 - 4 GB depending on genome size and alignment parameters. Performance generally scales well with thread count.

Note that this page only describes bowtie1. Bowtie2, which supports local alignment, gaps, and longer reads, is documented separately.

References

Web sites

Index files

Bowtie1 indices are available as part of the igenomes package under

/fdb/igenomes/[organism]/[source]/[build]/Sequence/BowtieIndex/*

More information on the locally available igenomes builds/organisms is available from our scientific database index. For more information about igenomes in general, iGenomes readme.

Performance considerations

The amount of time to complete the alignment of approximately 21M ChIP-Seq reads (replicates 1 and 2 of ENCODE experiment ENCSR000CDI, 36nt, H3K27ac ChIP from mouse embryonic fibroblasts) was measured as a function of the number of bowtie threads:

bowtie1 benchmarks

Based on this experiment, increasing the number of threads to more than 12 shows diminishing returns. Therefore the most resource efficient usage of bowtie1 would employ at most 12 threads. If you are gunzipping input files and piping output through samtools, please allocate extra CPUs. Otherwise, the node will be slightly overloaded, which results in a considerable performance penalty due to contention between threads.

Running bowtie1 on Helix

Bowtie is a multi-threaded application. The thread count is determined with -p / --threads and defaults to 1. Please do not run bowtie using more than 2 threads on the interactive helix system. Note that bowtie and bowtie2 are considered a single application in the environmental modules system and that therefore the default module for bowtie is the most recent version of bowtie 2. To use bowtie1, a version has to be requested explicitly.

The following example aligns single ended data, creating bam output directly and uncompressing gziped fastq on the fly

helix$  module load bowtie/1.1.1
helix$  module load samtools/1.2
helix$  cd /scratch/$USER
helix$  export BOWTIE_INDEXES=/fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/BowtieIndex/
helix$  ls $BOWTIE_INDEXES
genome.1.ebwt  genome.2.ebwt  genome.3.ebwt  genome.4.ebwt  genome.fa
genome.rev.1.ebwt  genome.rev.2.ebwt
helix$  zcat /usr/local/apps/bowtie/TEST_DATA/ENCFF001KPB.fastq.gz \
   | bowtie --strata --best --chunkmbs 256 -k2 -m1 -p2 --sam genome - \
   | samtools view -F4 -Sb - > bam/ENCFF001KPB.bam
[...snip...]
# reads processed: 9157799
# reads with at least one reported alignment: 7451956 (81.37%)
# reads that failed to align: 760226 (8.30%)
# reads with alignments suppressed due to -m: 945617 (10.33%)
Reported 7451956 alignments to 1 output stream(s)

Running a single bowtie1 batch job on Biowulf2

Batch scripts should use the $SLURM_CPUS_PER_TASK environment variable to determine the number of threads for bowtie. This variable is set by slurm according to the number of requested cores. An example script might look like the following:

#! /bin/bash
set -o pipefail
set -e

echo "Running on $SLURM_CPUS_PER_TASK cores"
module load bowtie/1.1.2 || exit 1
module load samtools/1.2 || exit 1

export BOWTIE_INDEXES=/fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/BowtieIndex/
zcat /usr/local/apps/bowtie/TEST_DATA/ENCFF001KPB.fastq.gz \
    | bowtie --strata --best -k2 -m1 --chunkmbs 256 -p $(( SLURM_CPUS_PER_TASK - 2 )) --sam genome - \
    | samtools view -F4 -Sb - > ENCFF001KPB.bam

Note that this script assigns bowtie fewer threads than allocated CPUs. The remaining CPUs are used by samtools and zcat.

The script is submitted to the batch system requesting, for example, 12 cores and 5GB of memory:

biowulf2$ sbatch --cpus-per-task=12 --mem=5g example_bowtie_slurm.sh

Running a swarm of bowtie1 batch jobs on Biowulf2

First a command file is created with one task per line (line continuations are allowed). The following example contains two tasks:

cd /data/$USER/test_data \
  && export BOWTIE_INDEXES=/fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/BowtieIndex/ \
  && zcat fastq/read1_250k.fastq.gz \
      | bowtie --strata --best -k2 -m1 -p $(( SLURM_CPUS_PER_TASK - 2 ))--sam genome - \
      | samtools view -F4 -Sb - > bam/read1_250k.bam
cd /data/$USER/test_data \
  && export BOWTIE_INDEXES=/fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/BowtieIndex/ \
  && zcat fastq/read1_500k.fastq.gz \
      | bowtie --strata --best -k2 -m1 -p $(( SLURM_CPUS_PER_TASK -2 )) --sam genome - \
      | samtools view -F4 -Sb - > bam/read1_500k.bam

The two tasks in the command file (example_bowtie_swarm) can then be executed using swarm requesting 8 cores and 5GB of memory for each task. Modules required for the tasks can be provided on the command line as well.

biowulf2$ swarm -f example_bowtie_swarm -t 8 -g 5 --module bowtie/1.1.2 samtools/1.2
2 commands run in 2 tasks, each requiring 5 gb, 8 threads, and 0 gb diskspace

Running an interactive job on Biowulf2

Interactive work requiring resources should be carried out on interactive compute nodes, not on the head node or helix. Interactive nodes are allocated with sinteractive. For example, to request a 8 core interactive node with 8GB memory:

$ sinteractive -c 8 --mem=5g

On the interactive node, bowtie is then used essentially as decribed above.

Documentation

For detailed information see the Home page and the Manual.

Usage: 
bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]

  <m1>    Comma-separated list of files containing upstream mates (or the
          sequences themselves, if -c is set) paired with mates in <m2>
  <m2>    Comma-separated list of files containing downstream mates (or the
          sequences themselves if -c is set) paired with mates in <m1>
  <r>     Comma-separated list of files containing Crossbow-style reads.  Can be
          a mixture of paired and unpaired.  Specify "-" for stdin.
  <s>     Comma-separated list of files containing unpaired reads, or the
          sequences themselves, if -c is set.  Specify "-" for stdin.
  <hit>   File to write hits to (default: stdout)
Input:
  -q                 query input files are FASTQ .fq/.fastq (default)
  -f                 query input files are (multi-)FASTA .fa/.mfa
  -r                 query input files are raw one-sequence-per-line
  -c                 query sequences given on cmd line (as <mates>, <singles>)
  -C                 reads and index are in colorspace
  -Q/--quals <file>  QV file(s) corresponding to CSFASTA inputs; use with -f -C
  --Q1/--Q2 <file>   same as -Q, but for mate files 1 and 2 respectively
  -s/--skip <int>    skip the first <int> reads/pairs in the input
  -u/--qupto <int>   stop after first <int> reads/pairs (excl. skipped reads)
  -5/--trim5 <int>   trim <int> bases from 5' (left) end of reads
  -3/--trim3 <int>   trim <int> bases from 3' (right) end of reads
  --phred33-quals    input quals are Phred+33 (default)
  --phred64-quals    input quals are Phred+64 (same as --solexa1.3-quals)
  --solexa-quals     input quals are from GA Pipeline ver. < 1.3
  --solexa1.3-quals  input quals are from GA Pipeline ver. >= 1.3
  --integer-quals    qualities are given as space-separated integers (not ASCII)
  --large-index      force usage of a 'large' index, even if a small one is present
Alignment:
  -v <int>           report end-to-end hits w/ <=v mismatches; ignore qualities
    or
  -n/--seedmms <int> max mismatches in seed (can be 0-3, default: -n 2)
  -e/--maqerr <int>  max sum of mismatch quals across alignment for -n (def: 70)
  -l/--seedlen <int> seed length for -n (default: 28)
  --nomaqround       disable Maq-like quality rounding for -n (nearest 10 <= 30)
  -I/--minins <int>  minimum insert size for paired-end alignment (default: 0)
  -X/--maxins <int>  maximum insert size for paired-end alignment (default: 250)
  --fr/--rf/--ff     -1, -2 mates align fw/rev, rev/fw, fw/fw (default: --fr)
  --nofw/--norc      do not align to forward/reverse-complement reference strand
  --maxbts <int>     max # backtracks for -n 2/3 (default: 125, 800 for --best)
  --pairtries <int>  max # attempts to find mate for anchor hit (default: 100)
  -y/--tryhard       try hard to find valid alignments, at the expense of speed
  --chunkmbs <int>   max megabytes of RAM for best-first search frames (def: 64)
Reporting:
  -k <int>           report up to <int> good alignments per read (default: 1)
  -a/--all           report all alignments per read (much slower than low -k)
  -m <int>           suppress all alignments if > <int> exist (def: no limit)
  -M <int>           like -m, but reports 1 random hit (MAPQ=0); requires --best
  --best             hits guaranteed best stratum; ties broken by quality
  --strata           hits in sub-optimal strata aren't reported (requires --best)
Output:
  -t/--time          print wall-clock time taken by search phases
  -B/--offbase <int> leftmost ref offset = <int> in bowtie output (default: 0)
  --quiet            print nothing but the alignments
  --refout           write alignments to files refXXXXX.map, 1 map per reference
  --refidx           refer to ref. seqs by 0-based index rather than name
  --al <fname>       write aligned reads/pairs to file(s) <fname>
  --un <fname>       write unaligned reads/pairs to file(s) <fname>
  --max <fname>      write reads/pairs over -m limit to file(s) <fname>
  --suppress <cols>  suppresses given columns (comma-delim'ed) in default output
  --fullref          write entire ref name (default: only up to 1st space)
Colorspace:
  --snpphred <int>   Phred penalty for SNP when decoding colorspace (def: 30)
     or
  --snpfrac <dec>    approx. fraction of SNP bases (e.g. 0.001); sets --snpphred
  --col-cseq         print aligned colorspace seqs as colors, not decoded bases
  --col-cqual        print original colorspace quals, not decoded quals
  --col-keepends     keep nucleotides at extreme ends of decoded alignment
SAM:
  -S/--sam           write hits in SAM format
  --mapq <int>       default mapping quality (MAPQ) to print for SAM alignments
  --sam-nohead       supppress header lines (starting with @) for SAM output
  --sam-nosq         supppress @SQ header lines for SAM output
  --sam-RG <text>    add <text> (usually "lab=value") to @RG line of SAM header
Performance:
  -o/--offrate <int> override offrate of index; must be >= index's offrate
  -p/--threads <int> number of alignment threads to launch (default: 1)
  --mm               use memory-mapped I/O for index; many 'bowtie's can share
  --shmem            use shared mem for index; many 'bowtie's can share
Other:
  --seed <int>       seed for random number generator
  --verbose          verbose output (for debugging)
  --version          print version information and quit
  -h/--help          print this usage message