High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
Burrows-Wheeler Alignment (BWA) Tool on Biowulf & Helix

BWA is a fast light-weighted tool that aligns short sequences to a sequence database, such as the human reference genome. By default, BWA finds an alignment within edit distance 2 to the query sequence, except for disallowing gaps close to the end of the query. It can also be tuned to find a fraction of longer gaps at the cost of speed and of more false alignments.

BWA excels in its speed. Mapping 2 million high-quality 35bp short reads against the human genome can be done in 20 minutes. Usually the speed is gained at the cost of huge memory, disallowing gaps and/or the hard limits on the maximum read length and the maximum mismatches. BWA does not. It is still relatively light-weighted (2.3GB memory for human alignment), performs gapped alignment, and does not set a hard limit on read length or maximum mismatches.

Given a database file in FASTA format, BWA first builds BWT index with the 'index' command. The alignments in suffix array (SA) coordinates are then generated with the 'aln' command. The resulting file contains ALL the alignments found by BWA. The 'samse/sampe' command converts SA coordinates to chromosomal coordinates. For single-end reads, most of computing time is spent on finding the SA coordinates (the aln command). For paired-end reads, half of computing time may be spent on pairing (the sampe command) given 32bp reads. Using longer reads would reduce the fraction of time spent on pairing because each end in a pair would be mapped to fewer places.

Index Files

Pre-build BWA index files are available in

/fdb/igenomes/[organism]/[source]/[build]/Sequence/BWAIndex/genome.fa

 

Memory Requirements

As a rule of thumb, assume that bwa will require as much or more memory as the size of the .bwt index file. For example, the hg19 bwa index file is 2.9gb:

$ ls -lLh /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.bwt
-rwxr-xr-x 1 maoj staff 2.9G Mar 15  2012 /fdb/igenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa.bwt

If the UCSC/hg19 BWA index file were used, the bwa process will need at least 3gb of memory.

Running on Helix

$ module load bwa
$ bwa index -a bwtsw file.csfasta
$ bwa aln -t 2 fele.csfasta file.fastq > file.sai
$ bwa samse file.csfasta file.sai file.fastq > file.sam

Running a single batch job on Biowulf

1. Create a script file. The file will contain the lines similar to the lines below.

#!/bin/bash


module load bwa
cd /data/$USER/dir
bwa index -a bwtsw file.csfasta
bwa aln -t $SLURM_CPUS_PER_TASK file.csfasta file.fastq > file.sai
bwa samse file.csfasta file.sai file.fastq > file.sam

2. Submit the script on biowulf:

$ sbatch --cpus-per-task=4 jobscript

The number assigned to --cpus-per-task flag will automatically assign to $SLURM_CPUS_PER_TASK in the script.

You would, of course, modify required memory to the needs of your job.

$ sbatch --cpus-per-task=4 --mem=10g jobscript

Running a swarm of jobs on Biowulf

Setup a swarm command file:

  cd /data/$USER/dir1; bwa aln -t $SLURM_CPUS_PER_TASK file.csfasta file.fastq > file.sai
  cd /data/$USER/dir2; bwa aln -t $SLURM_CPUS_PER_TASK file.csfasta file.fastq > file.sai
  cd /data/$USER/dir3; bwa aln -t $SLURM_CPUS_PER_TASK file.csfasta file.fastq > file.sai
	[......]
  

Submit the swarm file, -f specify the swarmfile name, -t is used to specify the same thread number as the bwa flag '-t', and --module will be loaded the required module for each command line in the file:

  $ swarm -t 4 -f swarmfile --module bwa

If a larger BWA index file were used, then the amount of memory per bwa process must be increased using the -g option:

  $ swarm -t 4 -g 10 -f swarmfile --module bwa

For more information regarding running swarm, see swarm.html

Running an interactive job on Biowulf

It may be useful for debugging purposes to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.

biowulf$ sinteractive --cpus-per-task=4 
salloc.exe: Granted job allocation 16535

cn999$ module load bwa
cn999$ cd /data/$USER/dir
cn999$ bwa aln -t 4 file.csfasta file.fastq > file.sai
[...etc...]

cn999$ exit
exit

biowulf$

Make sure to exit the job once finished.

Documentation

To see a full listing of the options available for bwa, type bwa at the prompt.

$ bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.6.2-r126
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa  [options]

Command: index         index sequences in the FASTA format
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries
         fastmap       identify super-maximal exact matches

         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ
         pac2cspac     convert PAC to color-space PAC
         stdsw         standard SW/NW alignment

Then for each command, individual help page can be displayed by:

$ bwa index

Usage:   bwa index [-a bwtsw|is] [-c] 

Options: -a STR    BWT construction algorithm: bwtsw or is [auto]
         -p STR    prefix of the index [same as fasta name]
         -6        index files named as .64.* instead of .* 

Warning: `-a bwtsw' does not work for short genomes, while `-a is' and
         `-a div' do not work not for long genomes. Please choose `-a'
         according to the length of the genome.

http://bio-bwa.sourceforge.net/bwa.shtml