High-Performance Computing at the NIH
Bfast on Biowulf & Helix

Blat-like Fast Accurate Search Tool (BFAST) facilitates the fast and accurate mapping of short reads to reference sequences. Some advantages of BFAST include:

- Speed: enables billions of short reads to be mapped quickly.
- Accuracy: a priori probabilities for mapping reads with a defined set of variants.
- An easy way to measurably tune accuracy at the expense of speed.

Specifically, BFAST was designed to facilitate whole-genome resequencing, where mapping billions of short reads with variants is of utmost importance. BFAST supports both Illumina and ABI SOLiD data, as well as other next-generation sequencing technologies (454, Helicos), with particular emphasis on sensitivity towards errors, SNPs and especially indels. Other algorithms take shortcuts by ignoring errors or certain types of variants (indels), and some even require further alignment, all to be the "fastest" (but still not complete). BFAST can be tuned to find variants regardless of the error rate, polymorphism rate, or other factors.

Bfast+BWA is also available via 'module load bfast+bwa' and can replace some of the Bfast steps. It is documented at http://www.nilshomer.com/index.php?title=BFAST_with_BWA

Running on Helix

$ module load bfast
$ cd /data/$USER/dir
$ ill2fastq.pl -q s <N>
$ bfast fasta2brg -f hg18.fa
$ bfast index -f hg18.fa -m <mask> -w 14 -i <index number>
$ bfast match -f hg18.fa -r reads.s <N>.fastq > bfast.matches.file.s <N>.bmf
$ bfast localalign -f hg18.fa -m bfast.matches.file.s <N>.bmf > bfast.aligned.file.s <N>.baf
$ bfast postprocess -f hg18.fa -i bfast.aligned.file.s <N>.baf > bfast.reported.file.s <N>.sam
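
Each step above writes its result through a shell redirect, so a failed step can leave an empty file that the next step then reads. The sketch below shows one way to guard against that. The helper name require_output and the file names in the comment are hypothetical and are not part of BFAST.

```shell
#!/bin/bash
# Sketch only: require_output is a hypothetical helper, not a BFAST command.
require_output() {
    # [ -s FILE ] is true only if FILE exists and is larger than zero bytes.
    if [ ! -s "$1" ]; then
        echo "missing or empty output: $1" >&2
        return 1
    fi
}

# Intended use between pipeline steps (placeholder file names):
#   bfast match -f hg18.fa -r reads.fastq > reads.bmf && require_output reads.bmf
```

Chaining with && stops the sequence at the first step that fails or produces no output.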

Running a single batch job on Biowulf

1. Create a script file containing lines similar to those below.

#!/bin/bash


module load bfast
cd /data/$USER/dir

bfast localalign -n $SLURM_CPUS_PER_TASK -f hg18.fa -m bfast.matches.file.s <N>.bmf > bfast.aligned.file.s <N>.baf

The example above uses the '-n' option to run the alignment with multiple threads.

2. Submit the script on Biowulf:

$ sbatch --cpus-per-task=4 script

The job is allocated 4 CPUs, and $SLURM_CPUS_PER_TASK in the script is automatically set to 4.

If more memory is required (the default is 4 GB), specify --mem=Mg, where M is the number of gigabytes, for example --mem=10g.
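
Slurm sets $SLURM_CPUS_PER_TASK only when --cpus-per-task is given at submission, so a script that passes it to '-n' gets an empty value otherwise. A minimal sketch of a fallback, assuming a single thread is an acceptable default; the helper name bfast_threads is hypothetical:

```shell
#!/bin/bash
# Sketch only: bfast_threads is a hypothetical helper, not part of BFAST or Slurm.
bfast_threads() {
    # Use the Slurm-provided CPU count if set and non-empty, otherwise 1.
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# In the batch script this value would be passed to localalign as: -n "$(bfast_threads)"
echo "localalign would run with -n $(bfast_threads)"
```

The thread and memory requests can be combined in one submission, for example 'sbatch --cpus-per-task=4 --mem=10g script'.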

Running a swarm of jobs on Biowulf

Set up a swarm command file:

  cd /data/$USER/dir1; bfast localalign -n $SLURM_CPUS_PER_TASK -f hg18.fa -m bfast.matches.file.s <N>.bmf > bfast.aligned.file.s <N>.baf
  cd /data/$USER/dir2; bfast localalign -n $SLURM_CPUS_PER_TASK -f hg18.fa -m bfast.matches.file.s <N>.bmf > bfast.aligned.file.s <N>.baf
  cd /data/$USER/dir3; bfast localalign -n $SLURM_CPUS_PER_TASK -f hg18.fa -m bfast.matches.file.s <N>.bmf > bfast.aligned.file.s <N>.baf
	[......]
  

Submit the swarm file. The -t option sets the number of threads per command, which is assigned to $SLURM_CPUS_PER_TASK automatically; -f names the swarm file; and --module bfast loads the bfast module for each command line in the file:

  $ swarm -t 4 -f swarmfile --module bfast

If each command needs more memory, use -g. The example below allocates 10 GB per command:

  $ swarm -t 4 -f swarmfile -g 10 --module bfast
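
When there are many directories, the swarm command file can be generated rather than typed by hand. The following is a minimal sketch under assumptions: make_swarm_lines is a hypothetical helper, and the matches and aligned file names passed to it are placeholders for the real per-directory file names.

```shell
#!/bin/bash
# Sketch only: make_swarm_lines is a hypothetical helper, not part of swarm or BFAST.
# Usage: make_swarm_lines MATCHES_FILE ALIGNED_FILE DIR [DIR...]
make_swarm_lines() {
    local matches="$1" aligned="$2"
    shift 2
    local dir
    for dir in "$@"; do
        # The single-quoted format keeps $SLURM_CPUS_PER_TASK literal,
        # so it is expanded when the job runs, not when the file is written.
        printf 'cd %s; bfast localalign -n $SLURM_CPUS_PER_TASK -f hg18.fa -m %s > %s\n' \
            "$dir" "$matches" "$aligned"
    done
}

# Print one command line per directory; redirect to a file to create the swarm file.
make_swarm_lines reads.bmf reads.baf "/data/$USER/dir1" "/data/$USER/dir2" "/data/$USER/dir3"
```

Redirecting the output to swarmfile gives a file that can be submitted with the swarm commands shown above.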

Running an interactive job on Biowulf

It may be useful for debugging purposes to run jobs interactively. Such jobs should not be run on the Biowulf login node. Instead allocate an interactive node as described below, and run the interactive job there.

biowulf$ sinteractive 
salloc.exe: Granted job allocation 16535

cn999$ module load bfast
cn999$ cd /data/$USER/dir
cn999$ bfast localalign -f hg18.fa -m bfast.matches.file.s <N>.bmf > bfast.aligned.file.s <N>.baf
[...etc...]

cn999$ exit
exit

biowulf$

Make sure to exit the job once finished.

If more memory is needed, use --mem. For example:

biowulf$ sinteractive --mem=8g

Documentation

bfast-book.pdf