High-Performance Computing at the NIH
GitHub YouTube @nih_hpc RSS Feed
bbtools on Biowulf & Helix

Description

This extensive package of tools is being developed by Brian Bushnell. It includes

and many more.

On biowulf all bbtools are used through a wrapper script that automatically passes correct memory information to each of the tools. This is important as the automatic memory detection by the individual tools is not slurm aware and reports incorrect amounts of available memory.

Usage of the wrapper script:

NAME
       bbtools - BBMap short read aligner and other bioinformatic tools

SYNOPSIS
       bbtools command [options]

DESCRIPTION
       bbtools  is  a  convenient  frontend to the collection of bioinformatic
       tools created by Brian Bushnell at  the  Joint  Genome  Institute.   It
       includes  a short read mapper, a k-mer based normalization tool, refor-
       matting tools, and many more. The wrapper script will automatically set
       an  appropriate  maximum memory for the JVM executing the code. It will
       limit runs on helix to 10GB and use  SLURM  to  determine  the  correct
       amount of memory for batch/interactive jobs.

COMMANDS
       help    display this help message

       man     list  all  available commands with their descrcription and gen-
               eral usage information

       list    list all available commands

       All other commands  are  essentially  the  name  of  the  corresponding
       bbtools script without the extension.

OPTIONS
       Options  are  tool  specific.  See tool documentation for more details.
       use --help to get tool specific help.

AUTHOR OF WRAPPER SCRIPT
        Wolfgang Resch. Contact staff@hpc.nih.gov for help.

Use bbtools man to see an overview of all available tools as well as some general documentation.

There may be multiple versions of bbtools available. An easy way of selecting the version is to use modules. To see the modules available, type

module avail bbtools 

To select a module use

module load bbtools/[version]

where [version] is the version of choice.

Most bbtools applications are multithreaded. Make sure to match the number of cpus requested with the number of threads (or lower as bbtools often also starts subprocesses. Please read the docs for each tool carefully.

Environment variables set

Dependencies

The wrapper script takes care of loading a recent version of Java and samtools

Documentation

On Helix

Trim paired reads and make sure they have the correct quality scale. Write output in interleaved format as a single file. Also calculate histograms of base composition, length, and quality:

helix$ r1=/usr/local/apps/bbtools/TEST_DATA/read1_2000k.fastq.gz
helix$ r2=/usr/local/apps/bbtools/TEST_DATA/read2_2000k.fastq.gz
helix$ bbtools reformat in=${r1} in2=${r2} \
  out=read12_2000k.fastq.gz \
  qin=auto qout=33 ow=t cardinality=t \
  bhist=read12.bhist qhist=read12.qhist lhist=read12.lhist
Batch job on Biowulf

As an example for using a few of the tools we will download some bacterial genomes, simulate reads from these genomes in different ratios, apply kmer based normalization, align the raw and normalized reads to the genomes, and finally plot the coverage of the genomes before and after normalization.

This is all included in a single script for this example. For a real analysis, each of these steps should be carried out separately.

#! /bin/bash

ENSBACT="ftp://ftp.ensemblgenomes.org/pub/bacteria/release-30/fasta/bacteria"

module load bbtools || exit 1
module load R || exit 1

################################################################################
#                            copy bacterial genomes                            #
################################################################################
# from the TEST_DATA directory

mkdir -p data tmp out
cp /usr/local/apps/bbtools/TEST_DATA/drad.fa.gz data  # Deinococcus radiodurans
dr=data/drad.fa.gz
cp /usr/local/apps/bbtools/TEST_DATA/ecoli.fa.gz data # Eschericia coli
ec=data/ecoli.fa.gz
cp /usr/local/apps/bbtools/TEST_DATA/saur.fa.gz data  # Staphylococcus aureus
sa=data/saur.fa.gz

################################################################################
#                                simulate reads                                #
################################################################################
# simulate 50 nt reads with different coverage from each of the genomes. the
# amp parameter simulates amplification.

bbtools randomreads ref=$ec build=1 out=tmp/ec.fq length=50 \
    reads=2000000 seed=132565 snprate=0.05 amp=50
bbtools randomreads ref=$dr build=2 out=tmp/dr.fq length=50 \
    reads=4000000 seed=3565 snprate=0.05 amp=200
bbtools randomreads ref=$sa build=3 out=tmp/sa.fq length=50 \
    reads=12000000 seed=981 snprate=0.05 amp=200
cat tmp/ec.fq tmp/dr.fq tmp/sa.fq | gzip -c > data/mg_raw.fq.gz
rm tmp/ec.fq tmp/dr.fq tmp/sa.fq

################################################################################
#                              kmer normalization                              #
################################################################################
# normalize coverage and do error correction

bbtools bbnorm in=data/mg_raw.fq.gz out=data/mg_norm.fq.gz tmpdir=./tmp threads=$SLURM_CPUS_PER_TASK \
    target=20 k=25 ecc=t mindepth=2

################################################################################
#              align raw and normalized against the three genomes              #
################################################################################

bbtools bbsplit build=4 ref_ec=$ec ref_dr=$dr ref_sa=$sa
bbtools bbsplit build=4 in=data/mg_raw.fq.gz out_ec=out/mg_raw_ec.bam \
    out_dr=out/mg_raw_dr.bam out_sa=out/mg_raw_sa.bam refstats=out/mg_raw_stat \
    t=$SLURM_CPUS_PER_TASK
bbtools bbsplit build=4 in=data/mg_norm.fq.gz out_ec=out/mg_norm_ec.bam \
    out_dr=out/mg_norm_dr.bam out_sa=out/mg_norm_sa.bam refstats=out/mg_norm_stat \
    t=$SLURM_CPUS_PER_TASK

################################################################################
#                     calculate coverage for all 6 samples                     #
################################################################################
 
for f in out/mg_*.bam; do
    bbtools pileup in=$f out=${f%.bam}.covstat hist=${f%.bam}.covhist
done

# merge into a single file
rm -f out/mg_covhist.csv
for f in out/mg_*.covhist; do
    fld=( $(echo $f | tr '_.' '  ') )
    awk -v t=${fld[1]} -v o=${fld[2]} 'NR > 1 {{print $1","$2","t","o}}' $f \ 
        >> out/mg_covhist.csv
done

# plot the coverage data
echo > tmp/plot.R <<EOF
library(ggplot2)
dta <- read.table("out/mg_covhist.csv", sep=",", col.names=c("cov", "freq", "type", "org"))
dta <- subset(dta, cov < 400)
p <- ggplot(dta) +
    geom_line(aes(x=cov, y=freq, col=org), size=1) +
    scale_color_brewer(palette="Set1") +
    facet_wrap(~type, nrow=2, scale="free_x") +
    theme_bw(14) +
    theme(legend.position="top", panel.grid.minor=element_blank(), 
      panel.grid.major=element_blank(), panel.border=element_rect(colour="black"))
ggsave(p, file="out/mg_covhist.png")
EOF
Rscript tmp/plot.R

Submit to the queue with sbatch:

b2$ sbatch --mem=10G --cpus-per-task=6 examble.sh

This generates the following graphs demonstrating the effect of kmer based normalization:

normalization graph
Swarm of jobs on Biowulf

Create a swarm command file similar to the following example:

bbtools reformat in=a1.fq.gz in2=a2.fq.gz \
  out=a12.fq.gz \
  qin=auto qout=33 ow=t cardinality=t
bbtools reformat in=b1.fq.gz in2=b2.fq.gz \
  out=b12.fq.gz \
  qin=auto qout=33 ow=t cardinality=t
bbtools reformat in=c1.fq.gz in2=c2.fq.gz \
  out=c12.fq.gz \
  qin=auto qout=33 ow=t cardinality=t

And submit to the queue with swarm

b2$ swarm --module bbtools -f bbtools.swarm -t2 -g4
Interactive job on Biowulf

Allocate an interactive session with sinteractive and use as described above

b2$ sinteractive 
node$ module load bbtools
node$ bbtools reformat in=a1.fq.gz in2=a2.fq.gz \
  out=a12.fq.gz \
  qin=auto qout=33 ow=t cardinality=t
node$ exit
b2$