High-Performance Computing at the NIH
Blast on Biowulf & Helix

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

Blast was developed at NCBI, NIH. (Blast website)

BLAST on Biowulf is intended for running a large number of sequence files, such as hundreds or thousands of query sequences, against the Blast databases. If you have just a few query sequences, you should use Blast on the NCBI website.

Both Blast+ and legacy Blast 2.2.26 are available on this system. To see all available versions, use the module commands as in the example below.

[susanc@biowulf ~]$ module avail blast

-------------------------- /usr/local/lmod/modulefiles -------------------------
   blast/2.2.26    blast/2.2.30+ (D)

  Where:
   (D):  Default Module
   
[susanc@biowulf ~]$  module load blast

[susanc@biowulf ~]$ module list

Currently Loaded Modules:
  1) blast/2.2.30+
  
[susanc@biowulf ~]$ makeblastdb
USAGE
  makeblastdb [-h] [-help] [-in input_file] [-input_type type]
    -dbtype molecule_type [-title database_title] [-parse_seqids]
    [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
    [-mask_desc mask_algo_descriptions] [-gi_mask]
    [-gi_mask_name gi_based_mask_names] [-out database_name]
    [-max_file_sz number_of_bytes] [-logfile File_Name] [-taxid TaxID]
    [-taxid_map TaxIDMapFile] [-version]

DESCRIPTION
   Application to create BLAST databases, version 2.2.30+

Use '-help' to print detailed descriptions of command line arguments
========================================================================

Loading a Blast module will add the blast programs and database-formatting programs to your path.

Easyblast

Easyblast is an easy interface to Blast on Biowulf. It is a wrapper script that will prompt you for all required parameters, set up your jobs appropriately and submit them to the cluster. You will need to have all your query sequences in multiple files in a single directory (multiple sequences per file is fine).

Easyblast will run the latest version of Blast+. The version will be printed at the beginning of the Easyblast run, and will also appear in the output files.

Note that Easyblast on Biowulf uses the latest Blast+ by default, not Legacy Blast 2.2.26.
There are significant changes between the two versions.
If you want to run Legacy Blast using the old-style parameters like '-b 1 -v 1', use 'easyblast_legacy' instead of 'easyblast'.

Sample session (user input follows each prompt):

[susanc@biowulf ~]$ easyblast              [use easyblast_legacy to run Blast 2.2.26]

EasyBlast: Blast 2.2.30+ for large numbers of sequences
Enter the directory which contains your input sequences: /data/susanc/blast/bench/100n

Enter the directory where you want your Blast output to go: /data/susanc/blast/bench/out
** WARNING: There are already files in /data/susanc/blast/bench/out which will be deleted by this job.
** Continue? (y/n) : y
Cleaning up output directory...

BLAST programs:
    blastn - nucleotide query sequence against nucleotide database
    blastp - protein query sequence against protein database
    blastx - nucleotide query translated in all 6 reading frames
          against a protein database
    tblastn - protein query sequence against a nucleotide database
          translated in all 6 reading frames
    tblastx - 6-frame translations of a nucleotide query sequence
          against the 6-frame translations of a nucleotide database
     rpsblast - protein query sequence against protein domain database
    rpstblastn - nucleotide query sequence against protein domain database
Which program do you want to run: blastn

The following nucleotide databases are available:
(or enter your own database with full pathname)
    nt - NCBI nonredundant Genbank+EMBL+DDBJ+PDB (no EST, STS, GSS or HTG)
    est_human - nonredundant Genbank+EMBL+DDBJ EST human sequences
    est_mouse - nonredundant Genbank+EMBL+DDBJ EST mouse sequences
    est_others - nonredundant Genbank+EMBL+DDBJ EST all other organisms
    pdbnt - from the 3-dimensional structures
    patnt - patent sequences
    htgs - high throughput genome sequences
    mito.nt - mitochondrial sequences
    yeast.nt - yeast (Saccharomyces cerevisiae) genomic sequences
    drosoph.nt - drosophila sequences
    hs_genome - human genome assembly (Build 37, hg19, Feb 2009)
    hs_genome.masked - human genome masked (Build 37, hg19, Feb 2009)
    hs_genome.rna - human genome RNA (Build 37, hg19, Feb 2009)
    human_genomic - human genomic sequences from NCBI
    mm10 - Mouse genome, (mm10)
    mm10.masked - Mouse genome, masked (mm10)
    mm10.rna - Mouse genome RNA (mm10)
    other_genomic - non-human genomic sequences
    human.rna - RefSeq human RNA
    mouse.rna - RefSeq mouse RNA
    16SMicrobial - Microbial 16S rRNA
    rhesus - Rhesus genome
    viral - Viral sequences
    wgs - Whole-Genome-Shotgun contigs
Database to run against: viral
Database /fdb/blastdb/viral.nal is 0.81 GB. Requesting 5 GB memory for each Blast process
Any additional Blast parameters (e.g. -task megablast -word_size 8 -ungapped ):

 Want an email message when the job ends? (y/n default n): y
Submitting with: /usr/local/bin/swarm --singleout --silent -f /home/susanc/blast_32568.swarm -g 5 --time=24:00:00 --job-name EBlast-05Aug2015-1154
Job 1068867 has been submitted.
Cleanup/summary job 1068868 has been submitted.
[susanc@biowulf ~]$

When the job completes, the output files will appear in your specified output directory. If you requested a summary, a file called 'summary' will also be in this directory. It will contain the hits from each Blast result (no alignments) so that you can scroll through it easily.

To run against your own database, enter the db name with full path at the Database: prompt. For example:

       Database to run against: /data/username/blast_db/my_db

Database files have suffixes like .nsq, .nin (nucleotide) or .psq, .pin (protein). Enter the full path and the database name without the suffix. Thus, if your database files are called my_db.nsq etc., enter the database name as my_db.

You can put multiple sequences into each of your input sequence files. However, if you have only 1 sequence file containing a large number of sequences, this will run on 1 node, with all the Blast runs performed sequentially. You can better utilize the parallel processing power of Biowulf by dividing your sequences into a larger number of files.

If your query sequences are all in one file, and you need to split them into multiple sequence files, there are a couple of utilities available:
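Independently of those utilities, the splitting step itself can be sketched with awk. This is only an illustration under stated assumptions: the function name split_fasta, the chunk_NNNN.fas naming, and the default of 100 sequences per file are all made up here and are not part of Easyblast or the Biowulf tools.

```shell
# split_fasta: write the records of a multi-FASTA file into numbered chunk files.
# Hypothetical helper, not a Biowulf utility.
# Usage: split_fasta INPUT.fas OUTDIR [SEQS_PER_FILE]
split_fasta() {
    mkdir -p "$2"
    awk -v dir="$2" -v n="${3:-100}" '
        /^>/ {
            # A header line starts a new record; open a new chunk every n records.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s/chunk_%04d.fas", dir, count / n + 1)
            }
            count++
        }
        # Copy every line (header or sequence) into the current chunk.
        out != "" { print > out }
    ' "$1"
}
```

For example, split_fasta /data/user/all_seqs.fas /data/user/myseqs 50 would leave files of up to 50 sequences each in /data/user/myseqs, which can then be given to Easyblast as the input directory.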

BLAST Databases
Local copies of the sequence databases used by Blast are in the directory /fdb/blastdb. These data are a weekly-updated mirror of the ftp://ncbi.nlm.nih.gov/blast/db/ directory maintained by NCBI. Some databases were built from Fasta-format files with the command:

formatdb -o T

Running via swarm

Easyblast uses swarm. If you prefer to use swarm directly, you can use easyblast to set up a swarm command file, but submit the swarm yourself. Easyblast will, by default, determine the amount of memory required for each command based on the database size. In some cases, Blast runs may require more memory than the default, in which case you should submit the swarm yourself.

Use the '-d' flag to easyblast to set up the swarm command file but not submit it. Example:

[susanc@biowulf ~]$ easyblast -d

Easyblast running in debug mode. The swarm command file will be generated, but it is left to you to submit it as desired.

EasyBlast: Blast 2.2.30+ for large numbers of sequences
Enter the directory which contains your input sequences: /data/susanc/blast/bench/10n

[....usual easyblast run....]

---------- Easyblast running in debug mode, no swarm submitted -------------------
Swarm command file created in /home/susanc/blast_13485.swarm
Recommended swarm submission command
swarm --singleout -f /home/susanc/blast_13485.swarm --module blast -g 5

You can now edit this swarm command file, or submit it requesting more memory than the default (5 GB in the example above). For example:

swarm --singleout -f /home/susanc/blast_13485.swarm --module blast -g 10

Of course, you can set up a swarm command file yourself and avoid easyblast altogether, along the following lines:

# this file is called blastcmd
#
blastn -db /fdb/blastdb/nt -query /data/user/myseqs/seq1.fas -out /data/user/blastout/seq1.out
blastn -db /fdb/blastdb/nt -query /data/user/myseqs/seq2.fas -out /data/user/blastout/seq2.out
blastn -db /fdb/blastdb/nt -query /data/user/myseqs/seq3.fas -out /data/user/blastout/seq3.out
blastn -db /fdb/blastdb/nt -query /data/user/myseqs/seq4.fas -out /data/user/blastout/seq4.out
[...]

Determine the size of the database files (see the section on Blast jobs and memory). Suppose the database is 8.3 GB: round up to 9 GB and submit the swarm requesting 9 GB with the '-g 9' flag.

Submit this swarm with

swarm -g 9 -f blastcmd --module blast
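
When there are many query files, the command file can be generated with a loop instead of being typed by hand. The following is a minimal sketch, assuming the query files end in .fas as in the example above; the function name make_blast_swarm is made up for illustration.

```shell
# make_blast_swarm: print one blastn command per .fas file in QUERY_DIR.
# Hypothetical helper; the blastn options are the ones used in the example above.
# Usage: make_blast_swarm QUERY_DIR OUT_DIR DB > blastcmd
make_blast_swarm() {
    for query in "$1"/*.fas; do
        [ -e "$query" ] || continue          # the glob matched nothing
        name=$(basename "$query" .fas)       # seq1.fas -> seq1
        echo "blastn -db $3 -query $query -out $2/$name.out"
    done
}
```

Running make_blast_swarm /data/user/myseqs /data/user/blastout /fdb/blastdb/nt > blastcmd would produce a file of the same form as the one shown above.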

Note that the Blast+ versions (e.g. Blast 2.2.27+) do not include a 'blastall' executable. [Brief summary of differences between '+' and legacy versions]

Blast Database Update Status -- status of all Blast databases installed on the system.

Blast Databases

A large number of Blast-formatted databases are centrally installed and maintained by the HPC staff. See the database page for a current list, including date of last update.

Blast jobs and Memory

If you set up your own Blast jobs rather than using Easyblast, note that it is fastest and most efficient if the jobs allocate enough memory for the Blast database you're using. If there is insufficient memory, the database will be read over and over from disk, which is I/O intensive and will slow down your job.

Calculate the memory required by totalling the size of the .nsq (for nucleotide databases) or .psq (for protein databases) files. e.g.

biowulf% ls -l /fdb/blastdb/patnt*.nsq
-rw-rw-r-- 1 helixapp staff  978015168 May 10 08:29 /fdb/blastdb/patnt.00.nsq
-rw-rw-r-- 1 helixapp staff  933912412 May  6 04:33 /fdb/blastdb/patnt.01.nsq
-rw-rw-r-- 1 helixapp staff  865727460 May  6 04:31 /fdb/blastdb/patnt.02.nsq
-rw-rw-r-- 1 helixapp staff  386808628 May  6 04:26 /fdb/blastdb/patnt.03.nsq

The total of these 4 files is about 3 GB. Swarm jobs should therefore be submitted with the '-g 3' flag.

Easyblast will perform these calculations before submitting the job.
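
If you are setting up jobs by hand, the same arithmetic can be sketched in the shell. This is not Easyblast's own code. The function names bytes_to_gb and db_mem_gb are made up for illustration, and the sketch assumes 1 GB means 2^30 bytes, which may not match how swarm interprets '-g'.

```shell
# bytes_to_gb: round a byte count up to whole GB (assumed here to be 2^30 bytes).
bytes_to_gb() {
    echo $(( ($1 + 1073741823) / 1073741824 ))
}

# db_mem_gb: total the sizes of the given database files, rounded up to whole GB.
# Usage: db_mem_gb /fdb/blastdb/patnt*.nsq
db_mem_gb() {
    total=0
    for f in "$@"; do
        size=$(wc -c < "$f")                 # file size in bytes
        total=$(( total + size ))
    done
    bytes_to_gb "$total"
}
```

The value printed by db_mem_gb is the number to pass to swarm's '-g' flag; for the four patnt files listed above (3164463668 bytes in total) it prints 3.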
