Biowulf High Performance Computing at the NIH
Blast on Biowulf

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

Blast was developed at NCBI, NIH. (Blast website)

BLAST on Biowulf is intended for running large numbers of query sequences (hundreds or thousands) against the Blast databases. If you have just a few query sequences, use Blast on the NCBI website instead.

Both Blast+ and legacy Blast 2.2.26 are available on this system. To see all available versions, use the module commands as in the example below.

[user@biowulf ~]$ module avail blast

-------------------------- /usr/local/lmod/modulefiles -------------------------
   blast/2.2.26    blast/2.2.30+ (D)

   (D):  Default Module
[user@biowulf ~]$  module load blast

[user@biowulf ~]$ module list

Currently Loaded Modules:
  1) blast/2.2.30+
[user@biowulf ~]$ makeblastdb
  makeblastdb [-h] [-help] [-in input_file] [-input_type type]
    -dbtype molecule_type [-title database_title] [-parse_seqids]
    [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
    [-mask_desc mask_algo_descriptions] [-gi_mask]
    [-gi_mask_name gi_based_mask_names] [-out database_name]
    [-max_file_sz number_of_bytes] [-logfile File_Name] [-taxid TaxID]
    [-taxid_map TaxIDMapFile] [-version]

   Application to create BLAST databases, version 2.2.30+

Use '-help' to print detailed descriptions of command line arguments

Loading a Blast module will add the blast programs and database-formatting programs to your path.


Easyblast is an easy interface to Blast on Biowulf. It is a wrapper script that will prompt you for all required parameters, set up your jobs appropriately and submit them to the cluster. You will need to have all your query sequences in multiple files in a single directory (multiple sequences per file is fine).

Easyblast will run the latest version of Blast+. The version will be printed at the beginning of the Easyblast run, and will also appear in the output files.

Note that Easyblast on Biowulf uses the latest Blast+ by default, not Legacy Blast 2.2.26.
There are significant changes between the two versions.
If you want to run Legacy Blast using the old-style parameters like '-b 1 -v 1', use 'easyblast_legacy' instead of 'easyblast'.

Sample session (user input follows each prompt):

[user@biowulf ~]$ easyblast              [use easyblast_legacy to run Blast 2.2.26]

EasyBlast: Blast 2.9.0+ for large numbers of sequences
Enter the directory which contains your input sequences: /data/user/blast/bench/100n

Enter the directory where you want your Blast output to go: /data/user/blast/bench/out
** WARNING: There are already files in /data/user/blast/bench/out which will be deleted by this job.
** Continue? (y/n) : y
Cleaning up output directory...

BLAST programs:
    blastn - nucleotide query sequence against nucleotide database
    blastp - protein query sequence against protein database
    blastx - nucleotide query translated in all 6 reading frames
          against a protein database
    tblastn - protein query sequence against a nucleotide database
          translated in all 6 reading frames
    tblastx - 6-frame translations of a nucleotide query sequence
          against the 6-frame translations of a nucleotide database
     rpsblast - protein query sequence against protein domain database
    rpstblastn - nucleotide query sequence against protein domain database
Which program do you want to run: blastn

The following nucleotide databases are available:
(or enter your own database with full pathname)
    nt - NCBI nonredundant Genbank+EMBL+DDBJ+PDB (no EST, STS, GSS or HTG)
    est_human - nonredundant Genbank+EMBL+DDBJ EST human sequences
    est_mouse - nonredundant Genbank+EMBL+DDBJ EST mouse sequences
    est_others - nonredundant Genbank+EMBL+DDBJ EST all other organisms
    pdbnt - from the 3-dimensional structures
    patnt - patent sequences
    htgs - high throughput genome sequences
    mito.nt - mitochondrial sequences
    yeast.nt - yeast (Saccharomyces cerevisiae) genomic sequences
    drosoph.nt - drosophila sequences
    hs_genome - human genome assembly (Build 37, hg19, Feb 2009)
    hs_genome.masked - human genome masked (Build 37, hg19, Feb 2009)
    hs_genome.rna - human genome RNA (Build 37, hg19, Feb 2009)
    human_genomic - human genomic sequences from NCBI
    mm10 - Mouse genome, (mm10)
    mm10.masked - Mouse genome, masked (mm10)
    mm10.rna - Mouse genome RNA (mm10)
    other_genomic - non-human genomic sequences
    human.rna - RefSeq human RNA
    mouse.rna - RefSeq mouse RNA
    16SMicrobial - Microbial 16S rRNA
    rhesus - Rhesus genome
    viral - Virus sequences
    wgs - Whole-Genome-Shotgun contigs
Database to run against: est_human
Database /fdb/blastdb/est_human is 1.16 GB. By default, easyblast will allocate 11 GB memory on each node. Some jobs may require additional memory.
Enter a memory allocation in GB, or leave blank to accept the default 11:

Local disk scratch space allocation default allocation is 380 GB. Some jobs may require additional scratch space.
Enter a scratch space allocation in GB, or leave blank to accept the default 380:
Any additional Blast parameters (e.g. -task megablast -word_size 8 -ungapped ):
Blast parameters :  -num_threads 2

 Want an email message when the job ends? (y/n default n): y
#jobs = 1, #seqfiles = 100, #seqs/job = 100
Creating file /data/user/blast/bench/out/scripts/1
Submitting with: /usr/local/bin/swarm  --logdir=/data/user/blast/bench/out/swarm_logs --silent -f /data/user/blast/bench/out/scripts/swarmfile -t 32 -g 11 --time=8:00:00 --job-name EBlast-03Apr2019-0956 --gres=lscratch:380
Job 23675043 has been submitted.
Cleanup/summary job 23675044 has been submitted.

When the job completes, the output files will appear in your specified output directory. If you requested a summary, a file called 'summary' will also be in this directory. It will contain the hits from each Blast result (no alignments) so that you can scroll through it easily.

To run against your own database, enter the db name with full path at the Database: prompt. For example:

       Database to run against: /data/username/blast_db/my_db
Database files have suffixes like .nsq, .nin (nucleotide), .psq, .psi (protein) etc. You should enter the full path and the database name without the suffix. Thus, if your database files are called my_db.nsq etc., enter the database name as my_db.
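
As a small illustration of the naming rule above, the following bash sketch derives the database name from the path of one of its files. The helper name blast_db_name is hypothetical. It assumes only the common sequence, index and header suffixes (.nsq/.nin/.nhr and .psq/.pin/.phr), and it treats a purely numeric component before the suffix (as in my_db.00.nsq) as a volume number.

```shell
# Print the Blast database name for a given database file path.
blast_db_name() {
    local f=$1 vol
    case $f in
        *.nsq|*.nin|*.nhr|*.psq|*.pin|*.phr)
            f=${f%.*}          # drop the file-type suffix, e.g. .nsq
            vol=${f##*.}       # text after the last remaining dot, if any
            case $vol in
                ''|*[!0-9]*) ;;                         # not a volume number
                *) [ "$vol" != "$f" ] && f=${f%.*} ;;   # my_db.00 -> my_db
            esac
            ;;
    esac
    printf '%s\n' "$f"
}

# Example (not executed here):
#   blastn -db "$(blast_db_name /data/username/blast_db/my_db.nsq)" ...
```

A database whose own name ends in a numeric component (for example foo.2) would be shortened incorrectly by this sketch, so check the result before using it.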

You can put multiple sequences into each of your input sequence files. However, if you have only 1 sequence file containing a large number of sequences, this will run on 1 node, with all the Blast runs performed sequentially. You can better utilize the parallel processing power of Biowulf by dividing your sequences into a larger number of files.

If your query sequences are all in one file, and you need to split them into multiple sequence files, there are a couple of utilities available:
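
Independently of those utilities, a multi-sequence FASTA file can also be split with standard awk. The following is a minimal sketch, not a Biowulf-provided tool: the function name split_fasta, the chunk_NNNN.fas file naming and the default of 100 sequences per file are all assumptions made for illustration.

```shell
# Split a multi-sequence FASTA file into files holding at most
# per_file sequences each (default 100).
split_fasta() {
    local input=$1 outdir=$2 per_file=${3:-100}
    mkdir -p "$outdir"
    awk -v n="$per_file" -v dir="$outdir" '
        /^>/ {
            # Start a new output file every n header lines.
            if (count % n == 0) {
                if (out) close(out)
                out = sprintf("%s/chunk_%04d.fas", dir, ++chunk)
            }
            count++
        }
        # Write header and sequence lines to the current chunk;
        # anything before the first header is skipped.
        out { print > out }
    ' "$input"
}

# Example (not executed here):
#   split_fasta /data/user/myseqs/all.fas /data/user/myseqs/split 100
```

The resulting directory can then be given to Easyblast as the input directory.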

BLAST Databases
Local copies of the sequence databases used by Blast can be found in the directory /fdb/blastdb. This directory is a weekly-updated mirror of the one maintained by NCBI. Some databases were built from Fasta-format files, with the command:

formatdb -o T

Running via swarm

Easyblast uses swarm. If you prefer to use swarm directly, you can use easyblast to set up a swarm command file, but submit the swarm yourself.

Use the '-d' flag to easyblast to set up the swarm command file but not submit it. Example:

[user@biowulf ~]$ easyblast -d

Easyblast running in debug mode. The swarm command file will be generated, but it is left to you to submit it as desired.

EasyBlast: Blast 2.2.30+ for large numbers of sequences
Enter the directory which contains your input sequences: /data/user/blast/bench/10n

[....usual easyblast run....]

---------- Easyblast running in debug mode, no swarm submitted -------------------
Swarm command file created in /home/user/blast_13485.swarm
Recommended swarm submission command
swarm --singleout -f /home/user/blast_13485.swarm --module blast -g 5

You can now edit this swarm command file, or submit it requesting more memory than the default (5 GB in the example above). For example:

swarm --singleout -f /home/user/blast_13485.swarm --module blast -g 10

Of course, you can set up a swarm command file yourself and avoid easyblast altogether, along the following lines:

# this file is called blastcmd
blastn -db /fdb/blastdb/nt -query /data/user/myseqs/seq1.fas -out /data/user/blastout/seq1.out
blastn -db /fdb/blastdb/nt -query /data/user/myseqs/seq2.fas -out /data/user/blastout/seq2.out
blastn -db /fdb/blastdb/nt -query /data/user/myseqs/seq3.fas -out /data/user/blastout/seq3.out
blastn -db /fdb/blastdb/nt -query /data/user/myseqs/seq4.fas -out /data/user/blastout/seq4.out

Determine the size of the database file (see the section on Blast jobs and memory). Suppose the database is 8.3 GB: round up to 9 GB and request that much memory with the '-g 9' flag.
Submit this swarm with:

swarm -g 9 -f blastcmd --module blast
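
For many query files, a command file like blastcmd can be generated with a loop rather than written by hand. The bash sketch below assumes the query files end in .fas, as in the example above; the function name make_blast_swarm and its argument order are hypothetical.

```shell
# Print one Blast command line per query file found in querydir.
make_blast_swarm() {
    local querydir=$1 outdir=$2 db=$3 prog=${4:-blastn}
    local q base
    for q in "$querydir"/*.fas; do
        [ -e "$q" ] || continue          # no matching files: print nothing
        base=$(basename "$q" .fas)
        printf '%s -db %s -query %s -out %s/%s.out\n' \
            "$prog" "$db" "$q" "$outdir" "$base"
    done
}

# Example (not executed here):
#   make_blast_swarm /data/user/myseqs /data/user/blastout /fdb/blastdb/nt > blastcmd
```

The function only prints the command lines; the output directory must already exist when the swarm runs.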

Note that the Blast+ versions (e.g. Blast 2.2.27+) do not include a 'blastall' executable. [Brief summary of differences between '+' and legacy versions]

Blast Database Update Status -- status of all Blast databases installed on the system.

Single Blast job

You can also submit a single Blast job. Your batch script would be along the following lines:


#!/bin/bash
set -e
module load blast
blastx -db /fdb/blastdb/nr -query /data/user/myseqs/seq1.fas -out /data/user/blastout/seq1.out [...other Blast parameters....]

If your query sequence file has multiple sequences, it is important to allocate enough memory for the entire database to fit. Otherwise the Blast program must re-read the database repeatedly, which generates a heavy I/O load and significantly slows down your job. In the example above, the database is 'nr'. Based on the calculation described below, the nr database is about 50 GB, so this job should be submitted requesting about 60 GB:

sbatch --mem=60gb
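
Blast+ programs accept a -num_threads option, and a batch script can tie it to the CPUs actually allocated to the job. The sketch below assumes the job is submitted with Slurm's --cpus-per-task option, in which case Slurm sets the SLURM_CPUS_PER_TASK variable; the helper name blast_threads is hypothetical.

```shell
# Print the number of threads Blast should use: the CPUs Slurm allocated
# to this task, or 1 when the variable is not set (e.g. outside a job).
blast_threads() {
    printf '%s\n' "${SLURM_CPUS_PER_TASK:-1}"
}

# Example use inside the batch script (not executed here):
#   blastx -db /fdb/blastdb/nr -query /data/user/myseqs/seq1.fas \
#          -out /data/user/blastout/seq1.out -num_threads "$(blast_threads)"
```

Falling back to 1 keeps the script usable when it is run outside a Slurm allocation.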

Blast Databases

A large number of Blast-formatted databases are centrally installed and maintained by the HPC staff. See the database page for a current list, including date of last update.

Blast jobs and Memory

If you set up your own Blast jobs rather than using Easyblast, note that it is fastest and most efficient if the jobs allocate enough memory for the Blast database you're using. If there is insufficient memory, the database will be read over and over from disk, which is I/O intensive and will slow down your job. It can also cause huge filesystem problems if you are running a lot of jobs, and you may get contacted by the Biowulf staff.

Calculate the memory required by totalling the sizes of the .nsq (for nucleotide databases) or .psq (for protein databases) files. For example:

biowulf% du -sh --total /fdb/blastdb/patnt*.nsq
915M    /fdb/blastdb/patnt.00.nsq
858M    /fdb/blastdb/patnt.01.nsq
809M    /fdb/blastdb/patnt.02.nsq
952M    /fdb/blastdb/patnt.03.nsq
958M    /fdb/blastdb/patnt.04.nsq
75M     /fdb/blastdb/patnt.05.nsq
4.5G    total
The total of these 6 files is about 4.5 GB. Swarm jobs should therefore be submitted with the '-g 5' flag, or better yet, '-g 6' for safety.
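
The calculation above can be scripted. The following bash sketch sums the sizes of a database's .nsq or .psq files, rounds up to whole GB and adds 1 GB of headroom. The function name blast_mem_gb and the 1 GB margin are assumptions, and the sketch relies on GNU stat (-c %s), which reports apparent file sizes rather than disk usage.

```shell
# Print a suggested memory request in GB for a Blast database.
#   $1  database path without suffix, e.g. /fdb/blastdb/patnt
#   $2  "nucl" (default) or "prot"
blast_mem_gb() {
    local db=$1 type=${2:-nucl} ext=nsq
    [ "$type" = prot ] && ext=psq
    # Sum the file sizes in bytes, round up to whole GB, add 1 GB headroom.
    stat -c %s "$db"*."$ext" |
        awk '{ s += $1 } END { print int((s + 2^30 - 1) / 2^30) + 1 }'
}

# Example (not executed here):
#   swarm -g "$(blast_mem_gb /fdb/blastdb/patnt)" -f blastcmd --module blast
```

For the patnt files listed above (about 4.5 GB in total), this would be expected to print 6, in line with the '-g 6' suggestion.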

Easyblast will perform these calculations before submitting the job.