Biowulf High Performance Computing at the NIH
Blat on Biowulf

BLAT is a DNA/protein sequence analysis program written by Jim Kent at UCSC. It is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 22 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice, DNA BLAT works well on primates, and protein BLAT on land vertebrates.


Easyblat on Biowulf

Easyblat is a convenient command-line interface for running Blat on a large number of query sequences.
Put all your query sequences into a directory, and then type 'easyblat' at the Biowulf prompt. You will be prompted for all required parameters.

Sample session (user input shown after each prompt):

[user@biowulf]$  easyblat

EasyBLAT: BLAT (not Blast!) for large numbers of sequences
Enter the directory which contains your input sequences: 100n

Enter the directory where you want your BLAT output to go: out

The following databases are available:
  H - Human Genome hg19 Feb 2009 assembly 
  M - Mouse Genome mm9 Dec 2010 assembly 
  O - Other databases
Enter H, M or O for a detailed list: h
Human Genome (hg19, Feb 2009) assembly:
    chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11
    chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, 
    chr21, chr22, chrX, chrY, chr_all
Enter human section to run against: chrX

Any additional BLAT parameters (e.g. -maxGap=3): -maxGap=3

 Want an email message when the job ends? (y/n default n): y
Database /fdb/genome/hg19/chrX.fa is 0.16 GB. Requesting 2 GB for each Blat process
Submitting with: /usr/local/bin/swarm  -f /home/susanc/blat_30338.swarm -g 2 --job-name=EBlat-11Feb2015-1118
Job 13983 has been submitted.
Cleanup/summary job 13984 has been submitted.

You can run against your own database (any fasta format file) by selecting 'other databases', and then entering the full pathname of the database you want to search. For example:

The following databases are available:
  H - Human Genome hg19 Feb 2009 assembly
  M - Mouse Genome mm10 Dec 2010 assembly
  O - Other databases
Enter H, M or O for a detailed list: O
Other databases, updated weekly:
    pdb - from the PDB 3-dimensional structures
    drosoph - Drosophila sequences
    ecoli - E. Coli sequences
    mito - mitochondrial sequences
    yeast - Yeast sequences

If using your own database, enter the full pathname.
Enter db to run against: /data/$USER/my_db.fas
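Before pointing easyblat at your own database, it can be worth a quick sanity check that the file really is fasta format. A minimal sketch (the helper name `count_seqs` is our own, not part of easyblat):

```shell
# count_seqs: count the sequences in a fasta file.
# Every fasta record begins with a '>' header line, so counting
# those lines gives the number of sequences in the database.
count_seqs() {
  grep -c '^>' "$1"
}
```

Usage, with the example path from the session above: `count_seqs /data/$USER/my_db.fas`. A result of 0 means the file is not in fasta format and Blat will not be able to read it.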

Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run the program. Sample session:

[user@biowulf]$ sinteractive
salloc.exe: Pending job allocation 46116226
salloc.exe: job 46116226 queued and waiting for resources
salloc.exe: job 46116226 has been allocated resources
salloc.exe: Granted job allocation 46116226
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

[user@cn3144 ~]$ module load blat

[user@cn3144 ~]$ faToNib gi_22507416.fasta gi_22507416.nib

[user@cn3144 ~]$ blat /fdb/fastadb/hs_genome.rna.fas gi_22507416.nib out.psl
    Loaded 108585020 letters in 42753 sequences
    Searched 1238 bases in 1 sequences 

[user@cn3144 ~]$ exit
salloc.exe: Relinquishing job allocation 46116226
[user@biowulf ~]$
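The psl output produced above is tab-delimited with a 5-line header; column 1 is the match count, column 10 the query name, column 14 the target name, and columns 16-17 the target start and end. A small helper to list the best hits first (a sketch; the function name `top_hits` is ours, not part of the Blat distribution):

```shell
# top_hits: read a .psl file on stdin and print alignments sorted by
# match count (psl column 1), best first.
# Output columns: matches, query name, target name, target span.
top_hits() {
  tail -n +6 \
    | sort -k1,1nr \
    | awk -F '\t' '{ printf "%s\t%s\t%s\t%s-%s\n", $1, $10, $14, $16, $17 }'
}
```

For example: `top_hits < out.psl | head` shows the ten strongest alignments.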

Batch job
Most jobs should be run as batch jobs.

Create a batch input file (e.g. Blat.sh). Since Blat runs spend significant time reading in the database, it is most efficient to run several queries against the same database within a single batch job.

For example:

#!/bin/bash
set -e
module load blat
blat    /fdb/fastadb/est_human.fas  gi_22507416.nib   out1.psl 
blat    /fdb/fastadb/est_human.fas   gi_22507417.nib   out2.psl 
blat    /fdb/fastadb/est_human.fas   gi_22507418.nib   out3.psl 

Submit this job using the Slurm sbatch command. To determine the best memory allocation, check the size of the fasta-format database. e.g.

[user@biowulf]$ ls -lh /fdb/fastadb/est_human.fas
-rw-rw-r-- 1 helixapp helixapp 5.1G Mar 21  2017 /fdb/fastadb/est_human.fas

Therefore, an appropriate memory allocation for this 5.1 GB database would be 7 or 8 GB:

sbatch --mem=8g Blat.sh

Note that easyblat, described above, will automatically select the appropriate memory allocation for your jobs.
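The rule of thumb above (database size rounded up, plus a couple of GB of headroom) can be scripted. A hedged sketch of such a calculation; the function name `mem_for_db` and the exact headroom are our own choices, not something Biowulf provides:

```shell
# mem_for_db: print a suggested --mem value in GB for a Blat run
# against the given database: the file size rounded up to a whole GB,
# plus 2 GB of headroom (a heuristic, similar in spirit to what
# easyblat computes automatically).
mem_for_db() {
  bytes=$(stat -c %s "$1")                    # file size in bytes (GNU stat)
  echo $(( bytes / (1024*1024*1024) + 1 + 2 ))
}
```

For example: `sbatch --mem=$(mem_for_db /fdb/fastadb/est_human.fas)g Blat.sh`.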

Swarm of Jobs
A swarm of jobs is an easy way to submit a set of independent commands requiring identical resources.

Create a swarmfile (e.g. Blat.swarm). For example:

blat    /fdb/fastadb/est_human.fas   gi_22507416.nib   out1.psl 
blat    /fdb/fastadb/est_human.fas   gi_22507417.nib   out2.psl 
blat    /fdb/fastadb/est_human.fas   gi_22507418.nib   out3.psl 

Submit this job using the swarm command.

swarm -f Blat.swarm -g 8  --module blat
where
    -g #            number of gigabytes of memory required for each process
                    (1 line in the swarm command file)
    --module blat   loads the blat module for each subjob in the swarm
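For more than a handful of queries, writing the swarmfile by hand gets tedious. A sketch of generating it from the .nib files in the current directory (the function name `make_swarm` is ours; the database path is the one used in the examples above):

```shell
# make_swarm: write Blat.swarm with one blat command per *.nib query
# file in the current directory. Each query gets a matching .psl
# output file (gi_22507416.nib -> gi_22507416.psl).
make_swarm() {
  db=/fdb/fastadb/est_human.fas
  : > Blat.swarm                        # start with an empty swarmfile
  for q in *.nib; do
    [ -e "$q" ] || continue             # skip if no .nib files are present
    echo "blat    $db   $q   ${q%.nib}.psl" >> Blat.swarm
  done
}
```

After running `make_swarm`, submit as before with `swarm -f Blat.swarm -g 8 --module blat`.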