High-Performance Computing at the NIH
Blat on Biowulf & Helix

BLAT is a DNA/protein sequence analysis program written by Jim Kent at UCSC. It is designed to quickly find sequences of 95% or greater similarity that are 40 bases or longer, and it may miss more divergent or shorter alignments. It finds perfect sequence matches of 33 bases, and sometimes matches as short as 22 bases. On proteins, BLAT finds sequences of 80% or greater similarity that are 20 amino acids or longer. In practice, DNA BLAT works well on primates, and protein BLAT on land vertebrates. For more information, see the BLAT paper or Jim Kent's web page.

Use 'module load blat' to add the latest version of the Blat executables to your path.

Blat on Helix

Sample session:

helix% module load blat
helix% blat /fdb/genome/mouse-mar2006/chrX.fa ./nuc.fasta output.psl
Loaded 165556469 letters in 1 sequences
Searched 1350 bases in 1 sequences

helix%
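The output file from a session like the one above is in BLAT's default PSL format: a short header followed by one tab-separated line per alignment. As a minimal sketch, the hypothetical helper below (psl_hits is not part of BLAT) lists the query name, target name, strand, match count, and target start/end for each hit. It assumes the standard 21-column PSL layout and skips header lines by requiring a numeric first field.

```shell
# Sketch only: summarize the alignments in a PSL file.
# psl_hits is a hypothetical helper name, not a BLAT tool.
# Assumed PSL columns: 1=matches, 9=strand, 10=qName, 14=tName,
# 16=tStart, 17=tEnd.
psl_hits() {
    awk -F'\t' '$1 ~ /^[0-9]+$/ && NF >= 21 {
        # Header lines do not start with a number, so they are skipped.
        printf "%s\t%s\t%s\t%d\t%d\t%d\n", $10, $14, $9, $1, $16, $17
    }' "$1"
}
```

For example, `psl_hits output.psl` prints one summary line per alignment found in the session above.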

Easyblat on Biowulf

Easyblat is a convenient script interface for running large BLAT jobs. You need to put all your query sequences into a directory, and then type 'easyblat' at the Biowulf prompt. You will be prompted for all required parameters.

Sample session:

[susanc@biowulf bench]$ easyblat

EasyBLAT: BLAT (not Blast!) for large numbers of sequences
Enter the directory which contains your input sequences: 100n

Enter the directory where you want your BLAT output to go: out

The following databases are available:
  H - Human Genome hg19 Feb 2009 assembly 
  M - Mouse Genome mm9 Jul 2007 assembly 
  O - Other databases
Enter H, M or O for a detailed list: h
Human Genome (hg19, Feb 2009) assembly:
    chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11
    chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, 
    chr21, chr22, chrX, chrY, chr_all
Enter human section to run against: chrX

Any additional BLAT parameters (e.g. -maxGap=3): -maxGap=3

 Want an email message when the job ends? (y/n default n): y
Database /fdb/genome/human-feb2009/chrX.fa is 0.16 GB. Requesting 2 GB for each Blat process
Submitting with: /usr/local/bin/swarm  -f /home/susanc/blat_30338.swarm -g 2 --job-name=EBlat-11Feb2015-1118
Job 13983 has been submitted.
Cleanup/summary job 13984 has been submitted.

You can run against your own database (any fasta format file) by selecting 'other databases', and then entering the full pathname of the database you want to search. For example:

The following databases are available:
  H - Human Genome (Apr 2006) assembly 
  M - Mouse Genome (Jul 2007) assembly 
  O - Other databases
Enter H, M or O for a detailed list: O
Other databases, updated weekly:
    pdb - from the PDB 3-dimensional structures
    drosoph - Drosophila sequences
    ecoli - E. Coli sequences
    mito - mitochondrial sequences
    yeast - Yeast sequences

If using your own database, enter the full pathname.
Enter db to run against: /data/user/my_db.fas

Running a swarm of Blat batch jobs on Biowulf

Easyblat uses swarm. If you prefer to run swarm directly, set up a swarm command file along the following lines:

# this file is called blatcmd
# commands are 'blat  database_file  query_sequence  outputfile'
#
blat /fdb/genome/mm9/chr_all.fa  /data/user/myseqs/seq1.fas /data/user/blatout/seq1.out
blat /fdb/genome/mm9/chr_all.fa  /data/user/myseqs/seq2.fas /data/user/blatout/seq2.out
blat /fdb/genome/mm9/chr_all.fa  /data/user/myseqs/seq3.fas /data/user/blatout/seq3.out
blat /fdb/genome/mm9/chr_all.fa  /data/user/myseqs/seq4.fas /data/user/blatout/seq4.out
[...]
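Writing such a command file by hand becomes tedious with many query files. The sketch below generates one blat line per query file in a directory. The helper name make_blatcmd is hypothetical, the .fas extension and .out suffix simply mirror the example above, and paths containing whitespace are not handled.

```shell
#!/bin/bash
# Sketch only: write one blat command per query file into a swarm command file.
# make_blatcmd is a hypothetical helper, not part of blat, easyblat, or swarm.
make_blatcmd() {
    local db=$1 query_dir=$2 out_dir=$3 cmdfile=$4
    local query name
    : > "$cmdfile"                       # start with an empty command file
    for query in "$query_dir"/*.fas; do
        [ -e "$query" ] || continue      # skip if the glob matched nothing
        name=$(basename "$query" .fas)
        printf 'blat %s %s %s/%s.out\n' "$db" "$query" "$out_dir" "$name" >> "$cmdfile"
    done
}
```

With the paths from the example, this would be called as `make_blatcmd /fdb/genome/mm9/chr_all.fa /data/user/myseqs /data/user/blatout blatcmd`.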

The memory required for each blat command is approximately the size of the database file. In this case, the file chr_all.fa is about 2.6 GB:

[user@biowulf ]$ ls -lh /fdb/genome/mm9/chr_all.fa
-rw-rw-r-- 1 helixapp staff 2.6G Mar 25  2008 /fdb/genome/mm9/chr_all.fa

Thus, we can estimate that each blat command requires 3 GB of memory. Submit this swarm job with:

swarm -g 3 -f blatcmd
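The same estimate can be scripted by rounding the database file size up to the next whole GB. This is only a sketch of the rule of thumb above: the helper names bytes_to_gb and gb_for_file are hypothetical, and 1 GB is taken as 1024^3 bytes to match the `ls -lh` output.

```shell
# Sketch only: round a size up to whole GB, following the rule of thumb
# that blat needs roughly as much memory as the database file is large.
# Both helper names are hypothetical.
bytes_to_gb() {
    # Integer ceiling division by 1024^3 (1073741824 bytes).
    echo $(( ($1 + 1073741823) / 1073741824 ))
}

gb_for_file() {
    # wc -c reports the file size in bytes.
    bytes_to_gb "$(wc -c < "$1")"
}
```

The result can be passed straight to swarm, for example `swarm -g "$(gb_for_file /fdb/genome/mm9/chr_all.fa)" -f blatcmd`.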

Documentation

BLAT - The BLAST-Like Alignment Tool. W. James Kent, Genome Research 12(4): 656-664, April 2002.
BLAT Suite Program Specifications and User Guide, at the UCSC Genome website. All BLAT options are listed on this page.
BLAT FAQ