High-Performance Computing at the NIH
Scientific Databases

A set of centrally maintained and updated scientific databases is made available to users of Helix and Biowulf. The available databases are listed below. [List of genomes]

Blast: Formatted for the NCBI Blast program. For use via command-line Blast on HPC systems.
Fasta: Fasta-format flat-file databases used by Fasta, Blat, and other programs.
EMBOSS: Formatted for the EMBOSS sequence analysis package. Can be accessed through the EMBOSS web interface or the EMBOSS command line on any system.
MySQL: Accessible through the HPC mirror of the UCSC Genome Browser. Also available for direct MySQL queries from the Biowulf cluster nodes.
PDB: Protein Data Bank 3-D structures of macromolecules. Can be accessed at /pdb on any HPC system, or users can Samba-mount the PDB database on their own systems.
CSD: Cambridge Structural Database. Can be accessed through WebCSD (NIH access only).
PFAM: A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Used by HMMER and other programs.
VCF: Variant Call Format for genomic data
BAM: Binary SAM files
.fai: Fasta index file produced by samtools faidx
.dict: Sequence dictionary file produced by Picard CreateSequenceDictionary
Annotations: Genome annotations
Igenomes: Illumina's Igenomes are a collection of reference sequences and annotation files for commonly analyzed organisms. More information is available from Illumina.
Hisat indexes: Indexes for the Hisat program for mapping RNA-seq reads
Hiseq data: For Illumina HiSeq Analysis
Defuse data: Data used by the deFuse program for gene fusion discovery using RNA-Seq data
Plinkseq data: Data used by plinkseq, a library for working with human genetic variation data.
Novoalign indexes: Indexes for the Novoalign aligner for single-end and paired-end reads from the Illumina Genome Analyzer
Fusionmap indexes: Used by the fusionmap program for fusion alignment
Gemini data: For use by the Gemini program to explore genomic variation
DSSP: A database of secondary structure assignments (and much more) for all protein entries in the Protein Data Bank (PDB).
Bfast indexes: For use by the Bfast program for fast and accurate mapping of short reads to reference sequences
GENCODE: Data from the GENCODE Project, an encyclopaedia of genes and gene variants.
HMMER indices: Indexed for the HMMER program, which uses profile hidden Markov models for biological sequence analysis
Impute2: Reference data for the Impute program for genotype imputation and haplotype phasing
picrust: Data for the picrust program, whose name stands for Phylogenetic Investigation of Communities by Reconstruction of Unobserved States (PICRUSt).
APT: Files for Affymetrix GeneChip® arrays
gimmemotifs: Index files for gimmemotifs
hapmap: Files used by hapmap to identify and catalog genetic similarities and differences in human beings
homer: Files used by homer for motif discovery and ChIP-Seq analysis
igenomes_extra: Files related to igenomes that are not downloaded directly from the formal release of Illumina igenomes.
lifescope: Reference data used for the examples of the lifescope program
miso: Indexes used by miso, which quantitates the expression level of alternatively spliced genes from RNA-Seq data.
rseg: Rseg chromosome size file
sratoolkit: Files for sratoolkit
plinkseq: Genome files used by the plinkseq program.
gmap/gsnap indices: Indices used for alignments with gmap/gsnap
GATK resource bundle: Standard data set for working with GATK
dbNSFP: dbNSFP is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. The data is in a tab-delimited file with header descriptions.
minimac: 1000 Genomes data for minimac genotype imputation
kraken: Databases for the kraken taxonomic classification system
VEP data: For use by the VEP program.
Meme: Meme databases
Mascot: For use by the Mascot search engine
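
The .fai entry above refers to the index that samtools faidx writes alongside a FASTA file. As a rough sketch of how such an index can be read, the Python below parses its five tab-separated columns (name, length, offset, bases per line, bytes per line) and computes the byte offset of a given base. The names `parse_fai` and `base_offset` and the sample index text are illustrative assumptions, not part of any NIH HPC tool.

```python
from typing import Dict, NamedTuple


class FaiRecord(NamedTuple):
    length: int      # number of bases in the sequence
    offset: int      # byte offset of the first base in the FASTA file
    linebases: int   # bases per line
    linewidth: int   # bytes per line, including the newline


def parse_fai(text: str) -> Dict[str, FaiRecord]:
    """Parse the contents of a .fai file into a name -> record mapping."""
    index = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, offset, linebases, linewidth = line.split("\t")[:5]
        index[name] = FaiRecord(
            int(length), int(offset), int(linebases), int(linewidth)
        )
    return index


def base_offset(rec: FaiRecord, pos: int) -> int:
    """Return the byte offset in the FASTA file of 0-based position `pos`."""
    if not 0 <= pos < rec.length:
        raise ValueError("position outside sequence")
    full_lines, column = divmod(pos, rec.linebases)
    return rec.offset + full_lines * rec.linewidth + column


# Hypothetical index for a FASTA file with two sequences wrapped at 60 bases.
SAMPLE_FAI = "chrA\t150\t6\t60\t61\nchrB\t90\t165\t60\t61\n"
index = parse_fai(SAMPLE_FAI)
```

With the byte offset in hand, a program can seek directly to a region of a large reference file instead of scanning it from the start.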
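
The .dict entry above refers to the sequence dictionary written by Picard CreateSequenceDictionary, which is a SAM-style header with one @SQ line per reference sequence. The sketch below, offered as an illustration only, collects sequence names (SN) and lengths (LN) from such text. The function name `parse_sequence_dict` and the sample contents, including the placeholder UR path, are assumptions made for this example.

```python
from typing import Dict


def parse_sequence_dict(text: str) -> Dict[str, int]:
    """Return a mapping of sequence name (SN) to length (LN) from @SQ lines."""
    lengths = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@SQ":
            continue  # skip @HD and any other header record types
        # Each field is TAG:VALUE; split once so values may contain colons.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths


# Hypothetical dictionary contents; the UR path is a placeholder.
SAMPLE_DICT = (
    "@HD\tVN:1.5\n"
    "@SQ\tSN:chrA\tLN:150\tUR:file:/path/to/ref.fa\n"
    "@SQ\tSN:chrB\tLN:90\tUR:file:/path/to/ref.fa\n"
)
seq_lengths = parse_sequence_dict(SAMPLE_DICT)
```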
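
The dbNSFP entry above describes the data as a tab-delimited file with a header line. One way to read a file of that shape is sketched below using Python's standard csv module. The function name `read_tab_delimited` and the four sample columns are hypothetical and chosen only for illustration; the real files carry many more columns, and their exact names should be taken from the file's own header.

```python
import csv
import io
from typing import Dict, Iterator


def read_tab_delimited(handle) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row, keyed by the header line's column names."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # Tolerate a leading '#' on the first header field (an assumption here).
    header[0] = header[0].lstrip("#")
    for row in reader:
        yield dict(zip(header, row))


# Hypothetical two-row excerpt with made-up values.
SAMPLE = "#chr\tpos\tref\talt\n1\t12345\tA\tG\n1\t12345\tA\tT\n"
rows = list(read_tab_delimited(io.StringIO(SAMPLE)))
```

Because the function is a generator, it can stream a large file row by row when given an open file handle instead of the in-memory sample.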