Biowulf High Performance Computing at the NIH
Scientific Databases

A set of centrally maintained and updated scientific databases is available to users of Helix and Biowulf. Click on a link below to see the available databases. [List of genomes]

.dict: Sequence dictionary file produced by Picard CreateSequenceDictionary.
.fai: Fasta index file produced by samtools faidx.
Annotations: Genome annotations.
ANNOVAR: Tab-delimited text files for use with ANNOVAR.
APT: Files for Affymetrix GeneChip® arrays.
BAM: Binary SAM files.
Bfast indexes: For use by the Bfast program for fast and accurate mapping of short reads to reference sequences.
Blast: Blast v5 databases. For use via command-line Blast or easyblast on Biowulf.
Blast_v4: Old Blast v4 databases. No longer updated at NCBI; will be deleted from Biowulf in summer 2020.
CSD: Cambridge Structural Database. Can be accessed through WebCSD (NIH access only).
dbNSFP: A database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. The data is in a tab-delimited file with header descriptions.
Defuse data: Data used by the deFuse program for gene fusion discovery using RNA-Seq data.
DSSP: A database of secondary structure assignments (and much more) for all protein entries in the Protein Data Bank (PDB).
Fasta: Fasta-format flatfile databases used by Fasta, Blat and other programs.
Fusionmap indexes: Used by the fusionmap program for fusion alignment.
GATK resource bundle: Standard data set for working with GATK.
Gemini data: For use by the Gemini program to explore genomic variation.
GENCODE: The GENCODE Project: Encyclopaedia of genes and gene variants.
gimmemotifs: Index files for gimmemotifs.
gmap/gsnap indices: Indices used for alignments with gmap/gsnap.
hapmap: Files for HapMap, used to identify and catalog genetic similarities and differences in human beings.
Hisat indexes: Indexes for the Hisat program for mapping RNA-seq reads.
Hiseq data: For Illumina HiSeq analysis.
HMMER indices: Indexes for the HMMER program, which uses profile hidden Markov models for biological sequence analysis.
homer: Files used by homer for motif discovery and ChIP-Seq analysis.
Igenomes: Illumina's Igenomes are a collection of reference sequences and annotation files for commonly analyzed organisms. More information is available from Illumina.
igenomes_extra: Files related to igenomes that are not downloaded directly from the formal Illumina igenomes release.
Impute2: Reference data for the Impute program for genotype imputation and haplotype phasing.
kraken: Databases for the kraken taxonomic classification system.
Meme: Meme databases.
minimac: 1000 Genomes data for minimac genotype imputation.
miso: Indexes used by miso, which quantitates the expression level of alternatively spliced genes from RNA-Seq data.
MySQL: Accessible through the HPC mirror of the UCSC Genome Browser. Also available for direct MySQL queries from the Biowulf cluster nodes.
Novoalign indexes: Indexes for the Novoalign aligner for single-ended and paired-end reads from the Illumina Genome Analyzer.
PDB: Protein Data Bank 3-D structures of macromolecules. Can be accessed at /pdb on any HPC system, or users can samba-mount the PDB database on their own systems.
PFAM: A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Used by HMMER and other programs.
picrust: Data for the picrust program (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States, PICRUSt).
plinkseq: Genome files used by the plinkseq program.
Plinkseq data: Data used by plinkseq, a library for working with human genetic variation data.
rseg: Rseg chromosome size file.
sratoolkit: Files for sratoolkit.
taxonomy: NCBI Taxonomy database dump files.
Tesseract data: Tesseract trained data.
VCF: Variant Call Format files for genomic data.
VEP data: For use by the VEP program.
Mascot: For use by the Mascot search engine.
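
As an illustration of how one of the file types listed above is laid out, the following is a minimal Python sketch for reading a .fai index. It assumes the standard five-column, tab-delimited layout written by samtools faidx for Fasta files (sequence name, length, byte offset of the first base, bases per line, bytes per line). The function names parse_fai and byte_offset are hypothetical helpers made up for this example, and the sample index lines are illustrative values, not the contents of any file on Biowulf.

```python
import io


def parse_fai(handle):
    """Parse a .fai index into a dict keyed by sequence name.

    Assumes the five-column tab-delimited layout for Fasta files:
    NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH.
    """
    index = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, length, offset, linebases, linewidth = line.split("\t")[:5]
        index[name] = {
            "length": int(length),        # total bases in the sequence
            "offset": int(offset),        # byte offset of the first base
            "linebases": int(linebases),  # bases per Fasta line
            "linewidth": int(linewidth),  # bytes per line, incl. newline
        }
    return index


def byte_offset(record, pos):
    """Return the byte offset in the Fasta file of 0-based base `pos`."""
    full_lines, remainder = divmod(pos, record["linebases"])
    return record["offset"] + full_lines * record["linewidth"] + remainder


# Illustrative index content (hypothetical values, not a real Biowulf file).
example = (
    "chr1\t248956422\t6\t60\t61\n"
    "chr2\t242193529\t253105708\t60\t61\n"
)
index = parse_fai(io.StringIO(example))
for name, record in index.items():
    print(name, record["length"])
```

With a real index, the file would be opened with open(path) instead of io.StringIO; the path depends on where the genome of interest is stored and is not assumed here.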
Genomes