Scientific Databases
A set of centrally-maintained and updated scientific databases is made available to users of Helix and Biowulf. Click on a link below to see the available databases. [List of genomes]
.dict | Sequence dictionary file produced by Picard CreateSequenceDictionary |
.fai | Fasta index file produced by samtools faidx |
Annotations | Genome annotations |
ANNOVAR | Tab-delimited text files for use with ANNOVAR. |
APT | Files for Affymetrix GeneChip® arrays |
BAM | Binary SAM files |
Bfast indexes | For use by the Bfast program for fast and accurate mapping of short reads to reference sequences |
Blast | Blast v5 databases. For use via command-line Blast or easyblast on Biowulf |
Blast_v4 | Old Blast v4 databases. No longer updated at NCBI; will be deleted from Biowulf in summer 2020. |
CSD | Cambridge Structural Database. Can be accessed through WebCSD (NIH access only). |
dbNSFP | dbNSFP is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. The data is in a tab-delimited file with header descriptions. |
Defuse data | Data used by the deFuse program for gene fusion discovery using RNA-Seq data |
DSSP | A database of secondary structure assignments (and much more) for all protein entries in the Protein Data Bank (PDB). |
Fasta | Fasta-format flatfile databases used by Fasta, Blat and other programs. |
Fusionmap indexes | Used by the fusionmap program for fusion alignment |
GATK resource bundle | Standard data set for working with GATK |
Gemini data | For use by the Gemini program to explore genomic variation |
GENCODE | The GENCODE Project: Encyclopaedia of genes and gene variants. |
gimmemotifs | Index files for gimmemotifs |
gmap/gsnap indices | Indices used for alignments with gmap/gsnap |
hapmap | Files used with hapmap to identify and catalog genetic similarities and differences in human beings |
Hisat indexes | Indexes for the Hisat program for mapping RNA-seq reads |
Hiseq data | For Illumina HiSeq Analysis |
HMMER indices | Indexes for the HMMER program, which uses profile hidden Markov models for biological sequence analysis |
homer | Files used by homer for motif discovery and ChIP-Seq analysis |
Igenomes | Illumina's Igenomes are a collection of reference sequences and annotation files for commonly analyzed organisms. More info at Illumina |
igenomes_extra | Files related to Igenomes that were not downloaded directly from the formal Illumina Igenomes release. |
Impute2 | Reference data for the Impute program for genotype imputation and haplotype phasing |
kraken | Databases for the kraken taxonomic classification system |
macs2 output | Peak output from macs2 |
Meme | Meme databases |
minimac | 1000 Genomes data for minimac genotype imputation |
miso | Indexes used by miso, which quantitates the expression level of alternatively spliced genes from RNA-Seq data. |
Mol2 | A Tripos Mol2 file (.mol2) is a complete, portable representation of a SYBYL molecule. It is an ASCII file which contains all the information needed to reconstruct a SYBYL molecule. |
MySQL | Accessible through the HPC mirror of the UCSC Genome Browser. Also available for direct MySQL queries from the Biowulf cluster nodes. |
Novoalign indexes | Indexes for the Novoalign aligner for single-end and paired-end reads from the Illumina Genome Analyzer |
PDB | Protein Data Bank 3-D structures of macromolecules. Can be accessed at /pdb on any HPC system, or users can samba-mount the PDB database on their own systems. |
PFAM | A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Used by HMMER and other programs. |
picrust | Data for the picrust program. PICRUSt stands for Phylogenetic Investigation of Communities by Reconstruction of Unobserved States. |
plinkseq | Genome files used by the plinkseq program. |
Plinkseq data | Data used by plinkseq, a library for working with human genetic variation data. |
RepBase | RepBase formatted files |
rseg | Rseg chromosome size file |
sratoolkit | Files for sratoolkit |
taxonomy | NCBI Taxonomy database dump files |
Tesseract data | Tesseract trained data |
VCF | Variant Call Format for genomic data |
VEP data | For use by the VEP program. |
Mascot | For use by the Mascot search engine |
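Several entries above refer to index files that are plain text and easy to inspect. As an illustration, the .fai files produced by samtools faidx are tab-delimited, with one line per sequence giving its name, length, byte offset of the first base, bases per line, and bytes per line. The sketch below reads that layout in Python. The function names and the sample values (including the offsets) are hypothetical and chosen only for demonstration; they do not describe any particular file on Helix or Biowulf.

```python
from typing import Dict, NamedTuple


class FaiRecord(NamedTuple):
    name: str        # sequence name, e.g. a chromosome
    length: int      # number of bases in the sequence
    offset: int      # byte offset of the first base in the FASTA file
    line_bases: int  # bases per FASTA line
    line_width: int  # bytes per FASTA line, including the newline


def parse_fai(text: str) -> Dict[str, FaiRecord]:
    """Parse the contents of a .fai index into a dict keyed by sequence name."""
    records: Dict[str, FaiRecord] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        length, offset, line_bases, line_width = (int(x) for x in fields[1:5])
        records[fields[0]] = FaiRecord(fields[0], length, offset, line_bases, line_width)
    return records


def byte_offset(rec: FaiRecord, pos0: int) -> int:
    """Byte offset in the FASTA file of the base at 0-based position pos0."""
    if not 0 <= pos0 < rec.length:
        raise ValueError("position outside sequence")
    return rec.offset + (pos0 // rec.line_bases) * rec.line_width + pos0 % rec.line_bases


# Hypothetical index contents, for illustration only.
example = "chr1\t248956422\t6\t60\t61\nchrM\t16569\t253105709\t60\t61\n"
index = parse_fai(example)
```

In practice one would pass the text of a real index, for example the result of open(path).read() on a .fai file, instead of the sample string.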
Genomes