The NIH Biowulf cluster: 10 years of scientific supercomputing

De Novo Protein Structure Generation from Incomplete Chemical Shift Assignments

Ad Bax, Ph.D.

NMR chemical shifts provide important local structural information for proteins. Consistent structure generation from NMR chemical shift data has recently become feasible for proteins with sizes of up to 130 residues, and such structures are of a quality comparable to those obtained with the standard NMR protocol, but require massive computational efforts. We now investigate the influence of the completeness of chemical shift assignments on structures generated from chemical shifts. The Chemical-Shift-Rosetta (CS-Rosetta) protocol was used for de novo protein structure generation with various degrees of completeness of the chemical shift assignment, simulated by omission of entries in the experimental chemical shift data previously used for the initial demonstration of the CS-Rosetta approach. In addition, a new CS-Rosetta protocol is described that improves robustness of the method for proteins with missing or erroneous NMR chemical shift input data. This strategy, which uses traditional Rosetta for pre-filtering of the fragment selection process, is demonstrated for both paramagnetic proteins and for proteins with solid-state NMR chemical shift assignments. The extension of this technology to larger systems, where the impact would be greatest, is limited by computational resources required, which scale steeply with protein size.


Modeling Toxic Alzheimer Amyloid Ion Channels

Ruth Nussinov, Ph.D.

While there is a general agreement that Alzheimer toxicity results from calcium leakage into the cell, the mechanism of amyloid toxicity is poorly understood. There are two schools of thought in this hotly debated field: the first favors membrane destabilization by intermediate to large amyloid oligomers, with consequent membrane thinning and non-specific ion leakage; the second favors ion-specific permeable channels lined by small amyloid oligomers. Confusingly, published results currently support both mechanisms. Thus, currently available structural and physiological data present a challenge for computational biology: Can computations provide candidate models consistent with the experimental data and assist in reconciling the two views?

The talk will review current views of amyloid toxicity, highlighting mechanistic insights which might be obtained by combining experiment with detailed modeling. It will further describe our ongoing work on the modeling of the Alzheimer A-beta oligomers in solution and in the lipid bilayer, and illustrate consistency with currently available experimental data. The modeling has been carried out using detailed atomistic molecular dynamics simulations and experimental data.


Replica Exchange Simulations of Protein-Protein Binding and Multi-Protein Complex Formation

Youngchan Kim, Ph.D. and Gerhard Hummer, Ph.D.

Protein-protein interactions play an essential role in the biological function of many cellular machineries. With advances in molecular and structural biology, and the advent of proteomics, it is now increasingly recognized that many important biological functions are carried out by multi-protein assemblies that form only transiently and are held together by relatively weak pair-wise interactions. We have developed coarse-grained models and effective energy functions for simulating thermodynamic and structural properties of multi-protein complexes with relatively low binding affinity (Kd > 1 µM) [1]. The replica exchange Monte Carlo (REMC) method, efficiently implemented on the parallel architecture of the high-performance Biowulf cluster, allows us to create equilibrium ensembles of bound and unbound states of multi-protein assemblies. With this model, we studied the protein complexes of the bacterial phosphotransferase system [2]. The simulations not only recover the known structures of the specific complexes and binding affinities, but also provide a detailed structural and energetic description of the non-specific transient encounter complexes, as validated by paramagnetic relaxation enhancement NMR experiments [2]. Together with the specific complex, a relatively small number of distinct non-specific complexes largely accounts for the NMR data. From simulations of the Vps27 complex of the ESCRT membrane-protein trafficking system [1,3], we conclude that this membrane-associated multi-protein assembly is dynamic and open, allowing it to bind to a diverse set of ubiquitinated target proteins. We also find that the binding of its different domains is highly cooperative, which is essential for proper function at the low protein concentrations inside the cell.

1. Kim YC, Hummer G. Coarse-grained models for simulations of multiprotein complexes: application to ubiquitin binding. J. Mol. Biol. 375:1416-33, 2008.
2. Kim YC, Chun T, Clore GM, Hummer G. Replica exchange simulations of transient encounter complexes in protein-protein association. Proc. Natl. Acad. Sci. 105, 12855, 2008.
3. Prag G, Watson H, Kim YC, Beach BM, Ghirlando R, Hummer G, Bonifacino JS, Hurley JH. The Vps27/Hse1 complex is a GAT domain-based scaffold for ubiquitin-dependent sorting. Dev. Cell 12: 973-86, 2007.
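The exchange step at the heart of a replica exchange Monte Carlo simulation can be sketched as follows. This is a minimal illustration of the standard Metropolis swap criterion for two replicas at different inverse temperatures, not the authors' implementation; the function names and the example values are hypothetical.

```python
import math
import random


def swap_probability(beta_i, beta_j, energy_i, energy_j):
    """Standard replica exchange acceptance probability.

    beta_i, beta_j: inverse temperatures (1 / kT) of the two replicas.
    energy_i, energy_j: current energies of the configurations held by
    replica i and replica j (same energy units as 1 / beta).
    """
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    # Avoid overflow in exp() when the swap is certain to be accepted.
    if delta >= 0.0:
        return 1.0
    return math.exp(delta)


def attempt_swap(beta_i, beta_j, energy_i, energy_j, rng=random):
    """Return True if the two replicas should exchange configurations."""
    return rng.random() < swap_probability(beta_i, beta_j, energy_i, energy_j)


if __name__ == "__main__":
    # Hypothetical values: the colder replica (larger beta) holds the
    # lower-energy configuration, so the swap is accepted only occasionally.
    print(swap_probability(1.0, 0.5, 5.0, 10.0))
```

Because each replica only needs its neighbor's energy at exchange time, the replicas can run on separate processors between swaps, which is why the method maps well onto a parallel cluster.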


Population Substructure and Control Selection in Genome-wide Association Studies

Kai Yu, Ph.D.

Determination of the relevance of both demanding classical epidemiologic criteria for control selection and robust handling of population stratification (PS) represents a major challenge in the design and analysis of genome-wide association studies (GWAS). Empirical data from two GWAS in European Americans of the Cancer Genetic Markers of Susceptibility (CGEMS) project were used to evaluate the impact of PS in studies with different control selection strategies. In each of the two original case-control studies nested in corresponding prospective cohorts, a minor confounding effect due to PS (inflation factor λ of 1.025 and 1.005) was observed. In contrast, when the control groups were exchanged to mimic a cost-effective but theoretically less desirable control selection strategy, the confounding effects were larger (λ of 1.090 and 1.062). A panel of 12,898 autosomal SNPs common to both the Illumina and Affymetrix commercial platforms and with low local background linkage disequilibrium (pair-wise r² < 0.004) was selected to infer population substructure with principal component analysis. A novel permutation procedure was developed for the correction of PS that identified a smaller set of principal components and achieved a better control of type I error (to λ of 1.032 and 1.006, respectively) than currently used methods. The overlap between sets of SNPs in the bottom 5% of p-values based on the new test and the test without PS correction was about 80%, with the majority of discordant SNPs having both ranks close to the threshold. Thus, for the CGEMS GWAS of prostate and breast cancer conducted in European Americans, PS does not appear to be a major problem in well-designed studies. A study using suboptimal controls can have acceptable type I error when an effective strategy for the correction of PS is employed.
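For readers unfamiliar with the inflation factor λ, the sketch below shows one common way to compute it, assuming the usual genomic-control definition: the median of the observed 1-degree-of-freedom chi-square association statistics divided by the median of that distribution under the null. The abstract does not state which definition was used, so this is an assumption, and the function name and sample values are hypothetical.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.454936423119572


def inflation_factor(chi2_stats):
    """Genomic-control style inflation factor (assumed definition).

    chi2_stats: per-SNP 1-df chi-square association statistics.
    A value near 1.0 suggests little inflation from stratification.
    """
    if not chi2_stats:
        raise ValueError("need at least one test statistic")
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


if __name__ == "__main__":
    # Hypothetical statistics, for illustration only.
    example = [0.02, 0.31, 0.47, 1.10, 3.90]
    print(round(inflation_factor(example), 3))
```

Using the median rather than the mean keeps the estimate from being dominated by the few SNPs that are truly associated with disease.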


Influence of Genomic Variance on the Transcriptome in Human Brain Tissues

J. Raphael Gibbs and Andrew Singleton, Ph.D.

Variability in gene expression within brain tissue affects development, function and risk of neurological disease. It is now feasible to investigate the correlation between genomic variance and the transcriptome from a whole genome perspective. These data allow the examination of genetic and epigenetic variation and their effects on transcript expression at the mRNA and microRNA level. Identifying expression quantitative trait loci (eQTL) allows us to begin to build functional maps of genomic regions within the human brain, with two primary outcomes: first, such a resource would be useful for understanding genomic regions and the regulation of normal gene expression; second, it would act as a resource for immediate mining of the effects of genetic variation, particularly variants identified in genome-wide association studies, on transcription. In the current study we have collected tissue from four brain regions (frontal and temporal cortex, cerebellum and pons) from 150 individuals. For each individual we genotyped 550,000 single nucleotide polymorphisms (SNPs) and in all four brain regions measured DNA methylation at 27,500 CpG sites, expression of 700 microRNAs (miRNAs) and 22,000 gene transcripts (mRNAs). Within each brain region, analyses were performed to identify correlations between data types. Across regions, analyses were performed to identify expression and methylation differences as well as differences within the region-specific correlations.
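As a rough illustration of what a single genotype-expression test might look like, the sketch below computes a Pearson correlation between SNP genotype, coded as the number of minor alleles (0, 1 or 2), and the expression level of one transcript across individuals. The abstract does not specify the statistical model, so the additive coding, the choice of Pearson correlation, the function name and the sample data are all assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical data: minor-allele counts and expression for six people.
    genotypes = [0, 0, 1, 1, 2, 2]
    expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
    print(pearson_r(genotypes, expression))
```

A genome-wide scan repeats a test of this kind for a very large number of SNP-transcript pairs in each tissue, and each test is independent of the others, so the work divides naturally across cluster nodes.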


Computing the Molecular Structures of Cells and Viruses using 3D Electron Microscopy

Sriram Subramaniam, Ph.D.

Emerging methods in three-dimensional biological electron microscopy provide powerful tools and great promise to bridge a critical gap in imaging in the biomedical size spectrum. This gap comprises a size range of considerable interest that includes cellular protein machines, giant protein and nucleic acid assemblies, small subcellular organelles and bacteria. These objects are generally too large and/or too heterogeneous to be investigated by high resolution X-ray and NMR methods; yet the level of detail afforded by conventional light and electron microscopy is often not adequate to describe their structures at resolutions high enough to be useful in understanding the chemical basis of biological function.

The long-term mission of our research program is to obtain an integrated molecular understanding of cellular architecture by combining novel technologies for 3D biological imaging with advanced methods for image segmentation and computational analysis. I will review our recent progress in imaging and modeling dynamic biological systems, with particular emphasis on applications to HIV/AIDS and cancer.

1. Liu, J., Bartesaghi, A., Borgnia, M. J., Sapiro, G. and Subramaniam, S. (2008) Molecular architecture of native gp120 trimers. Nature 455, 109-113.
2. Bartesaghi, A., Sprechman, P., Liu, J., Randall, G., Sapiro, G. and Subramaniam, S. (2008) Classification and 3D averaging with missing wedge correction in biological electron tomography. J. Struct. Biol. 162, 436-450.
3. Khursigara, C., Wu, X., Zhang, P., Lefman, J. and Subramaniam, S. (2008) Role of HAMP domains in chemotaxis signaling by bacterial chemoreceptors. Proc. Natl. Acad. Sci. USA (in press).
4. Narasimha, R., Aganj, I., Borgnia, M. J., Bennett, A., Zabransky, D., Sapiro, G., McLaughlin, S., Milne, J. L. S. and Subramaniam, S. (2008) Automated denoising and feature identification in electron tomograms of viruses and cells. J. Struct. Biol. (in press).
5. Heymann, J. A. W., Shi, D., Kim, S., Bliss, D., Milne, J. L. S. and Subramaniam, S. (2009) 3D imaging of mammalian cells with ion-abrasion scanning electron microscopy. J. Struct. Biol. (in press).


Current Concepts and Future Directions in Virtual Colonoscopy Computer-Aided Detection

Ronald Summers, MD, Ph.D.

Colorectal cancer is the second leading cause of cancer death in the Western world. Virtual colonoscopy is a CT-based method that has proven capable of relatively noninvasive colorectal cancer screening. In virtual colonoscopy, three-dimensional reconstructions of the colon are prepared from CT scans of a patient's abdomen and pelvis. My research group has focused for the last several years on computer-aided detection (CAD) of polyps for virtual colonoscopy. We have developed shape-based features using differential geometry to identify abnormal growths in the colon. In association with collaborators at other institutions, we have developed a database of over 1200 proven virtual colonoscopy cases with optical colonoscopy correlation. Using this database, we continue to make advancements in improving sensitivity and reducing the false positive rate of CAD. Current work includes image processing to register supine and prone virtual colonoscopy examinations on the same patient, CAD of normal colonic features that mimic pathology, and software systems to train classifiers, validate results, and ensure software reliability and integrity. Because of the large size of our case database and the time-consuming processing required to train the CAD system, we have made extensive use of the NIH Biowulf facility. This lecture will provide an overview of the clinical background, mathematical underpinnings, and preliminary clinical trials conducted at the National Institutes of Health.


Improving Statistical Significance Assignment in Mass Spectrometry Based Peptide Identification

Yi-Kuo Yu, Ph.D.

A major challenge in mass-spectrometry-based proteomics is the peptide identification statistics problem. Our group has been working on methods to improve the accuracy of statistical significance assignment in peptide identification. This includes the development of a few new tools as well as practical approaches to improve identification sensitivity. For the former, we have developed RAId_DbS and RAId_deNovo, both tested and run extensively on the Biowulf cluster. For the latter, we have proposed a protocol to calibrate the statistics of various search methods; many of these methods are installed on Biowulf, and the calibration runs were also done there. In particular, we have devised a novel approach that combines the search results from different search methods using Fisher's formula to combine the P-values. More information can be found at http://www.ncbi.nlm.nih.gov/CBBResearch/qmbp. Relevant binaries for Linux, Windows, and Mac OS X are available from the same page.
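Fisher's formula itself is compact enough to sketch. The example below is the textbook form of the method, which assumes the P-values being combined are independent; it is not the group's published procedure, which may treat calibration and dependence between search methods differently. The function name and sample values are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(p_values).
    For even degrees of freedom the tail probability has a closed form:
        P(chi2_2k > X) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    """
    if not p_values:
        raise ValueError("need at least one P-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term = 1.0   # (X/2)^i / i! for i = 0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


if __name__ == "__main__":
    # Hypothetical P-values reported for one spectrum by two search methods.
    print(fisher_combined_p([0.05, 0.05]))
```

Two moderately significant results combine into a stronger one, while a single P-value is returned unchanged, which is a quick sanity check on the formula.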


PubChem: An Open Repository for Chemical Structure and Biological Activity Information

Steve Bryant, Ph.D.

PubChem is an online public information resource from the National Center for Biotechnology Information (NCBI). The system provides information on the biological activities of chemical substances, linking together results from multiple sources on the basis of chemical structure and/or chemical structure similarity. Following the deposition model introduced by GenBank, PubChem's content is derived from user depositions of chemical structure and bioassay data, including high-throughput biological screening results from the NIH Molecular Libraries program. PubChem information retrieval provides basic searches by chemical names or structures as well as more complex structure-activity analysis within and among different bioassays. PubChem provides further information on biological activities via links to other NCBI information resources, such as the PubMed biomedical literature database and NCBI's protein 3D structure database, as well as via links to depositor web sites.