Machine Learning in Structural Bioinformatics
We have been exploring the application of machine learning to prediction problems in structural biology, particularly the interactions of proteins with other proteins and small ligands.
- Protein-protein binding sites: Protein-protein interactions are responsible for much of what goes on inside cells. Specific proteins may stick together to form stable complexes that carry out a particular biochemical task, or they may form transient signalling interactions. Protein binding sites are generally distinguished from the rest of the protein surface by being more hydrophobic, enriched in large hydrophobic and uncharged polar residues, depleted of charged residues, and more highly conserved in evolution. Binding sites also tend to consist of one or a few contiguous surface regions. These properties can be used to predict protein-protein binding sites on a protein structure, even when the identity of the binding partner is unknown.
- Protein docking: The protein docking problem is to predict the structure of a protein complex given the (unbound) structures of its component subunits. In rigid docking, potential bound conformations with sufficient surface shape complementarity are generated, and a scoring function is used to select the few correct solutions from among the numerous incorrect conformations. We explored a machine learning approach to scoring potential solutions. Random Forests were trained on diverse properties of the docking conformations in order to distinguish the correct structures.
- Identifying biological complexes in X-ray structures: The protein molecules in crystals used to determine X-ray structures are arranged in a regular array that extends in all directions. Annotation in structure files of which molecules form the biologically relevant complex is error-prone and it is a non-trivial task to infer this information from the structure alone. The main subproblem that must be solved is to distinguish correct (specific) from incorrect (non-specific or crystal contact) protein-protein interfaces. Because this is similar to distinguishing docking solutions, a similar approach using a Random Forest trained on interface properties is effective. Also, similar interfaces between pairs of related (homologous) proteins are likely to belong to the same class (either specific or non-specific). This information was combined with the number of subunits in the annotated biological complex in order to arrive at a prediction that enforces this consistency condition across the entire set of PDB protein X-ray structures.
- Metal ion and small molecule binding sites in proteins: Many enzymes require metal ions or small molecule cofactors in order to function. Which ions or ligands bind to a particular protein, and where their binding sites are located, may be unknown. In this case, computational methods can make useful predictions that can later be verified by experiments. Unlike protein-protein or metal ion binding sites, small ligand binding sites are typically in a surface pocket that is present even in the unbound protein. Metal ion binding sites are distinguished by characteristic residues, many of which are of opposite (negative) charge. Also, like protein-protein binding sites, metal ion and small ligand binding sites usually have higher evolutionary conservation. These characteristics, combined with the arrangement of residue pairs, were used in a machine learning approach to predicting binding sites for specific divalent metal ions and small ligands.
- Disease-associated mutations occurring in functional sites: We are currently developing computational tools to infer the biochemical and biological consequences of human genome variants, called non-synonymous SNPs, occurring in diverse classes of functional sites, including those involved in protein, DNA and small ligand binding, posttranslational modification, and crucial conformational changes. This method is expected to improve the prediction of disease-associated mutations as well as predict the specific biochemical effects of such mutations that lead to disease.
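The binding-site properties listed in the first item above (hydrophobicity, conservation, depletion of charged residues, and contiguity on the surface) can be combined in a minimal Python sketch. This is an illustration only, not the method used in our work: the linear score, its weights, the threshold, and the toy residue data are all assumptions made for the example, and the surface-neighbor graph is supplied by hand rather than computed from a structure.

```python
from collections import deque


def score_residue(hydrophobicity, conservation, is_charged,
                  w_hydro=1.0, w_cons=1.0, charge_penalty=1.0):
    """Toy linear score (assumed form): higher means more binding-site-like."""
    penalty = charge_penalty if is_charged else 0.0
    return w_hydro * hydrophobicity + w_cons * conservation - penalty


def largest_patch(residues, neighbors, threshold=1.0):
    """Return the largest connected set of surface residues scoring >= threshold.

    residues:  dict residue_id -> (hydrophobicity, conservation, is_charged)
    neighbors: dict residue_id -> iterable of adjacent surface residue ids
    """
    selected = {r for r, feats in residues.items()
                if score_residue(*feats) >= threshold}
    best, seen = set(), set()
    for start in sorted(selected):
        if start in seen:
            continue
        # Breadth-first search restricted to high-scoring residues.
        patch, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            for nb in neighbors.get(current, ()):
                if nb in selected and nb not in seen:
                    seen.add(nb)
                    patch.add(nb)
                    queue.append(nb)
        if len(patch) > len(best):
            best = patch
    return best


# Hypothetical surface residues: (hydrophobicity, conservation, is_charged).
residues = {
    "A1": (0.9, 0.8, False),
    "A2": (0.7, 0.9, False),
    "A3": (0.8, 0.6, False),
    "A4": (0.2, 0.3, True),
    "A5": (0.9, 0.9, False),
}
# Hypothetical surface adjacency (a simple chain A1-A2-A3-A4-A5).
neighbors = {
    "A1": ["A2"],
    "A2": ["A1", "A3"],
    "A3": ["A2", "A4"],
    "A4": ["A3", "A5"],
    "A5": ["A4"],
}

predicted_site = largest_patch(residues, neighbors)
print(sorted(predicted_site))  # ['A1', 'A2', 'A3']
```

Residue A5 scores highly on its own but is cut off from the others by the low-scoring, charged A4, so the contiguity requirement excludes it from the predicted patch. A trained classifier such as a Random Forest would replace the hand-set linear score in practice.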
Machine Learning in Phylogenetics
Probabilistic models of protein evolution generally assume independent amino acid substitution rates across sites, despite experimental and computational evidence suggesting that physical interactions introduce dependences between sites. We recently introduced a new model of protein evolution based on probabilistic graphical models that accounts for site-site correlations. Significantly, phylogenetic likelihoods can be efficiently calculated in this model using approximate inference methods, like Belief Propagation. When tested on sequence data for a number of protein families, the new model was found to fit the data better than traditional site-independent rate matrix models. Interestingly, the model also supported the significance of amino acid interactions across protein-protein interfaces in determining the evolutionary history for a family of multimeric enzymes. One potential application is as an improved null model for detection of evolutionary selection, which could aid in detecting disease-associated single nucleotide variants.
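For comparison, the traditional site-independent likelihood calculation can be sketched in a few lines of Python. The sketch below is the baseline only, not the correlated graphical model described above. It uses Felsenstein's pruning recursion, which is exact belief propagation on a tree, together with the equal-rates (Poisson) amino acid model as a simple stand-in for an empirical rate matrix. The tree, branch lengths, and three-column alignment are invented for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = len(AMINO_ACIDS)


def transition_prob(a, b, t):
    """Equal-rates (Poisson) model: P(a -> b) along a branch of length t,
    with t measured in expected substitutions per site."""
    decay = math.exp(-K * t / (K - 1))
    if a == b:
        return 1.0 / K + (1.0 - 1.0 / K) * decay
    return 1.0 / K - decay / K


def partial_likelihoods(node, site):
    """Pruning step: {state: P(observed leaves below node | state at node)}.

    A leaf is a sequence name (str); an internal node is a list of
    (child, branch_length) pairs. `site` maps leaf name -> observed residue.
    """
    if isinstance(node, str):
        return {a: 1.0 if a == site[node] else 0.0 for a in AMINO_ACIDS}
    result = {a: 1.0 for a in AMINO_ACIDS}
    for child, t in node:
        child_partial = partial_likelihoods(child, site)
        for a in AMINO_ACIDS:
            result[a] *= sum(transition_prob(a, b, t) * child_partial[b]
                             for b in AMINO_ACIDS)
    return result


def site_likelihood(tree, site):
    """Likelihood of one alignment column, uniform distribution at the root."""
    root = partial_likelihoods(tree, site)
    return sum(root[a] / K for a in AMINO_ACIDS)


def alignment_log_likelihood(tree, alignment):
    """Site independence: the log-likelihood is a sum over columns."""
    names = list(alignment)
    length = len(alignment[names[0]])
    total = 0.0
    for i in range(length):
        site = {name: alignment[name][i] for name in names}
        total += math.log(site_likelihood(tree, site))
    return total


# Hypothetical three-sequence tree: ((seqA:0.1, seqB:0.2):0.05, seqC:0.3).
tree = [([("seqA", 0.1), ("seqB", 0.2)], 0.05), ("seqC", 0.3)]
alignment = {"seqA": "MKV", "seqB": "MKI", "seqC": "MRV"}

log_likelihood = alignment_log_likelihood(tree, alignment)
print(log_likelihood)
```

The sum over columns in `alignment_log_likelihood` is exactly where the independence assumption enters. A model with site-site correlations cannot be factorized this way, which is why approximate inference becomes necessary.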
Graft versus host disease is a potentially serious complication following transplant surgery. We are working with an interdisciplinary team composed of transplant surgeons and bioinformaticians to develop machine learning algorithms that predict transplant outcomes based on diverse data, including the degree of donor-patient immune system match (HLA typing), demographics, previous disease history, and pre-operation lab results. Preliminary analyses on a large-scale clinical study of liver transplant patients revealed which factors were most closely correlated with graft rejection and demonstrated that the prediction approach is accurate enough to guide clinical decisions in the future.
The binding of protein fragments to class I and II MHC molecules is an essential step in immune surveillance by the adaptive immune system. The peptide-MHC complexes are exposed on the cell surface where they can activate a T-cell immune response by binding to T-cell receptors and associated co-receptors. Both classes of MHC are highly polymorphic, with hundreds of alleles in humans. Each MHC type generally binds a different set of peptides so that the number of peptide-MHC combinations is enormous. Knowledge of which peptides bind to MHCs has potential applications in vaccine discovery and understanding autoimmune disorders.
Sequence-based computational methods can rapidly predict which peptides bind a particular MHC type; however, they require large amounts of experimental data, which are expensive to obtain and unavailable for most MHC types. Structure-based computational methods are slower but potentially more general because they predict binding affinities based on universal physical or statistical properties of the modeled peptide-MHC complex and thus are not limited to a single MHC type. We have previously demonstrated that all-atom docking of peptides to class I MHC yields accurate structures, which can be used for predicting the binding affinities. We also found that fitting the prediction model to experimental data for one MHC type gives comparable accuracy for prediction of peptide binding affinities for a different MHC type. This shows that the method can generalize to different MHC types for which sufficient experimental data may be unavailable.
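The cross-type evaluation described above can be sketched as follows: fit a model on one MHC type, then measure how well it predicts affinities for another. This is a deliberately simplified Python illustration, not our actual prediction model. It assumes a single structure-derived score per peptide and a straight-line fit, and every number below is made-up toy data chosen only to exercise the code.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Toy data (NOT real measurements): one docking-derived score per peptide
# and a matching "measured" affinity, for two hypothetical MHC types.
train_scores = [-12.0, -10.5, -9.0, -7.5, -6.0]    # hypothetical type 1
train_affinity = [-9.8, -8.9, -8.1, -7.0, -6.2]
test_scores = [-11.0, -9.5, -8.0, -6.5]            # hypothetical type 2
test_affinity = [-9.1, -8.4, -7.3, -6.6]

# Fit on type 1 only, then predict for type 2 without refitting.
slope, intercept = fit_line(train_scores, train_affinity)
predicted = [slope * s + intercept for s in test_scores]

r = pearson(predicted, test_affinity)
print(f"slope={slope:.3f} intercept={intercept:.3f} cross-type r={r:.3f}")
```

The key design point is that the parameters are never refit on the second MHC type, so the reported correlation measures transfer rather than fit quality.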
An estimated 25-30% of the proteins in a variety of organisms span a lipid membrane. Knowledge of such proteins in humans is important for drug discovery since ~40% of all current drugs target membrane proteins. Most of these drug targets are G protein-coupled receptors (GPCRs).
Computational methods are useful for modeling GPCR structures because only a few high-resolution experimental structures are available. We are working closely with experimental collaborators in order to understand the function of GPCRs at the structural level. We are particularly interested in GPCR dimerization, which recent experiments show occurs for many GPCRs.