Elucidating the Determinants of Conserved Protein Surface Solvation Using a Genetic Algorithm and Nearest Neighbor Classifier

Michael L. Raymer, Paul C. Sanschagrin, William F. Punch, Erik D. Goodman, and Leslie A. Kuhn

Protein-ligand binding is almost universally mediated by water molecules bound in the interface, yet the structural chemistry governing water-mediated recognition remains largely unexplored. A key element in incorporating bound water into docking and ligand design is determining which water molecules will participate in water-mediated contacts and which will be displaced upon ligand binding. We address this problem using a hybrid genetic algorithm/k-nearest-neighbor classifier trained to recognize first-shell water molecules conserved upon ligand binding, based on their physical and chemical environment in the context of the ligand-free protein. For each first-shell water molecule in 30 non-homologous protein structures, eight features reflecting the environment of the water molecule were measured. These features included the crystallographic temperature factor (B-value) and mobility (B-value normalized by occupancy) of the water molecule, the number of hydrogen bonds from the water molecule to the protein and to other water molecules, and the local atomic density, atomic hydrophilicity, and average and net B-value of the protein atoms neighboring the water molecule. These features were then used to train a k-nearest-neighbor classifier to identify first-shell water molecules in the ligand-free protein that are likely to be conserved in the ligand-bound structure. A genetic algorithm was used to test various subsets of the available features and to scale each feature's values to improve classification accuracy. Maximal predictive accuracy for first-shell water molecules was attained using a subset of four of the eight available features: B-value, mobility, and the number of hydrogen bonds to protein atoms and to water molecules; the relative weights determined for these features were 0.413, 0.315, 0.135, and 0.137, respectively.
This weighted set of features was sufficient to allow the classifier to predict conservation of first-shell water molecules with an accuracy of 64.2% in unbiased cross-validation tests, which is significantly greater than the accuracy obtained by unweighted k-nearest-neighbor classification or discriminant analysis. By rewarding the genetic algorithm for using fewer features in classification, we also identified a two-feature set, with weights of 0.667 and 0.333 for water B-value and mobility, which resulted in a very similar predictive accuracy of 63.6%. This indicates that the thermal mobility of a water site, more than its hydrogen-bond count, is the main predictor of whether a water site will be conserved.
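
The role of the feature weights can be illustrated with a minimal sketch of a weighted k-nearest-neighbor classifier. It is not the implementation used in this work: it assumes that each weight multiplies the corresponding feature difference inside a Euclidean distance, that feature values have already been normalized to a common range, and that k = 3. The function names and the toy training data are hypothetical; only the four weights are taken from the abstract above.

```python
import math
from collections import Counter

# Feature order: water B-value, mobility, H-bonds to protein, H-bonds to water.
# These four weights are the values reported in the abstract.
WEIGHTS = (0.413, 0.315, 0.135, 0.137)


def weighted_distance(a, b, weights=WEIGHTS):
    """Euclidean distance after scaling each feature difference by its weight.

    Assumption: a weight of 0 removes a feature from the comparison entirely,
    which is how feature selection and feature scaling can share one vector.
    """
    return math.sqrt(sum((w * (x - y)) ** 2 for w, x, y in zip(weights, a, b)))


def knn_predict(train, query, k=3, weights=WEIGHTS):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_tuple, label) pairs.
    """
    nearest = sorted(
        train, key=lambda item: weighted_distance(item[0], query, weights)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, made-up training points (features assumed normalized to [0, 1]).
TOY_TRAIN = [
    ((0.10, 0.10, 0.8, 0.5), "conserved"),
    ((0.20, 0.15, 0.6, 0.4), "conserved"),
    ((0.15, 0.20, 0.7, 0.3), "conserved"),
    ((0.80, 0.90, 0.2, 0.1), "displaced"),
    ((0.90, 0.85, 0.1, 0.2), "displaced"),
    ((0.70, 0.80, 0.3, 0.2), "displaced"),
]

if __name__ == "__main__":
    print(knn_predict(TOY_TRAIN, (0.2, 0.2, 0.5, 0.5)))
```

In this sketch, a genetic algorithm would search over the weight vector (and possibly k), scoring each candidate by the classification accuracy it yields; a feature whose weight is driven to zero is effectively dropped from the feature set.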

Poster presented at the UCSF-MDI Meeting
"Molecular Recognition in Drug Design: Docking and Scoring"
San Francisco, CA, February 6-7, 1998