Interdisciplinary Initiatives Program Round 2 – 2002

Doug Brutlag, Biochemistry
Daphne Koller, Computer Science

Many cellular functions are carried out by proteins and the interactions between them. The large-scale genomic data sets produced over the last few years provide an opportunity to obtain a genome-wide view of cellular activity. This project applies statistical machine learning methods to these large but noisy data sets in order to analyze protein function and interactions.

In particular, the project has two aims. First, it seeks to use motifs – fine-grained functional elements of a protein sequence – to predict a protein's structure, its function, and its interactions with other proteins. Second, it aims to integrate heterogeneous data sources – such as protein sequence characteristics (including motifs), protein fold, mRNA expression levels, and protein-protein interaction data – to obtain more robust predictions and a more global understanding of protein activity.