Mining a Gene Expression Database

Wednesday, October 1, 2003 - 1:30pm - 2:20pm
Keller 3-180
Peter Munson (National Institutes of Health)
The now widespread interest in gene expression is motivated by the promise of important new findings in disease and basic biology research. Because of cost constraints, most designed studies are relatively small, involving from 2 to 100 chips. Pooling results across studies allows one to compare expression across potentially thousands of conditions, with greater promise of additional insights. The NIHLIMS database houses data from about 30 ongoing studies at NIH, comprising about 1500 Affymetrix chips, and provides a platform for testing data mining approaches.

Serious data comparability challenges are encountered here, some of which can be addressed with appropriate data normalization. We investigate factors that distinguish patterns of expression. In addition to many technical factors, the cell or tissue type from which mRNA is prepared seems to be a primary source of variability. As a consequence, tissue-specific genes can be identified by this approach. Limited demographic information may also be available, permitting, for example, the determination of gender-specific gene expression patterns. In one particular study, tissue-specific genes identified in human were compared with those identified in the homologous rodent tissue, allowing for an evolutionary comparison of the relevant expression mechanisms.
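
As a rough illustration of the two ideas above, the sketch below normalizes chips pooled from different studies and then scores genes for tissue specificity. The abstract does not say which normalization or scoring method is used; quantile normalization and a simple difference of means are assumptions chosen here for brevity, and the function names, tissue labels, and toy numbers are hypothetical.

```python
import numpy as np

def quantile_normalize(expr):
    """Give every chip (column) the same empirical distribution.

    Assumed technique: quantile normalization. Ties are broken by
    position, which is a simplification.
    """
    expr = np.asarray(expr, dtype=float)
    # Rank of each gene within its own chip (0 = lowest signal).
    ranks = np.argsort(np.argsort(expr, axis=0), axis=0)
    # Reference distribution: mean of the sorted values across chips.
    reference = np.sort(expr, axis=0).mean(axis=1)
    return reference[ranks]

def tissue_specificity(expr, tissues):
    """Per gene: mean signal in a tissue minus mean over all other chips."""
    tissues = np.asarray(tissues)
    labels = sorted(set(tissues.tolist()))
    scores = np.column_stack([
        expr[:, tissues == t].mean(axis=1) - expr[:, tissues != t].mean(axis=1)
        for t in labels
    ])
    return labels, scores

# Toy data (made up): 4 genes x 4 chips; the last chip is on a larger scale,
# standing in for a study-to-study technical difference.
raw = np.array([
    [8.0, 9.0, 1.0, 3.0],
    [1.0, 2.0, 8.0, 18.0],
    [4.0, 5.0, 4.0, 9.0],
    [5.0, 6.0, 5.0, 11.0],
])
tissues = ["liver", "liver", "brain", "brain"]

norm = quantile_normalize(raw)
labels, scores = tissue_specificity(norm, tissues)
# Genes with a large positive score in one column are candidates for
# being specific to that tissue.
print(labels)
print(scores)
```

In practice a score like this would be paired with a measure of variability and a correction for multiple testing before any gene is called tissue-specific.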

We discuss several of the statistical techniques needed to compare data across studies, and present a list of challenges now facing data miners.