Advances in sequencing and DNA replication technologies have made it possible to sequence the entire genomes of many organisms. By the end of 1998, the genomes of more than 20 species had been completely sequenced, and the number is estimated to exceed 100 in the next two years. Concurrently with the various genome sequencing projects, powerful technologies are also being developed for the simultaneous measurement of the differential expression of each individual gene in a genome under a particular set of environmental conditions. These technologies utilize DNA hybridization reactions between the complementary DNA of a sample and thousands of DNA probes, each specific to an individual gene, immobilized at high resolution in precise locations on a glass or membrane substrate. Along with protein content measurements (2-D gels) and estimates of in vivo metabolic fluxes, gene expression data contain, in principle, the information needed to meaningfully decipher gene regulation and elucidate cell physiology.
In contrast to the impressive progress in developing analytical technologies and instrumentation, systematic methods for the effective analysis of the data generated by these technologies have received scant attention. This presentation will attempt to define questions pertinent to the overall effort of upgrading the information content of sequence, gene expression, and in vivo flux data. An integrated approach will be presented along with applications of specific data mining methodologies. Our overall objective is to use these technologies as efficiently as possible to extract valuable biological information that will allow researchers to synthesize roadmaps of cellular function, with serious implications for medicine and pharmacology, as well as for biotechnology and biology in general.