# IMA Tea and more (with POSTER SESSION)

Monday, September 29, 2003 - 3:40pm - 5:00pm

Lind 400

**Fast Loess for Normalizing Microarray Data**

Karla Ballman (Mayo Clinic)

Joint work with Ann Oberg and Terry Therneau.

Various methods have been developed for normalizing high-density oligonucleotide arrays (as well as other gene expression microarray technologies) so that meaningful comparisons of gene expression levels can be made across arrays (experiments). The most useful methods are those with two explicit features: (1) they use data from all arrays in the proposed comparison to perform the normalization, and (2) they account for the non-linear relationship of intensities among arrays. Commonly used non-linear normalization techniques include cyclic loess and quantile normalization. We propose a new method, fast loess, which is similar in concept to cyclic loess normalization but uses a linear models argument to normalize all arrays at once. Results comparing the performance of cyclic loess, quantile normalization, and fast loess on simulated and real data will be presented. Fast loess and cyclic loess produce similar results, but fast loess is considerably faster. Both fast loess and cyclic loess produce superior results to quantile normalization.

**Cancer tumor classification using gene expression data**

Adriana Lopez (University of Pittsburgh)

In the late 1990s, biotechnologies such as microarrays were developed, and their use in cancer research has increased because they can lead to a more precise and reliable classification of cancer tumors. This research was concerned with discriminant analysis, that is, the classification of cancer tumors into previously known classes using gene expression data from microarrays, by means of kernel density estimation and combinations of classifiers based on this methodology. This technique was compared to other well-known discriminant analysis techniques using the misclassification proportion, estimated using training and test sets, both fixed and obtained by the 2:1 sampling scheme. The fixed kernel classifiers and the adaptive kernel classifiers performed equally well on the three data sets that were studied, and the kernel classifier was generally the best nonparametric classifier.

**Hidden Markov Models for Microarray Time Course Data in Multiple Biological Conditions**

Ming Yuan (University of Wisconsin, Madison)

Motivated by several real applications, an approach is proposed to compare expression profiles from different biological conditions over time. It is based on a hidden Markov model (HMM) with states corresponding to expression patterns across conditions. To investigate properties of the proposed approach, we have implemented the HMM assuming a parametric hierarchical mixture model for the emissions, here intensities. As shown in simulation studies comparing the HMM approach to one which simply overlooks the correlation over time, both the sensitivity and the specificity increase substantially without sacrificing the false discovery rate. I will present a detailed analysis of the methodology and its performance.

This is joint work with Prof. Kendziorski (see her talk on Thursday).

**Gene Expression Profiling of Human Oral Cancer Using cDNA Microarrays**

Shilpi Arora (Princess Margaret Hospital)

Oral Squamous Cell Carcinoma (OSCC) is a clinically heterogeneous disease. Patients with stage-matched tumors show differences in treatment response and outcome, suggesting that a subclassification system may be possible. In the present study, we used cDNA microarrays and a novel method of analysis (Binary Tree-Structured Vector Quantization - BTSVQ) to classify 20 OSCC samples based on their gene expression profiles. BTSVQ analysis combines k-means clustering and self-organizing maps in a complementary fashion. In our study, the binary tree generated by BTSVQ revealed groups of patients that significantly correlated with male gender (P=0.035), T III-IV disease stage (P=0.035), and nodal metastasis (P=0.035). Further data mining revealed a subset of genes present in the sample cluster that was enriched for node-positive tumors, and that may thus represent potential biomarkers for metastasis. The differential expression of these genes was validated by quantitative real-time PCR. We conclude that molecular subtyping of OSCC can identify distinct patterns of gene expression that correlate with clinical-pathological parameters. The genes identified may influence tumor growth, development and metastasis due to the overexpression of normal gene products, gene amplification or mutation. They may therefore represent potential biomarkers for oral carcinomas. Our findings may help to form the basis for a molecular classification of OSCC, thus improving diagnosis, therapeutic decisions and outcome for patients with this lethal disease.

Joint work with Giles C. Warner, Patricia P. Reis1, Igor Jurisica2,3,4, Mujahid Sultan4, Christina Macmillan5, Mahadeo Sukhai1,2, Reidar Grenman6, Richard A. Wells1, Dale Brown7, Ralph Gilbert7, Patrick Gullane7, Jonathan Irish7, and Suzanne Kamel-Reid*1,2,5.

1 Departments of Cellular and Molecular Biology, 2 Medical Biophysics, 3 Computer Science, 4 Cancer Informatics, 5 Laboratory Medicine and Pathobiology, 7 Otolaryngology/Surgical Oncology, University of Toronto, Princess Margaret Hospital, Ontario Cancer Institute, Toronto, Ontario, Canada. 6 Department of Otolaryngology, Turku University Central Hospital, Turku, Finland.

**Modeling Differential Gene Expression using a Dirichlet Process Mixture Model**

David Dahl (University of Wisconsin, Madison)

The literature has given considerable attention to the task of identifying differentially expressed genes using data from DNA microarrays. This poster proposes a conjugate Dirichlet Process mixture model which naturally incorporates any number of treatment conditions, clusters genes based on their treatment effects and variance, and readily makes general inference on the treatment effects and variance. As a consequence of the model, probabilities of co-regulation are available and there is no need to estimate the correct number of clusters. Any number of hypotheses concerning the parameters can be tested, and false discovery rates are easily computed. The proposed methods are applied to a dataset of 10,043 genes measured at 10 treatment conditions with 3 replicates per treatment.

**A Bayesian Method for Class Discovery and Gene Selection**

Mahlet Tadesse (Texas A&M University)

Joint work with Naijun Sha and Marina Vannucci.

The analysis of the high-dimensional data (p ≫ n) generated by DNA microarrays poses a challenge to standard statistical methods. This has revived a strong interest in clustering algorithms. A typical goal in these analyses is the discovery of new classes of disease and the identification of relevant genes. Currently, investigators first resort to data filtering procedures or dimension reduction techniques before clustering the data. In addition, the clustering algorithms that are widely used do not provide an objective way to assess the number of classes. We propose a Bayesian method that simultaneously identifies the number of clusters in the data and selects the genes that best discriminate the different groups.

**Data Depth Based Clustering and Classification**

Rebecka Jornsten (Rutgers, The State University of New Jersey)

Clustering and classification are important tasks for the analysis of microarray gene expression data. Classification of tissue samples can be a valuable diagnostic tool for diseases such as cancer. Clustering samples or experiments may lead to the discovery of subclasses of diseases. Clustering can also help identify groups of genes that respond similarly to a set of experimental conditions. In addition to these two tasks it is useful to have validation tools for clustering and classification. Here we focus on the identification of outliers - units that may have been misallocated, or mislabeled, or are not representative of the classes or clusters. We present two new methods: DDclust and DDclass, for clustering and classification. These non-parametric methods are based on the intuitively simple concept of data depth. We apply the methods to several gene expression and simulated data sets. We also discuss a convenient visualization and validation tool - the Relative Data Depth (ReD) plot.

**Using Text Mining Networks for the Context Specific Interpretation of Expression Data**

Achim Tresch (Fraunhofer Institute)

Joint work with Christian Gieger, Daniel Hanisch, Juliane Fluck, Hartwig Deneke (Fraunhofer-Institute SCAI, Sankt Augustin, Germany), Tobias Mittelstädt, and Albert Becker (Institute for Neuropathology, University Hospital, Bonn, Germany).

Gene expression data are most often analysed without utilizing biomedical a-priori knowledge. The inclusion of metabolic, regulatory or protein-protein interaction networks into the analysis process itself provides a way to put results of expression experiments into a biological context. Unfortunately, network information stored in databases is oftentimes incomplete or not specific enough with respect to certain species or cell types. For this reason, we developed text mining methods for the construction of interaction networks based on biomedical free text. These methods were applied to the complete set of MEDLINE abstracts and resulted in a substantial network of protein relations. In this method, we used an automatically generated and curated gene/protein dictionary together with a biomedical grammar which defines rules to extract concepts describing relevant relations between genes/proteins and other biological entities. The resulting text mining network can be used for explorative data analysis by mapping the results of gene expression experiments onto the network. For this purpose, the ToPNet application was developed. Besides its visualization capabilities, it is able to identify sub-networks relevant according to observed expression patterns by applying a new method called Significant Area Search. Our approach was successfully applied to data from two sets of gene expression experiments in the context of epilepsy and brain cancer research.

**Pooling or not pooling in microarray experiments - an experimental design point of view**

Kenny Ye (The State University of New York)

Joint work with Anil Dhundale, Department of Biomedical Engineering and Center for Biotechnology, SUNY at Stony Brook, Stony Brook, New York 11794-2580, (631) 632-8521, anil.dhundale@sunysb.edu

Microarray experiments are often used to detect differences in gene expression between two populations of cells: a test population versus a control population. However, in many cases, such as with individuals in a population, the biological variability can present changes that are irrelevant to the question of interest, and it then becomes important to assay many individual samples to collect statistically meaningful results. Unfortunately, the cost of performing some types of microarray experiments can be prohibitive. A potentially effective but not well publicized alternative is to pool individual RNA samples together for hybridization on a single microarray. This method can dramatically reduce the experimental costs while maintaining high power in detecting the changes in expression levels that relate to the specific question of interest. In this talk, we will discuss why this technique works and the optimal design strategy for pooling. This idea will also be illustrated by a synthetic experiment and a real experiment that studies Afib (cardiac atrial fibrillation), a serious health condition that affects a large percentage of the population but whose mechanism remains poorly understood.

**Why Infinite-dimensional topological groups may work for Genetics data**

Boris Khots (Compressor Controls Corporation)

Infinite-dimensional P-spaces, and the infinite-dimensional P-groups and P-algebras connected with them, may play a significant role in mathematical modeling and in algorithms for processing genetic data such as gene expression data (B.S. Khots, Groups of local analytical homeomorphisms of line and P-groups, Russian Math Surveys, v.XXXIII, 3 (201), Moscow, London, 1978, 189-190). Investigating the topological-algebraic properties of P-spaces, P-groups and P-algebras is connected with the solution of the infinite-dimensional version of Hilbert's fifth problem. In genetic data processing, these topological-algebraic properties may make it possible to determine gene function. We applied these methods to the Yeast Rosetta and Lee-Hood gene expression data and to the leukemia ALL-AML gene expression data, and found sets of gene-gene and gene-trait dependencies. In particular, the accuracy of leukemia diagnosis is 0.97. Conversely, genetics requires the solution of new mathematical problems. For example, what are the topological-algebraic properties (subgroups, normal subgroups, normal series, P-algebras, subalgebras, ideals, etc.) of a P-group that is finitely generated by local homeomorphisms of some manifold onto itself?

**Procedure for standardisation and normalisation of cDNA microarrays**

Pim Kuurman (ID-Lelystad)

Joint work with M.H. Pool, B. Hulsegge, L.L.G Janss, J.M.J. Rebel, and S. van Hemert.

Expression levels for large numbers of genes under different conditions can be measured using microarrays. In livestock species, cDNA arrays are often used for this purpose, because the complete genome sequences needed to engineer oligo arrays are not yet available, and cDNA arrays allow the direct use of available cDNA libraries. However, cDNA arrays exhibit larger variability than oligo arrays and therefore require more care to reduce noise and to standardise and normalise the data. They also require somewhat different statistical approaches for analysis because, unlike in oligo-array technology, two samples are measured on the same slide. This poster describes procedures developed to treat such data, consisting of: (1) correction for background using special blank spots; (2) automatic outlier treatment using iteratively reweighted analysis to allow for a robust fit, similar to using medians; (3) a lowess fit to allow for dye bias in the ratios that varies with intensity; (4) a procedure to identify poor duplicated values (one duplicate is made within a slide) by fitting a heterogeneous variance contour to allow for increasing repeatability with increasing intensity; (5) fitting of a heterogeneous variance contour for sample values to allow for decreasing variance with increasing intensity, used to provide weights for a weighted analysis. The procedure is illustrated on a data set showing differences in gene expression levels between malabsorption syndrome infected and control chickens.

**Extreme-Value Distribution Based Gene Selection Criteria for Discriminant Microarray Data Analysis Using Logistic Regression**

Wentian Li (North Shore - LIJ Research Institute)

Joint work with Fengzhu Sun (Department of Biological Sciences, Molecular and Computational Biology Program, University of Southern California, USA) and Ivo Grosse (Bioinformatics Center Gatersleben-Halle, Institute for Plant Genetics and Crop Plant Research, Germany).

We present a calculation of the expected maximum likelihood and the p-value for the top gene selected by logistic regression. This calculation is based on the maximum likelihood of the null model and the extreme value distribution of chi-square variables. Based on this calculation, we propose two corresponding gene selection criteria: the E-criterion and the P-criterion. In the E-criterion, a gene is selected if its maximum likelihood is greater than that of the top gene under the null model. In the P-criterion, a gene is selected if its p-value according to the null distribution of the top gene is smaller than a pre-determined value. Both gene selection criteria are conservative because non-top-ranked genes are judged by the expected value of the top gene. As a result, a much more compact set of genes is selected.
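As a rough illustration of the P-criterion, the sketch below computes a p-value against the null distribution of the top gene. It rests on assumptions that are not taken from the abstract: each gene's statistic is twice the log-likelihood ratio of a single-gene logistic regression against the null model, that statistic is approximately chi-square with one degree of freedom under the null, and the genes are independent. The function names are hypothetical.

```python
import math

def chi2_1df_sf(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    # X = Z^2 with Z standard normal, so P(X > x) = P(|Z| > sqrt(x)).
    return math.erfc(math.sqrt(x / 2.0))

def top_gene_p_value(lr_stat, n_genes):
    """P(max of n_genes independent chi-square(1) statistics >= lr_stat).

    Assumption: genes are independent under the null model.
    """
    p_single = chi2_1df_sf(lr_stat)
    if p_single >= 1.0:
        return 1.0
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_genes * math.log1p(-p_single))

def select_by_p_criterion(lr_stats, threshold=0.05):
    """Indices of genes whose top-gene p-value is below the threshold."""
    n = len(lr_stats)
    return [i for i, s in enumerate(lr_stats) if top_gene_p_value(s, n) < threshold]
```

Under these assumptions the top-gene p-value never exceeds the Bonferroni bound of `n_genes` times the single-gene p-value, which is one way to see why the criterion is conservative for genes that are not ranked first.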

References:

[1] W Li, I Grosse (2003), Gene selection criterion for discriminant microarray data analysis based on extreme value distributions, in RECOMB03: Proceedings of the Seventh Annual International Conference on Computational Biology, pp. 217-223 (ACM Press).

[2] W Li, F Sun, I Grosse (2003), Extreme-value distribution based gene selection criteria for discriminant microarray data analysis using logistic regression, submitted to Journal of Computational Biology.

**Joint Estimation of Calibration and Expression for High-density Oligonucleotide Arrays**

Ann Oberg (Mayo Clinic)

Joint work with Karla V. Ballman, Douglas W. Mahoney, and Terry M. Therneau.

There is an increasing awareness that the analysis of high-density oligonucleotide arrays is better modeled as a holistic rather than a piecemeal process. Affymetrix software summarizes each chip (including scaling, background subtraction, and removal of outliers) separately, with the results of that summarization passed forward to the next stage of analysis. Li (2001) introduced a model-based analysis, where all chips for a given experimental condition were fit in a single model, giving a more complete and accurate picture of both data errors and the fit. Chu (2002) recently extended this idea, using a random-effects model to encompass all chips in an experiment at once. For all of these, however, normalization of the data is done as a separate prior process. We propose a method that integrates the normalization, visualized as chip specific calibration curves based on differential binding characteristics, along with model fitting incorporating experimental design in a unified algorithm. The ability to incorporate experimental design into both the normalization process and the fit leads to more efficient and less biased estimates of the tissue gene expressions. Affycomp results will be presented.

**Bayesian Analysis of Microarrays**

Lidia Rejto (University of Delaware)

Microarray technology enables the assessment of expression patterns of thousands of genes over time and under multiple conditions. The analysis of these patterns requires detecting whether observed differences in expression levels are significant or not. To perform the analysis, one must first normalize the data. Here we present a stochastic model offering a method to normalize the data and to detect differentially expressed genes. The model is appropriate to deal with more than two experimental conditions or time series experiments.

We construct a model to describe the stochastic relationship between the real and the measured gene-expression levels. We introduce a Bayesian component, which assumes that there is a prior probability for the event that the real expression levels are different under different conditions. The prior probability of the Bayesian component is estimated, together with the other model parameters, by the maximum-likelihood method. Given the estimated model parameters, we estimate the real gene-expression levels as conditional expectations. Furthermore, for each gene the posterior probability of differential expression is given. We estimated the variances of the estimates of the model parameters by bootstrapping. The fitted parametric model was validated by verification of differential gene expression with real-time quantitative RT-PCR (qRT-PCR) analysis. The comparison shows that the stochastic model is adequate in identifying differentially expressed genes on microarrays.
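The posterior probability of differential expression can be illustrated with a generic two-component mixture. The sketch below is not the BAM model: it assumes, purely for illustration, that a gene's observed log-ratio is normal with mean zero in both components and that differentially expressed genes have the larger standard deviation. All names and parameter values are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_diff_expr(log_ratio, prior_de, sd_null, sd_alt):
    """Bayes rule: P(differentially expressed | observed log-ratio).

    prior_de is the prior probability of differential expression;
    sd_null and sd_alt are the assumed spreads of the two components.
    """
    f_null = normal_pdf(log_ratio, 0.0, sd_null)
    f_alt = normal_pdf(log_ratio, 0.0, sd_alt)
    weighted_alt = prior_de * f_alt
    return weighted_alt / (weighted_alt + (1.0 - prior_de) * f_null)
```

With a prior of 0.1, a log-ratio near zero pushes the posterior below the prior, while a log-ratio far out in the tail of the null component pushes it towards one.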

The software BAM (Bayesian Analysis of Microarrays) is available online at http://udgenome.ags.udel.edu/~cogburn/Gene_Expression_Studies.htm, or contact bukszar@eecis.udel.edu.

Joint work with Gábor Tusnády,1 József Bukszár,2 Guang Gao,2 and Larry Cogburn.3

1 Alfréd Rényi Mathematical Institute of the Hungarian Academy of Sciences, Budapest, P.O. Box 127, H-1364, Hungary

2 Delaware Biotechnology Institute, 15 Innovation Way, Newark, 19711, USA

3 University of Delaware, Department of Animal and Food Sciences, Newark, DE 19717, USA

**Statistical Inference Methods for Detecting Altered Gene Associations**

Hae-Hiang Song (The Catholic University of Korea)

Joint work with Sang-Heon Yoon and Je-Suk Kim.

In many gene expression studies, the assumption is that knowledge of where and when a gene is expressed carries important information about what the gene does. We consider the problem of understanding gene functions using microarray expression data from histologically progressive grades, starting from dysplastic nodule in cirrhotic liver and ending with hepatocellular carcinoma Edmondson grade III. The statistical procedures are divided into two parts. First, the microarray data are suitably normalized, using methods that include analysis of variance (ANOVA). Opinions on the currently used normalization methods differ widely, so these methods first need to be compared before proceeding to the second part, the statistical analysis of gene-pair associations. Based on the assumption that the union of the significant genes from these normalization methods includes sufficiently general and well-defined differentially expressed genes, the second part searches for evidence of altered gene-gene relationships with progression of disease. Significantly altered gene-pair associations are identified with the ratio of gene-pair correlations. When we speak of the difference between normal and tumor expression patterns, in a broad sense this covers not only the information summarized by the first moment, the average expression level, but also correlation changes between two stages, and this kind of exploration extends to higher-order moments. The need to study association changes arises naturally when analyzing gene expression levels of multiple arrays obtained at different stages of progression. We identify altered gene-gene relationships with replicated microarray expression data.
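A minimal sketch of comparing one gene pair across two stages follows. The abstract does not give the exact form of the statistic, so the plain ratio of Pearson correlations used here is an assumption, as are the function names and the toy data.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return math.nan  # a constant profile has no defined correlation
    return sxy / math.sqrt(sxx * syy)

def correlation_ratio(early_a, early_b, late_a, late_b):
    """Return (r_early, r_late, r_late / r_early) for one gene pair.

    Assumption: the ratio is taken late over early; it is nan when the
    early-stage correlation is zero or undefined.
    """
    r_early = pearson(early_a, early_b)
    r_late = pearson(late_a, late_b)
    if math.isnan(r_early) or r_early == 0.0:
        return r_early, r_late, math.nan
    return r_early, r_late, r_late / r_early

# Toy replicated measurements: the pair moves together early, oppositely late.
r_early, r_late, ratio = correlation_ratio(
    [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0],
    [1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0],
)
```

A ratio far from one flags a gene pair whose association changed between stages; judging significance would need a reference distribution, which this sketch does not attempt.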

Keywords: oligonucleotide array, normalization, correlation ratio statistic, hepatic nodular lesions

**A Spectral Clustering Method for Microarray Data**

Joseph Beyene (Hospital for Sick Children)

Joint work with David Tritchler and Shafagh Fallah.

Cluster analysis is a commonly used dimension reduction technique. We introduce a clustering method computationally based on eigenanalysis. Our focus is on large problems, and we present the method in the context of clustering genes and arrays using microarray expression data. The computational algorithm for the method has complexity linear in the number of genes. We also introduce a method for assessing the number of clusters exhibited in microarray data based on the eigenvalues of a particular matrix.

**A Theoretical Framework For Reconstructing Missing Data in Genome-Wide Matrix**

Shmuel Friedland (University of Illinois, Chicago)

This is joint work with Amir Niknejad.

Over the last decade, molecular biologists have been using DNA microarrays (chips) as a tool for analyzing information embedded in gene expression data. During the laboratory process, some spots on the array may be missed and the probing of some genes might fail. Making chips to probe genes (DNA microarrays) is still very costly.

There have been several attempts by molecular biologists, statisticians, and computer scientists to recover the missing gene expressions by ad hoc methods. Most recently, microarray gene expression has been formulated as a gene-array matrix. In this setting, the analysis of missing gene expression on the array translates to recovering some missing entries in the gene-expression matrix.

The most common methods for recovery are: (a) various clustering analysis methods, such as K-nearest neighbor clustering and hierarchical clustering; (b) SVD (Singular Value Decomposition). In these methods, the recovery of missing data is done independently, i.e. the completion of each missing entry does not influence the completion of other entries.
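To make the independent completion in (a) concrete, here is a minimal K-nearest-neighbor imputation sketch. It assumes a genes-by-arrays matrix stored as lists with `None` for missing entries, a root-mean-square distance over the columns two genes both have observed, and a plain average of the neighbors' values. These choices and the name `knn_impute` are illustrative assumptions, not the procedure of any specific package.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries using the k nearest genes (rows).

    Distances are always computed from the original matrix, so the value
    filled into one entry never influences any other entry.
    """
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(matrix):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if nearest:  # leave the entry as None if no neighbor is usable
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled

# Toy expression matrix: gene 1 is missing its value on the third array.
example = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [5.0, 6.0, 7.0],
    [1.0, 1.9, 3.2],
]
completed = knn_impute(example, k=2)
```

In the toy matrix, the two genes closest to gene 1 are genes 0 and 3, so the missing entry is filled with the average of their third-array values.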

We suggest here a new method in which the completion of missing entries is done simultaneously, i.e. the completion of one missing entry influences the completion of other entries. Our method is closely related to the methods and techniques for solving inverse eigenvalue problems.