Probability
and Statistics in Complex Systems: Genomics, Networks, and Financial
Engineering, September 1, 2003 - June 30, 2004
Abstracts:
September
29-October 3, 2003
David
Allison
(Section on Statistical Genetics, Department of Biostatistics,
University of Alabama at Birmingham) DAllison@ms.soph.uab.edu
Applying
High-Dimensional Approaches to Microarray Research Slides:
pdf
Although termed the post-genomic era, our age may be more accurately
labeled the genomic era. Draft sequences of several genomes
coupled with new technologies allow study of the influences
and responses of entire genomes rather than isolated single
genes. This opens a new realm of highly dimensional biology
(HDB) where questions involve multiplicity at unprecedented
scales. HDB can involve thousands of genetic polymorphisms,
gene expression levels, protein measurements, genetic sequences,
or any combination of these and their interactions. Such situations
demand creative approaches to the processes of inference, estimation,
prediction, classification, and study design. Although bench
scientists intuitively grasp the need for flexibility in the
inferential process, elaboration of formal statistical frameworks
supporting this is just beginning. I will discuss some of the
unique statistical challenges facing investigators studying
high-dimensional biology, describe some approaches being developed
by scientists at UAB and elsewhere and offer an epistemological
framework for the validation of proffered statistical procedures.
Shilpi
Arora
(Cellular and Molecular Biology, Princess Margaret Hospital/Ontario
Cancer Institute, Toronto) sarora@uhnres.utoronto.ca
Gene
Expression Profiling of Human Oral Cancer Using cDNA Microarrays
(poster session)
Oral
Squamous Cell Carcinoma (OSCC) is a clinically heterogeneous
disease. Patients with stage-matched tumors show differences
in treatment response and outcome, suggesting that a subclassification
system may be possible. In the present study, we used cDNA microarrays
and a novel method of analysis (Binary Tree-Structured Vector
Quantization - BTSVQ) to classify 20 OSCC samples based on their
gene expression profiles. BTSVQ analysis combines k-means clustering
and self-organizing maps in a complementary fashion. In our
study, the binary tree generated by BTSVQ revealed groups of
patients that significantly correlated with male gender (P=0.035),
T III-IV disease stage (P=0.035), and nodal metastasis (P=0.035).
Further data mining revealed a subset of genes present in the
sample cluster that was enriched for node-positive tumors; these genes
may thus represent potential biomarkers for metastasis. The differential
expression of these genes was validated by quantitative real-time
PCR. We conclude that molecular subtyping of OSCC can identify
distinct patterns of gene expression that correlate with clinical-pathological
parameters. The genes identified may influence tumor growth,
development and metastasis due to the overexpression of normal
gene products, gene amplification or mutation. They may therefore
represent potential biomarkers for oral carcinomas. Our findings
may help to form the basis for a molecular classification of
OSCC, thus improving diagnosis, therapeutic decisions and outcome
for patients with this lethal disease.
Joint
with Giles C. Warner, Patricia
P. Reis1, Igor Jurisica2,3,4,
Mujahid Sultan4, Christina
Macmillan5, Mahadeo
Sukhai1,2, Reidar Grenman6,
Richard A. Wells1, Dale
Brown7, Ralph Gilbert7,
Patrick Gullane7, Jonathan
Irish7, and Suzanne Kamel-Reid*1,2,5.
1
Departments of Cellular and Molecular Biology, 2
Medical Biophysics, 3 Computer Science, 4
Cancer Informatics, 5 Laboratory Medicine and Pathobiology,
7 Otolaryngology/Surgical Oncology, University of
Toronto, Princess Margaret Hospital, Ontario Cancer Institute,
Toronto, Ontario, Canada. 6 Department of Otolaryngology,
Turku University Central Hospital, Turku, Finland.
Keith
Baggerly (M.D. Anderson Cancer Center) kabagg@odin.mdacc.tmc.edu
The
Analysis of Proteomics Spectra from Serum Samples
Slides: pdf
Mass
spectrometry profiles can provide quick summaries of the relative
levels of hundreds of proteins. By surveying profiles from a
large number of samples, we can hopefully zoom in on proteins
that are linked with a difference of interest such as the presence
or absence of cancer. Using examples from two case studies,
we will address issues of experimental design, data cleaning
and processing, discriminating subsets, and protecting against
spurious structure.

Karla
Ballman
(Division of Biostatistics, Mayo Clinic) Ballman.Karla@mayo.edu
Fast
Loess for Normalizing Microarray Data (poster
session)
Joint
work with Ann Oberg and Terry
Therneau.
Various
methods have been developed for normalizing high-density oligonucleotide
arrays (as well as other gene expression microarray technologies)
so that meaningful comparisons of gene expression levels can
be made across arrays (experiments). The most useful methods
are those with two explicit features: (1) they use data from
all arrays in the proposed comparison to perform the normalization,
and (2) they account for the non-linear relationship of intensities
among arrays. Commonly used non-linear normalization techniques
include cyclic loess and quantile normalization. We propose
a new method, fast loess, which is similar in concept to cyclic
loess normalization but uses a linear models argument to normalize
all arrays at once. Results comparing the performance of cyclic
loess, quantile normalization, and fast loess on simulated and
real data will be presented. Fast loess and cyclic loess produce
similar results, but fast loess is considerably faster
than cyclic loess. Both fast loess and cyclic loess produce
results superior to those of quantile normalization.
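As a point of reference for the comparison above, the following is a minimal sketch of quantile normalization, the baseline named in the abstract. It is not the fast loess method, whose details the abstract does not give. The function name and the tie-handling rule are assumptions made for illustration.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of intensities per array).

    Illustrative sketch: every array is forced to share the same empirical
    distribution, namely the mean of the sorted intensities across arrays.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Reference distribution: mean of the k-th smallest value over all arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]

    normalized = []
    for a in arrays:
        # Rank probes within the array; ties are broken by position
        # (a simplification, other tie rules are possible).
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After normalization each array contains the same set of values, arranged according to the original within-array ranks.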
Joseph
Beyene
(Department of Public Health Sciences, University of Toronto)
joseph@utstat.toronto.edu
A
Spectral Clustering Method for Microarray Data (poster
session)
Joint
work with David Tritchler and Shafagh
Fallah.
Cluster
analysis is a commonly used dimension reduction technique. We
introduce a clustering method computationally based on eigenanalysis.
Our focus is on large problems, and we present the method in
the context of clustering genes and arrays using microarray
expression data. The computational algorithm for the method
has complexity linear in the number of genes. We also introduce
a method for assessing the number of clusters exhibited in microarray
data based on the eigenvalues of a particular matrix.
Atul
Butte,
MD (Children's Hospital Informatics Program and Harvard Medical
School) atul.butte@TCH.Harvard.edu
Integrative
Genomics and its Implications for Clinical Research and Care:
What are the Real Issues Beyond Analysis? Slides:
pdf
Microarrays can provide systematic quantitative information
on the expression of thousands of unique RNAs, and have been
used to elucidate the transcriptional response in many basic
biological and clinically relevant experiments, ranging from
associative studies between therapeutics and expression, to
aiding in diagnostic questions, to discovery of novel subtypes
of disease.
Given
the over 6000 arrays with data publicly available and the surfeit
of microarray facilities, the rate-limiting step is no longer
the sample collection, hybridization, scanning, or even the
analysis. Instead, the new challenge is in taking findings,
such as the traditional "list of genes" resulting
from a microarray analysis, and ascertaining the meaning of
the results, such as the biological relationships between the
genes. However, tools that link these genes back to known biological
pathways, as well as discovering new pathways, are in their
infancy. Tools that automatically suggest the importance of
particular findings have yet to be invented.
During
this presentation, I will describe four packages we have made
freely available to the academic genomics community. I will
present examples of, and would like to discuss, five points: (1) not all pathways
will be reverse engineered using microarrays, (2) looking for
simultaneous gene associations ignores the fact that biology
takes time, (3) a discovered diagnostic model doesn't imply
the underlying molecular physiology, (4) due to rapidly changing
information about the genes already measured, one is never truly
finished analyzing a microarray dataset, and (5) the real bottleneck
in microarray analysis is not the analysis, but the interpretation
of the findings.
David
B. Dahl
(Statistics and Biostatistics & Medical Informatics, University
of Wisconsin - Madison) dbdahl@stat.wisc.edu
Modeling
Differential Gene Expression using a Dirichlet Process Mixture
Model (poster session)
The
literature has given considerable attention to the task of identifying
differentially-expressed genes using data from DNA microarrays.
This poster proposes a conjugate Dirichlet Process mixture model
which naturally incorporates any number of treatment conditions,
clusters genes based on their treatment effects and variance,
and readily makes general inference on the treatment effects
and variance. As a consequence of the model, probabilities of
co-regulation are available and there is no need to estimate
the correct number of clusters. Any number of hypotheses concerning
the parameters can be tested and false discovery rates are
easily computed. The proposed methods are applied to a dataset
of 10,043 genes measured at 10 treatment conditions with 3
replicates per treatment.
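One common way to obtain a false discovery rate from posterior output can be sketched as follows. The abstract does not state how its rates are computed, so this is an assumption: given each gene's posterior probability that the null hypothesis holds, the estimated FDR of a rejection list is the average of those probabilities over the rejected genes. The function and argument names are hypothetical.

```python
def bayesian_fdr(posterior_null_probs, threshold):
    """Estimated FDR when rejecting every gene whose posterior probability
    of the null hypothesis is at most `threshold` (illustrative sketch)."""
    rejected = [p for p in posterior_null_probs if p <= threshold]
    if not rejected:
        return 0.0  # nothing rejected, so no false discoveries
    # Expected proportion of true nulls among the rejected genes.
    return sum(rejected) / len(rejected)


fdr = bayesian_fdr([0.01, 0.03, 0.5, 0.9], threshold=0.05)
```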
Sandrine
Dudoit (Division of Biostatistics, University of
California, Berkeley) sandrine@stat.Berkeley.EDU
http://www.stat.berkeley.edu/~sandrine
Loss-based
Estimation Methodology with Cross-validation: Prediction of
Clinical Outcomes Using Microarray Data
Slides: pdf
We
propose a unified loss-based methodology for estimator construction,
selection, and performance assessment with cross-validation.
In this approach, the parameter of interest is defined as the
risk minimizer for a suitable loss function and candidate estimators
are generated using this (or possibly another) loss function.
Cross-validation is applied to select an optimal estimator among
the candidates and to assess the overall performance of the
resulting estimator. Finite sample and asymptotic optimality
results are derived for the cross-validation selector for general
data generating distributions, loss functions (possibly depending
on a nuisance parameter), and estimators. This general estimation
framework encompasses a number of problems which have traditionally
been treated separately in the statistical literature, including
multivariate outcome prediction and density estimation based
on censored data. Applications to genomic data analysis include
the prediction of biological and clinical outcomes (possibly
censored) using microarray gene expression measures, the identification
of regulatory motifs in DNA sequences, and genetic mapping with
single nucleotide polymorphisms (SNP). This talk will focus
on tree-based estimation of patient survival with microarray
expression measures.
Joint
work with: Mark van der Laan, Sunduz
Keles, and Annette Molinaro.
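The selection step described above can be illustrated with a small sketch: candidate estimators are compared by their cross-validated risk and the minimizer is chosen. This is a simplified illustration under stated assumptions (squared-error loss, uncensored data, two toy candidates), not the tree-based survival methodology of the talk. All function names are hypothetical.

```python
def fit_mean(xs, ys):
    """Candidate 1: predict the training mean regardless of x."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate 2: simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)


def cv_risk(fit, xs, ys, n_folds=5):
    """V-fold cross-validated risk under squared-error loss."""
    n = len(xs)
    total = 0.0
    for fold in range(n_folds):
        test_idx = set(range(fold, n, n_folds))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        predict = fit(train_x, train_y)
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in test_idx)
    return total / n


def cv_select(candidates, xs, ys, n_folds=5):
    """Return the name of the candidate with the smallest cross-validated risk."""
    risks = {name: cv_risk(fit, xs, ys, n_folds) for name, fit in candidates.items()}
    return min(risks, key=risks.get), risks


xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
best, risks = cv_select({"mean": fit_mean, "line": fit_line}, xs, ys)
```

With censored outcomes the loss would have to account for the nuisance parameter mentioned in the abstract; that part is deliberately left out here.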

Shmuel
Friedland
(Department of Mathematics, Statistics & Computer Science, University
of Illinois-Chicago. Currently joining the IMA New Directions
Visiting Professorship program.) friedlan@uic.edu
A
Theoretical Framework For Reconstructing Missing Data in Genome-Wide
Matrix (poster session)
This
is joint work with Amir Niknejad.
Over the last decade, molecular biologists have been using DNA microarrays (chips)
as a tool for analyzing information embedded in gene expression
data. During the laboratory process, some spots on the array
may be missed and the probing of some genes may fail. Making chips
to probe genes remains very costly.
There have been several attempts by molecular biologists, statisticians,
and computer scientists to recover the missing gene expressions
by ad hoc methods. Most recently, microarray gene expression data
have been formulated as a gene-array matrix. In this setting,
the analysis of missing gene expression on the array translates
to recovering missing entries in the gene expression matrix.
The most common methods for recovery are: (a) various clustering
methods, such as K-nearest neighbor clustering and hierarchical
clustering; (b) SVD (Singular Value Decomposition). In these
methods, the recovery of missing data is done independently,
i.e. the completion of each missing entry does not influence
the completion of other entries.
We suggest here a new method in which the completion of missing
entries is done simultaneously, i.e. the completion of one missing
entry influences the completion of other entries. Our method
is closely related to the methods and techniques for solving
inverse eigenvalue problems.
Wolfgang
Huber
(German Cancer Research Center, Division of Molecular Genome
Analysis) w.huber@dkfz-heidelberg.de
Interpretation
and Transformation of Microarray Data
Slides: html pdf ps
ppt
Data
from microarray experiments is often reported in the form of
logarithmic ratios or logarithm-transformed intensities. This
amounts to the assumption that an increase from, say, 1000 units
to 2000 units has the same biological significance as one from
10000 to 20000. While this approach is useful for large intensities,
it fails when the true level of expression of a gene in one
of the conditions is small or zero. However, these genes may
be biologically relevant, perhaps even the most relevant ones.
We
derive a measure of differential expression that has comparable
resolution across the whole dynamic range of expression. Mathematically,
it can be expressed in terms of a variance stabilizing transformation.
The measure coincides with the log-ratio in those cases where
the latter is well-defined, and is a meaningful extrapolation
in those cases where the log-ratio is unstable. The measure
is closely related to the standardized log-ratio ("moving-window
z-score"), but has preferable mathematical and computational
properties.
We
present a parametric statistical model that leads to a robust
estimator for the transformation parameters, as well as the
between-array normalization parameters. In applications to several
benchmark datasets, this approach compares favorably to other
normalization algorithms.
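The behavior described above can be illustrated with the inverse hyperbolic sine, a commonly used variance-stabilizing transformation. The abstract does not name the transformation, so its use here is an assumption, and the `scale` parameter is a hypothetical stand-in for the estimated transformation parameters. The sketch shows that the difference of transformed intensities agrees with the log-ratio at high intensities and stays finite when one intensity is zero.

```python
import math


def glog_difference(x1, x2, scale=1.0):
    """Difference of arsinh-transformed intensities (illustrative measure)."""
    return math.asinh(x1 / scale) - math.asinh(x2 / scale)


def log_ratio(x1, x2):
    """Ordinary natural-log ratio; undefined when an intensity is zero."""
    return math.log(x1) - math.log(x2)


# High intensities: the two measures nearly coincide,
# because asinh(x) is close to log(2x) for large x.
high = glog_difference(20000.0, 10000.0, scale=100.0)
reference = log_ratio(20000.0, 10000.0)

# One intensity equal to zero: the log-ratio is undefined,
# while the arsinh-based measure remains finite.
low = glog_difference(50.0, 0.0, scale=100.0)
```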
Rebecka
Jornsten
(Department of Statistics, Rutgers University) rebecka@stat.rutgers.edu
http://www.stat.rutgers.edu/~rebecka
Data
Depth Based Clustering and Classification (poster
session)
Clustering
and classification are important tasks for the analysis of microarray
gene expression data. Classification of tissue samples can be
a valuable diagnostic tool for diseases such as cancer. Clustering
samples or experiments may lead to the discovery of subclasses
of diseases. Clustering can also help identify groups of genes
that respond similarly to a set of experimental conditions.
In addition to these two tasks it is useful to have validation
tools for clustering and classification. Here we focus on the
identification of outliers - units that may have been misallocated,
or mislabeled, or are not representative of the classes or clusters.
We present two new methods: DDclust and DDclass, for clustering
and classification. These non-parametric methods are based on
the intuitively simple concept of data depth. We apply the methods
to several gene expression and simulated data sets. We also
discuss a convenient visualization and validation tool - the
Relative Data Depth (ReD) plot.
Christina
Kendziorski (Department of Biostatistics and Medical
Informatics, University of Wisconsin, Madison) kendzior@biostat.wisc.edu
Hidden
Markov Models for Microarray Time Course Data in Multiple Biological
Conditions
Slides: html
pdf
ps
ppt
Among
the first microarray experiments were those measuring expression
over time, and time course experiments remain common. Most methods
to analyze time course data attempt to group genes sharing similar
temporal profiles within a single biological condition. However,
with time course data in multiple conditions, a main goal is
to identify differential expression patterns over time. I will
present a Hidden Markov modeling approach designed specifically
to address this question. Simulation studies show a substantial
increase in sensitivity without an increase in the false discovery
rate when compared to a marginal analysis at each time point.
Results from three case studies will be discussed.
This
is joint work with Ming Yuan, a
graduate student in the Department of Statistics, University
of Wisconsin (see the poster session for additional details).
Kathleen
Kerr
(Department of Biostatistics, University of Washington) katiek@u.washington.edu
Empirical
Evaluation of Methodologies for Microarray Data Analysis, With
Some Thoughts on Statistical Implications Slides:
pdf
This
talk will present recent results of empirical tests of various
methodologies of microarray data analysis. The data are spike-in
assays produced as part of a "standardization" experiment performed
by the Toxicogenomics Research Consortium. Findings support
the use of an intensity-based normalization procedure and provide
strong evidence that the practice of local background subtraction
is detrimental. Statistically, the most interesting findings
pertain to the relative ability of various methodologies for
detecting differentially expressed genes. The speaker will present
these results along with some opinions on directions of research
to advance statistical methodology for microarray data analysis.
Boris
Khots and Dmitriy Khots
(Iowa, USA) bkhots@cccglobal.com,
dkhots@blue.weeg.uiowa.edu
Why
Infinite-dimensional Topological Groups may Work for Genetics
Data (poster session)
Slides: html
pdf
ps
ppt
Infinite-dimensional P-spaces, and the infinite-dimensional P-groups
and P-algebras connected with them, may play a significant role in
mathematical modeling and algorithms for processing genetic data
(for example, gene expression data)
(B.S. Khots, Groups of local analytical homeomorphisms of line
and P-groups, Russian Math Surveys, v.XXXIII, 3 (201), Moscow,
London, 1978, 189-190). Investigation of the topological-algebraic
properties of P-spaces, P-groups and P-algebras is connected
with the solution of the infinite-dimensional analogue of Hilbert's
fifth problem. In genetic data processing, the use of the topological-algebraic
properties of P-spaces, P-groups and P-algebras may make it possible to
find "Gene functionality". We applied these methods to Yeast
Rosetta and Lee-Hood gene expression data and to leukemia ALL-AML
gene expression data, and found sets of Gene-Gene and
Gene-Trait dependencies. In particular, the accuracy of leukemia
diagnosis is 0.97. On the other hand, genetics requires the solution
of new mathematical problems. For example, what are the topological-algebraic
properties of a P-group (subgroups, normal subgroups, normal
series, P-algebras, subalgebras, ideals, etc.) that is finitely
generated by local homeomorphisms of some manifold onto itself?
Pim
(W.W.) Kuurman
(Animal Sciences Group, Wageningen UR, P.O. Box 65, 8200 AB
Lelystad, The Netherlands) Pim.Kuurman@wur.nl
Procedure
for Standardisation and Normalisation of cDNA Microarrays
(poster session)
Poster
file:
pool.pdf
Talk handout: EAAPpresentatie.pdf
EAAPpresentatie.doc
Joint
work with M.H. Pool, B.
Hulsegge, L.L.G Janss, J.M.J.
Rebel, and S. van Hemert.
Expression
levels for large numbers of genes under different conditions
can be measured by using microarrays. In livestock species, cDNA arrays
are often used for this purpose, because the complete
genome sequences are not yet available to engineer oligo-arrays,
and use of cDNA arrays allows the direct use of available cDNA
libraries. However, cDNA-arrays exhibit larger variability than
oligo arrays and therefore require more care in order to reduce
noise, standardise and normalise the data, and require some
different statistical approaches for analysis because two samples
are measured on the same slide, unlike in oligo-array technology.
This poster describes procedures developed to treat such data
consisting of: (1) correction for background using special blank
spots; (2) automatic outlier treatment using iteratively reweighted
analysis to allow for a robust fit, similar to using medians;
(3) a lowess fit to allow for dye-bias on the ratios with varying
intensity; (4) a procedure to identify poor duplicated values
(1 duplicate is made within slide) fitting a heterogeneous variance
contour to allow for increasing repeatability with increasing
intensity; (5) fitting of a heterogeneous variance contour for
sample values to allow for decreasing variance with increasing
intensity, used to provide weights for a weighted analysis.
The procedure is illustrated on a data set showing differences
in gene expression levels between malabsorption syndrome infected
and control chickens.
Hongzhe
Li
(Rowe Program in Human Genetics, UC Davis School of Medicine)
hli@ucdavis.edu
Microarray
Time Course Gene Expression Studies: Some Problems and Statistical
Methods
Slides: pdf
Since
many biological systems and processes in human health and diseases
are dynamic systems, genome-wide gene expression levels measured
over time can often provide more insights into such systems.
Important examples include developmental process, cell cycle
process and regulation of circadian rhythm. The noisy nature
of microarray data and the potential dependency of the gene
expression measurements over time make analysis of such microarray
time course (MTC) gene expression data challenging. In this
talk, I will present some problems and statistical methods for
analyzing such MTC gene expression data. Some details will be
given on the methods of identifying genes with different time
course expression profiles and the methods of identifying periodically
regulated genes.
Wentian
Li (The Robert S Boas Center for Genomics and Human
Genetics, North Shore LIJ Research Institute, USA) wli@watson.nslij-genetics.org
http://www.nslij-genetics.org/wli
Extreme-Value
Distribution Based Gene Selection Criteria for Discriminant
Microarray Data Analysis Using Logistic Regression (poster
session)
Joint
work with Fengzhu Sun (Department
of Biological Sciences, Molecular and Computational Biology
Program, University of Southern California, USA) and Ivo
Grosse (Bioinformatics Center Gatersleben-Halle, Institute
for Plant Genetics and Crop Plant Research, Germany).
We
present a calculation of the expected maximum-likelihood and
the p-value for the top gene selected by the logistic regression.
This calculation is based on the maximum likelihood of the null
model and the extreme value distribution of chi-square variables.
Based on this calculation, we propose two corresponding gene
selection criteria: the E-criterion and the P-criterion. In
the E-criterion, a gene is selected if its maximum-likelihood
is greater than that of the top gene under the null model. In
the P-criterion, a gene is selected if its p-value according
to the null distribution of the top gene is smaller than
a pre-determined value. Both gene selection criteria are conservative
because non-top-ranked genes are judged by the expected value
of the top gene. As a result, a much more compact set of genes
is selected.
References:
[1]
W Li, I Grosse (2003), "Gene selection criterion for discriminant
microarray data analysis based on extreme value distributions",
in RECOMB03: Proceedings of the Seventh Annual International
Conference on Computational Biology, pp. 217-223 (ACM Press).
[2]
W Li, F Sun, I Grosse (2003), "Extreme-value distribution based
gene selection criteria for discriminant microarray data analysis
using logistic regression", submitted to Journal of Computational
Biology.
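The idea behind the P-criterion can be sketched numerically. The sketch assumes that each gene's likelihood-ratio statistic follows a chi-square distribution with one degree of freedom under the null model and that genes are independent; both are simplifying assumptions for illustration, and the function names are hypothetical. Under these assumptions, the probability that the top statistic among N genes exceeds x is 1 - (1 - p)^N, where p is the single-gene tail probability.

```python
import math


def chi2_1df_sf(x):
    """Tail probability P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def top_gene_p_value(statistic, n_genes):
    """P-value of `statistic` judged against the maximum of `n_genes`
    independent chi-square(1) statistics (illustrative sketch)."""
    p_single = chi2_1df_sf(statistic)
    if p_single >= 1.0:
        return 1.0
    # 1 - (1 - p)^n, computed in a numerically stable way for small p.
    return -math.expm1(n_genes * math.log1p(-p_single))


single = top_gene_p_value(3.841458820694124, n_genes=1)    # about 0.05
many = top_gene_p_value(3.841458820694124, n_genes=1000)   # close to 1
```

A statistic that looks significant for a single gene is unremarkable as the maximum over a thousand genes, which is why judging every gene against the top gene's null distribution yields a compact selection.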
Adriana
Lopez
(Department of Statistics, University of Pittsburgh, Pittsburgh,
PA) adl5+@pitt.edu
Cancer
Tumor Classification Using Gene Expression Data (poster
session)
At
the end of the 1990s, biotechnologies such as microarrays were
developed, and their use in cancer research has increased
because they can lead to a more precise and reliable classification
of cancer tumors. This research was concerned with discriminant
analysis, or classification, of cancer tumors into previously known
classes using gene expression data from microarrays, based on
kernel density estimation and on combinations of classifiers built
with this methodology. This technique was compared to other well-known
discriminant analysis techniques using the misclassification
proportion, estimated from training and test sets that were either fixed or
obtained by the 2:1 sampling scheme. The fixed kernel classifiers
and the adaptive kernel classifiers performed equally well
on the three data sets that were studied and, generally,
the kernel classifier was the best nonparametric classifier.
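A fixed-bandwidth kernel classifier of the kind compared above can be sketched in a few lines. This is a minimal one-dimensional illustration with a Gaussian kernel; the bandwidth value, the toy data and the function names are assumptions, and real expression data would require multivariate kernels and gene selection.

```python
import math


def gaussian_kde(train, x, bandwidth):
    """Fixed-bandwidth Gaussian kernel density estimate at x."""
    norm = len(train) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in train) / norm


def kernel_classify(x, samples_by_class, bandwidth=1.0):
    """Assign x to the class with the largest prior-weighted density estimate."""
    total = sum(len(values) for values in samples_by_class.values())
    scores = {
        label: (len(values) / total) * gaussian_kde(values, x, bandwidth)
        for label, values in samples_by_class.items()
    }
    return max(scores, key=scores.get)


# Toy expression values for one gene in two hypothetical tumor classes.
training = {"A": [1.0, 1.2, 0.8], "B": [3.0, 3.1, 2.9]}
label_low = kernel_classify(1.1, training, bandwidth=0.5)
label_high = kernel_classify(2.8, training, bandwidth=0.5)
```

An adaptive kernel classifier would let the bandwidth vary with the local density of training points instead of keeping it fixed.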
Geoff
McLachlan (Department of Mathematics and the Institute
of Molecular Bioscience, University of Queensland) gjm@maths.uq.edu.au
Classification of Microarray Gene-Expression Data
Slides: html
pdf
ps
ppt
In
the context of cancer diagnosis and treatment, we consider the
problem of classifying a relatively small number of tumour tissue
samples containing the expression data on very many (possibly
thousands of) genes from microarray experiments. For the supervised
problem where there are tumour samples of known classification,
we discuss the need to correct for the selection bias in assessing
the error rate of a prediction rule formed from a small subset
of selected genes. We also consider the unsupervised problem
where the aim is to cluster the tumour samples on the basis
of the gene expressions. The associated problem of assessing
the number of clusters is addressed. Attention is concentrated
on the mixture model-based approach called EMMIX-GENE. Its performance
is demonstrated on various microarray data sets available in
the bioinformatics literature.
Peter
J. Munson
(Mathematical and Statistical Computing Laboratory, DCB, CIT,
NIH, DHHS) munson@helix.nih.gov
Mining
a Gene Expression Database
The
now widespread interest in gene expression is motivated by the
promise of important new findings in the context of disease
or basic biology research. Because of cost constraints, most
designed studies are relatively small, involving from 2 to 100
chips. Pooling results of studies allows one to compare expression
across potentially 1000s of conditions, with greater promise
of additional insights. The NIHLIMS database houses data from
about 30 ongoing studies at NIH comprising about 1500 Affymetrix
chips, and provides a platform for testing data mining approaches.
Serious
data comparability challenges are encountered here, some of
which can be addressed with appropriate data normalization.
We investigate factors which distinguish patterns of expression.
In addition to many technical factors, the cell or tissue type
from which mRNA is prepared seems to be a primary source of
variability. As a consequence, tissue specific genes can be
identified by this approach. Limited demographic information
may be available permitting, for example, the determination
of gender-specific gene expression patterns. In one particular
study, the identification of tissue specific genes in human
was compared to tissue specific genes in rodent for the homologous
tissue, allowing for an evolutionary comparison of the relevant
expression mechanisms.
We
discuss several of the statistical techniques needed to compare
data across studies, and present a list of challenges now facing
data miners.
Ann
L. Oberg (Department of Health Sciences Research,
Division of Biostatistics, The Mayo Clinic ) Oberg.Ann@mayo.edu
http://www.mayo.edu/hsr/people/oberg.html
Joint Estimation of Calibration and Expression for High-density
Oligonucleotide Arrays (poster session)
Joint
work with Karla V. Ballman, Douglas
W. Mahoney, and Terry M. Therneau.
There
is an increasing awareness that the analysis of high-density
oligonucleotide arrays is better modeled as a holistic rather
than a piecemeal process. Affymetrix software summarizes each
chip (including scaling, background subtraction, and removal
of outliers) separately, with the results of that summarization
"passed forward" to the next stage of analysis. Li
(2001) introduced a "model-based" analysis, where
all chips for a given experimental condition were fit in a single
model, giving a more complete and accurate picture of both data
errors and the fit. Chu (2002) recently extended this idea,
using a random-effects model to encompass all chips in an experiment
at once. For all of these, however, normalization of the data
is done as a separate prior process. We propose a method that
integrates the normalization, visualized as chip specific calibration
curves based on differential binding characteristics, along
with model fitting incorporating experimental design in a unified
algorithm. The ability to incorporate experimental design into
both the normalization process and the fit leads to more efficient
and less biased estimates of the tissue gene expressions. Affycomp
results will be presented.
Michael
Ochs (Fox Chase Cancer Center, Philadelphia, PA)
m_ochs@fccc.edu
Encoding
Prior Biological Knowledge in Functional Genomics Analysis
Slides: html
pdf
ps
ppt
Cancer
is a leading cause of death throughout the world. The fundamental
cellular biology underlying the development of cancer is extremely
complex, since cancer arises from a myriad of different cellular
malfunctions. It is clear, however, that cellular signaling
pathways that control cell growth, differentiation, apoptosis,
and motility play a critical role in many cancers. New technologies
such as microarrays and protein arrays offer the possibility
of elucidating key pathways involved in cancer and of monitoring
the effect of targeted therapeutics on those pathways. However,
because of the limited nature of our knowledge of signaling
pathways in humans and high noise levels in the data, difficulties
arise during analysis. The inclusion of prior knowledge can
enhance probabilistic reasoning in such a case. Analysis of
functional genomics data is especially suitable for the inclusion
of prior information, since a vast framework of biological knowledge
exists.
Bayesian
Decomposition is a Markov chain Monte Carlo method that uses
Bayesian statistics to encode prior knowledge. The inclusion
of biological information both during the analysis and when
interpreting patterns identified in the data has greatly increased
the power of the algorithm. This is demonstrated with three
separate data sets. First, the recovery of a pattern related
to the yeast mating pathway is accomplished by use of annotations
from the Yeast Proteome Database. Second, tissue identification
in Black6 mice is used to isolate tissue specific expression
patterns that can be interpreted using gene ontology. Third,
links between genes known to be coregulated in yeast are used
to demonstrate the effect of such prior knowledge on the analysis.
John
Quackenbush
(Department of Mammalian Genomics, The Institute for Genomic
Research (TIGR)) johnq@tigr.org
Beyond
Significance: Integrating Diverse Data Types to Extract Biological
Meaning from Microarrays
Slides:
pdf
Microarray expression analysis has rapidly become a mainstay
in functional genomics laboratories. With the rapid expansion
of this field have come equally rapid advances in statistical
analysis methodologies that have revolutionized the way we design
experiments and analyze data. However, even the best designed,
conducted, and analyzed experiments yield, at best, statistically
significant lists of genes. The scientific challenge we now
face is placing these genes into a broader biological context
through the use of diverse ancillary information. I will present
an overview of the problem with some examples of how we have
integrated diverse data types to add biological meaning to expression
measures.
Marco
F. Ramoni
(Assistant Professor of Pediatrics, Medicine, Oral Medicine,
Infection and Immunity, Harvard Medical School, Boston, MA 02115)
marco_ramoni@harvard.edu
http://chip.tch.harvard.edu/people/marco
Bayesian
Methods for Microarray Data Analysis Slides:
pdf
Data
produced by microarray experiments - measuring thousands of
genes with limited replicates - present unparalleled opportunities
to understand the global behavior of the genome and unprecedented
analytical challenges. This talk will introduce a general Bayesian
framework able to provide coherent solutions to some critical problems
of microarray data analysis and open new, unexplored avenues
of discovery. The talk will start by describing a Bayesian approach
to the analysis of comparative experiments, able to deliver
high sensitivity and superior reproducibility. It will then
describe a Bayesian solution to clustering gene expression data
and it will introduce a principled probabilistic criterion to
automatically identify the optimal number of clusters underlying
a set of microarray experiments. It will also show how this
clustering method can be naturally extended to profile the temporal
behavior of gene expression dynamics. Finally, the talk will
take this Bayesian framework one step forward, and show how
it can be used to dissect the regulatory mechanisms of gene
expression using a new class of Bayesian networks, called Generalized
Gamma Networks, specifically designed to handle the peculiar
distributional nature of microarray data and the non-linearity
of gene expression control.
Lídia
Rejtö (Statistics
Program, University of Delaware, 214 Townsend Hall, Newark,
DE 19717-1303, USA) rejto@udel.edu
Bayesian
Analysis of Microarrays (poster session)
Microarray
technology enables the assessment of expression patterns of
thousands of genes over time and under multiple conditions.
The analysis of these patterns requires detecting whether observed
differences in expression levels are significant or not. To
perform the analysis, one must first normalize the data. Here
we present a stochastic model offering a method to normalize
the data and to detect differentially expressed genes. The model
is appropriate to deal with more than two experimental conditions
or time series experiments.
We
construct a model to describe the stochastic relationship between
the real and the measured gene-expression levels. We introduce
a Bayesian component, which assumes that there is a prior probability
for the event that the real expression levels are different
under different conditions. The prior probability of the Bayesian
component is estimated, together with the other model parameters,
by the maximum-likelihood method. Given the estimated
model parameters, we estimate the real gene-expression levels
as conditional expectations. Furthermore, for each gene the
posterior probability of differential expression is given. We
estimated the variances of the estimates of the model parameters
with the help of bootstrapping. The fitted parametric model
was validated by verification of differential gene expression
with real-time quantitative RT-PCR (qRT-PCR) analysis. The comparison
shows that the stochastic model is adequate in identifying differentially
expressed genes on microarrays.
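As a rough illustration of how a posterior probability of differential expression can be computed once a prior probability and the model parameters have been estimated, the sketch below uses a simple two-component normal mixture on a gene's log-ratio. This mixture is an assumption made for illustration only and is not the stochastic model of the abstract or of the BAM software; the function and parameter names (posterior_de, prior_de, noise_sd, effect_sd) are hypothetical.

```python
import math


def normal_pdf(x, sd):
    """Density at x of a zero-mean normal distribution with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior_de(log_ratio, prior_de, noise_sd, effect_sd):
    """Posterior probability that a gene is differentially expressed.

    Hypothetical two-component model (an assumption, not the BAM model):
      not differential: log_ratio ~ N(0, noise_sd^2)
      differential:     log_ratio ~ N(0, noise_sd^2 + effect_sd^2)
    """
    f_null = normal_pdf(log_ratio, noise_sd)
    f_alt = normal_pdf(log_ratio, math.sqrt(noise_sd ** 2 + effect_sd ** 2))
    numerator = prior_de * f_alt
    # Bayes' rule: weight each component's density by its prior probability.
    return numerator / (numerator + (1.0 - prior_de) * f_null)
```

Under these assumptions a log-ratio near zero gives a posterior below the prior, while a log-ratio far larger than the noise level gives a posterior close to one.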
The
software BAM (Bayesian Analysis of Microarrays) is available
online at http://udgenome.ags.udel.edu/~cogburn/Gene_Expression_Studies.htm
or by contacting bukszar@eecis.udel.edu.
Joint
with Gábor Tusnády,1
József Bukszár,2
Guang Gao2 and Larry
Cogburn3.
1
Alfréd Rényi Mathematical Institute of the Hungarian
Academy of Sciences, Budapest, P.O. Box 127, H-1364, Hungary
2 Delaware Biotechnology Institute, 15 Innovation
Way, Newark, 19711, USA
3 University of Delaware, Department of Animal and
Food Sciences, Newark, DE 19717, USA
David
M. Rocke (Department of Applied Science (College
of Engineering), Division of Biostatistics (School of Medicine),
and Center for Image Processing and Integrated Computing, University
of California, Davis) dmrocke@ucdavis.edu
http://www.cipic.ucdavis.edu/~dmrocke
Measurement
Errors and Data Transformation for Gene Expression Data, Proteomics
and Metabolomics Data
Slides: pdf
Gene
expression microarrays comprise a suite of related technologies
for measuring the expression of thousands of genes simultaneously
from a single biological sample. There are also numerous other
high-throughput biological assays that can measure large numbers
of proteins, lipids, and other biologically active compounds.
In this talk, I will describe an important statistical challenge
in the use of such data. Using raw data, logarithms, or ratios,
the variability of the measurements is strongly dependent on
the level of expression, causing a failure of the assumptions
of most standard methods of statistical analysis. We present
a solution to this problem via a specially tuned data transformation
and show how it promotes the effectiveness of simple and sophisticated
analyses of the data.
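The abstract does not name the transformation. One transformation of this general kind is the generalized logarithm, which behaves like an ordinary logarithm at high expression levels and is roughly linear near zero. The sketch below shows it under that assumption; the tuning constant c is hypothetical and would in practice be estimated from the data.

```python
import math


def glog(y, c):
    """Generalized logarithm log(y + sqrt(y^2 + c)), with tuning constant c > 0.

    Unlike log(y), it is defined for zero and negative background-corrected
    values, and for large y it approaches log(2 * y).
    """
    return math.log(y + math.sqrt(y * y + c))
```

With c = 1 this coincides with the inverse hyperbolic sine, which gives a convenient check of the formula.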
Hae-Hiang
Song (Department of Biostatistics, The Catholic University
of Korea, Seoul 137-701, Korea ) hhsong@catholic.ac.kr
Statistical
Inference Methods for Detecting Altered Gene Associations
(poster session)
Joint
work with Sang-Heon Yoon and Je-Suk
Kim.
In
many gene expression studies, the assumption is that knowledge
of where and when a gene is expressed carries important information
about what the gene does. We consider the problem of understanding
the gene functions with microarray expression data of histological
progressive grades, starting from dysplastic nodule in cirrhotic
liver to hepatocellular carcinoma Edmonson grade III. The statistical
procedures are divided into two parts. First, the microarray data
are suitably normalized, including by a method based on analysis of variance
(ANOVA). Opinions differ widely on the currently used
normalization methods, so before proceeding to the second part,
the statistical analysis of gene-pair associations, these normalization
methods must first be compared. Assuming that
the union of the significant genes from these normalization methods
includes sufficiently general and well-defined differentially
expressed genes, the second part searches for
evidence of altered gene-gene relationships with progression
of disease. Significantly altered gene-pair associations
are identified with the ratio of gene-pair correlations. When
we use the phrase "difference between normal and tumor expression
patterns" in a broad sense, it covers not only the information
summarized by the first moment, the average expression level,
but also changes in correlation between two stages, and this
kind of exploration extends to higher-order moments. The need
to study association changes naturally arises when analyzing
gene expression levels of multiple arrays obtained in different
stages of progression. We identify altered gene-gene relationships
with replicated microarray expression data.
Keywords:
oligonucleotide array, normalization, correlation ratio statistic,
hepatic nodular lesions
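To make the idea of comparing gene-pair correlations across stages concrete, here is a minimal sketch that computes the Pearson correlation of one gene pair in each of two stages and then their ratio. The exact form of the authors' correlation ratio statistic and its null distribution are not given in the abstract, so this is only an assumed illustration; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of expression values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gene_pair_correlation_ratio(a_stage1, b_stage1, a_stage2, b_stage2):
    """Correlations of genes a and b in each stage, and the stage2/stage1 ratio.

    The ratio is undefined when the stage-1 correlation is exactly zero.
    """
    r1 = pearson(a_stage1, b_stage1)
    r2 = pearson(a_stage2, b_stage2)
    return r1, r2, r2 / r1
```

A ratio far from one, in either direction or sign, flags a gene pair whose association differs between the two stages; judging significance would need replicated arrays and a reference distribution, which this sketch does not supply.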
Terry
Speed (WEHI, Melbourne and UC Berkeley) terry@wehi.edu.au
Mining
a Tandem Mass Spectrometry Database to Determine the Trends
and Global Factors Influencing Peptide Fragmentation
Slides: html
pdf
ps
ppt
Statistical
and non-statistical methods have been used to analyse
the gas-phase fragmentation behavior of protonated peptides
by mining a database of several thousand unique product
ion spectra derived from tryptic digestion and low-energy collision-induced
dissociation in a quadrupole ion trap mass spectrometer.
This bioinformatic approach has resulted in the derivation of
a "relative proton mobility scale" that takes into account both
the charge state and the amino acid composition of a peptide,
and provides an effective classification system for categorizing
peptide MS/MS spectra for subsequent data mining and statistical
analysis. We show that the most important factor influencing
fragmentation is proton mobility and that peptides classified
as non-mobile generally give scores below currently acceptable
thresholds using current MS/MS search algorithms. An amino acid
residue's preference for N- and/or C-terminal cleavage has been
quantified in accordance with the proton mobility scale and
the trends determined are predictable based on an analysis of
the most abundant cleavage sites. (Joint work with Eugene
A. Kapp, Frédéric Schütz
and Richard J. Simpson)
Mahlet
G. Tadesse (Department of Statistics, Texas A&M University)
mtadesse@stat.tamu.edu http://www.stat.tamu.edu/~mtadesse
A
Bayesian Method for Class Discovery and Gene Selection (poster
session)
Joint
work with Naijun Sha and Marina
Vannucci.
The
analysis of the high-dimensional data (p >> n)
generated by DNA microarrays poses a challenge to standard
statistical methods. This has revived a strong interest in clustering
algorithms. A typical goal in these analyses is the discovery
of new classes of disease and the identification of relevant
genes. Currently, investigators first resort to data filtering
procedures or dimension reduction techniques before clustering
the data. In addition, the clustering algorithms that are widely
used do not provide an objective way to assess the number of
classes. We propose a Bayesian method, which simultaneously
identifies the number of clusters in the data and selects genes
that best discriminate the different groups.
Terry
M. Therneau (Division of Biostatistics, Mayo Clinic)
therneau@mayo.edu
Joint
Calibration and Fitting of Microarray Data
Slides:
pdf
ps
Figure: description
pdf
ps
Joint
work with Karla Ballman and Ann
Oberg.
In biological assays it is common to have a "logistic" shaped
dose-response curve, where the horizontal axis is the true
level of the material we are trying to measure, and the vertical
axis is the value derived from the assay. In ELISA assays, it
is common to put known controls in several of the wells to estimate
the calibration curve for a given plate directly. The analysis
issues have long been known as well; see for instance DJ Finney's
tutorial paper on radioligand assay (Biometrics, 1976). The
non-linearity is most severe when an assay spans a wide range;
with values from 20 to 20,000 we would expect microarrays to
be particularly affected. Plots of log(dose) vs log(response)
from the Affymetrix and Gene Logic spike-in data sets show
precisely this shape, completely in agreement with Finney's
observations.
... appropriate
normalization would be clear. We fit models that alternate between
estimation of the true level for each probe, using a linear
model incorporating the experimental design of the study, estimation
of the per-chip calibration curves from a plot of "true"
vs observed, normalization of the data based on the calibration
curves, refit of the linear model, etc. When the linear model
is particularly simple, containing only an intercept per probe,
this turns out to be equivalent to the cyclic loess method of
normalization (but computationally much faster).
The exciting aspect of this formulation is that it gives a framework
in which other aspects of the array can be incorporated, e.g.,
joint use of the PM and MM probes, or biochemical data on predicted
background binding affinity.
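The alternating fit described above can be sketched roughly as follows for the simplest case, an intercept-only model with one "true" level per probe. As a loudly stated simplification, each per-chip calibration curve is a straight line fitted by least squares rather than the smooth, logistic-shaped curve the abstract has in mind, and the function names (fit_line, joint_calibrate) are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for regressing y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def joint_calibrate(chips, n_iter=5):
    """Alternate between estimating probe levels and per-chip calibration.

    chips: list of per-chip lists of log intensities, all in the same probe order.
    Returns the normalized chips and the estimated "true" level of each probe.
    """
    normalized = [list(chip) for chip in chips]
    n_probes = len(chips[0])
    for _ in range(n_iter):
        # Linear model with only an intercept per probe: average across chips.
        true_level = [
            sum(chip[j] for chip in normalized) / len(normalized)
            for j in range(n_probes)
        ]
        # Per-chip calibration line of observed vs "true", then invert it.
        normalized = []
        for chip in chips:
            intercept, slope = fit_line(true_level, chip)
            normalized.append([(value - intercept) / slope for value in chip])
    return normalized, true_level
```

When every chip is an exact affine distortion of a common signal, this sketch maps all chips onto the same values; a fuller linear model or a nonlinear calibration curve would replace the averaging step and fit_line respectively.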
Achim
Tresch (Department of Bioinformatics, Fraunhofer
Institute for Algorithms and Scientific Computing (SCAI), Schloss
Birlinghoven 53754 Sankt Augustin Germany) gieger@scai.fhg.de
http://www.scai.fraunhofer.de/profil/mitarbeiter/gieger.html
Using
Text Mining Networks for the Context Specific Interpretation
of Expression Data (poster session)
Joint
work with Christian Gieger, Daniel
Hanisch, Juliane Fluck,
Hartwig Deneke (Fraunhofer-Institute
SCAI, Sankt Augustin, Germany), Tobias
Mittelstädt, and Albert Becker
(Institute for Neuropathology, University Hospital, Bonn, Germany).
Gene
expression data are most often analysed without utilizing biomedical
a-priori knowledge. The inclusion of metabolic, regulatory or
protein-protein interaction networks into the analysis process
itself provides a way to put results of expression experiments
into a biological context. Unfortunately, network information
stored in databases is oftentimes incomplete or not specific
enough with respect to certain species or cell types. For this
reason, we developed text mining methods for the construction
of interaction networks based on biomedical free text. These
methods were applied to the complete set of MEDLINE abstracts
and resulted in a substantial network of protein relations.
In this method, we used an automatically generated and curated
gene/protein dictionary together with a biomedical grammar which
defines rules to extract concepts describing relevant relations
between genes/proteins and other biological entities. The resulting
text mining network can be used for explorative data analysis
by mapping the results of gene expression experiments onto the
network. For this purpose, the ToPNet application was developed.
Besides its visualization capabilities, it is able to identify
sub-networks relevant according to observed expression patterns
by applying a new method called Significant Area Search. Our
approach was successfully applied to data from two sets of gene
expression experiments in the context of epilepsy and brain
cancer research.
Mark
van der Laan (Division of Biostatistics, School of
Public Health, University of California, Berkeley) laan@stat.Berkeley.EDU
Prediction
of Survival Slides:
pdf
We
propose a unified method for cross-validation which also applies
to censored data, and propose a new deletion/substitution/addition
algorithm for nonparametric multivariate regression. This combination
provides us with a new black-box algorithm for multivariate
regression on censored and uncensored outcomes. We show that
the cross-validation selection procedure satisfies an oracle
property in the sense that it performs asymptotically as well
as the best possible selector when given the true data generating
distribution. We also provide the finite sample properties of
this procedure. In addition, we study the properties of the
deletion/substitution/addition algorithm in simulations. We
apply the method to detect binding sites in yeast gene expression
experiments, and predict survival in cancer data sets.
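For readers unfamiliar with cross-validation as a selector among candidate estimators, the sketch below picks the candidate with the smallest cross-validated squared-error risk. It covers uncensored outcomes only and uses plain V-fold splitting; it does not implement the censored-data extension or the deletion/substitution/addition algorithm from the abstract. All function names are hypothetical.

```python
def cv_risk(fit, xs, ys, n_folds=5):
    """Average squared prediction error of `fit` over n_folds validation folds."""
    total = 0.0
    for k in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != k]
        valid = [i for i in range(len(xs)) if i % n_folds == k]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in valid)
    return total / len(xs)


def cv_select(candidates, xs, ys, n_folds=5):
    """Return the name of the lowest-risk candidate and all estimated risks."""
    risks = {name: cv_risk(fit, xs, ys, n_folds) for name, fit in candidates.items()}
    return min(risks, key=risks.get), risks


def fit_constant(xs, ys):
    """Candidate 1: predict the training mean regardless of x."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_straight_line(xs, ys):
    """Candidate 2: simple least-squares regression of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / sum(
        (a - mean_x) ** 2 for a in xs
    )
    return lambda x: mean_y + slope * (x - mean_x)
```

A call such as cv_select({"constant": fit_constant, "line": fit_straight_line}, xs, ys) returns the selected candidate's name together with the estimated risks.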
Yee
Hwa (Jean) Yang (Division of Biostatistics, University
of California, San Francisco) jean@biostat.ucsf.edu
Comparing
Normalization Methods Based on Splice Array Experiments
Slides: pdf
There
are many sources of systematic variation in microarray experiments
that affect measured gene expression levels. Normalization is
the term used to describe the process of removing such variations.
In this talk, I will describe a set of experiments based on
splice-specific microarrays. These arrays provide a basis to
investigate the effect of mutations and other factors on splicing
events in the creation of mature mRNA. In particular, the design
of these arrays provides a platform for comparing the performance
of different normalization methods.
Kenny
Q. Ye (Department of Applied Mathematics and Statistics,
SUNY at Stony Brook, Stony Brook, New York, 11794-3600. (631)632-9344,
(631)632-8490(FAX)) kye@ams.sunysb.edu
Pooling
or not Pooling in Microarray Experiments - an Experimental Design
Point of View (poster session)
Joint
work with Anil Dhundale, Department
of Biomedical Engineering and Center for Biotechnology, SUNY
at Stony Brook, Stony Brook, New York 11794-2580, (631) 632-8521,
anil.dhundale@sunysb.edu
Microarray
experiments are often used to detect differences in gene expression
between two populations of cells; a test population versus a
control population. However, in many cases, such as individuals
in a population, the biological variability can present changes
that are irrelevant to the question of interest, and it then
becomes important to assay many individual samples to collect
statistically meaningful results. Unfortunately the cost of
performing some types of microarray experiments can be prohibitive.
A potentially effective but not well publicized alternative
is to pool individual RNA samples together for hybridization
on a single microarray. This method can dramatically reduce
the experimental costs while maintaining high power in detecting
the changes in expression levels that relate to the specific
question of interest. In this talk, we will discuss why this
technique works and the optimal design strategy for pooling.
This idea will also be illustrated by a synthetic experiment
and a real experiment that studies Afib (cardiac atrial fibrillation),
a serious health condition that affects
a large percentage of the population but whose mechanism remains
poorly understood.
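One common way to see why pooling can preserve power is a simple variance calculation. The sketch assumes, purely for illustration, that a pool's expression is the average of its individuals' expression and that biological and technical variation are independent and additive; the abstract does not state its model, and the function and parameter names are hypothetical.

```python
def variance_of_mean(bio_var, tech_var, n_arrays, pool_size):
    """Variance of the estimated mean expression of one gene in one population.

    bio_var:   variance between individuals (biological variability)
    tech_var:  variance added by a single array measurement
    n_arrays:  number of arrays hybridized
    pool_size: number of individual RNA samples pooled on each array
    """
    # Averaging pool_size individuals divides the biological variance by
    # pool_size; the technical variance is paid once per array.
    return (bio_var / pool_size + tech_var) / n_arrays
```

For example, with bio_var = 4 and tech_var = 1, twelve individuals on twelve separate arrays give a variance of about 0.42, while the same twelve individuals pooled three per array on four arrays give about 0.58, at one third of the array cost. Pooling is most attractive under this model when biological variability dominates technical variability.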
Ming
Yuan
(Department of Statistics, University of Wisconsin, Madison)
yuanm@stat.wisc.edu
Hidden
Markov Models for Microarray Time Course Data in Multiple Biological
Conditions (poster session)
Motivated
by several real applications, an approach is proposed to compare
expression profiles from different biological conditions over
time. It is based on a hidden Markov model (HMM) with states
corresponding to expression patterns across conditions. To investigate
properties of the proposed approach, we have implemented the
HMM assuming a parametric hierarchical mixture model for the
emissions, here intensities. As shown in simulation studies
comparing the HMM approach to one which simply overlooks the
correlation over time, both the sensitivity and the specificity
increase substantially without sacrificing the false discovery
rate. I will present a detailed analysis of the methodology
and its performance.
This
is joint work with Prof. Kendziorski
(see her talk on Thursday).
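As background on how an HMM accounts for correlation over time, the sketch below implements the standard forward recursion for the likelihood of one gene's time course. The states would correspond to expression patterns across conditions; the emission densities are passed in as numbers, so the parametric hierarchical mixture model of the abstract is not implemented here. The function name and argument layout are hypothetical.

```python
def forward_likelihood(initial, transition, emission_lik):
    """Likelihood of one gene's time course under a hidden Markov model.

    initial[s]:         probability of starting in state s
    transition[r][s]:   probability of moving from state r to state s
    emission_lik[t][s]: density of the observed data at time t given state s
    """
    n_states = len(initial)
    # alpha[s] = joint density of the data up to the current time and state s.
    alpha = [initial[s] * emission_lik[0][s] for s in range(n_states)]
    for t in range(1, len(emission_lik)):
        alpha = [
            emission_lik[t][s]
            * sum(alpha[r] * transition[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)
```

An analysis that ignores the correlation over time corresponds to a transition matrix whose rows are all identical, so that the state at each time point is drawn independently of the previous one.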
Statistical
Methods for Gene Expression: Microarrays and Proteomics,
September 29-October 3, 2003