IMA Thematic Year on Probability and Statistics in Complex Systems
Genomics, Networks, and Financial Engineering
September 2003 - June 2004



2003-2004 Annual Report (pdf)



The year is divided into three components:

Fall Quarter, September-December, 2003: Mathematical & Statistical Problems in Genome Sciences
Winter Quarter, January-March, 2004: Communication Networks
Spring Quarter, April-June, 2004: Quantitative Modeling in Finance and Econometrics

Organizing Committee:

Thomas G. Kurtz (Chair)  University of Wisconsin, Madison  kurtz@math.wisc.edu
Marco Avellaneda  Courant Institute, NYU  avellane@cims.nyu.edu
Bruce Hajek  University of Illinois, Urbana  b-hajek@uiuc.edu  http://www.uiuc.edu/~b-hajek/
Richard Karp  UC Berkeley  karp@icsi.berkeley.edu
Sallie Keller-McNulty  Los Alamos National Laboratory  sallie@lanl.gov
Andrew Lo  MIT  alo@mit.edu
Michael Newton  University of Wisconsin, Madison  newton@biostat.wisc.edu
Simon Tavaré  University of Southern California  stavare@gnome.usc.edu
Walter Willinger  AT&T Labs-Research  walter@research.att.com


Introduction

The proposed program is devoted to the application of probability and statistics to problems in three areas: the genome sciences, networks, and financial engineering. These application areas are all associated with complex systems, and strategies for system analysis will serve as an organizing principle for the program. (By complex systems we mean systems with a very large number of interacting parts such that the interactions are nonlinear in the sense that we cannot predict the behavior of the system simply by understanding the behavior of the component parts.) Furthermore, these areas share the common feature that they are systems for which a huge amount of data is available. Mathematical models developed for these systems must be informed by this data if they are to provide a basis for scientific understanding of the systems and for critical decision-making about them. The mathematical and statistical foundations of this program will include stochastic modeling and simulation, statistics, and massive data set analysis, as well as dynamical systems, network and graph theory, optimization, control, design of computer and physical experiments, and statistical visualization. The program will be particularly appropriate for probability/statistics postdocs and long-term participants with some background in at least one of the three major areas of application and an interest in developing the integration tools that will provide them with an entrée into modeling/data integration issues in the other areas. There will be extensive tutorials in the application areas.

The health of human populations and biosystems, information networks, and financial systems is fundamental to the success of modern civilization. A map of the human genome is nearly in hand, but we are just beginning to understand how to harness it. (For example, how does the one-dimensional information encoded in DNA lead to the immensely complicated three-dimensional structure of proteins, the protein folding problem?) Trying to understand the function of a single gene leads to a complicated set of analyses that requires the integration of stochastic and biological models with noisy, high-dimensional data coming from multiple sources. Integration of this information for all of the genes, through comparative and evolutionary genomics, is critical in determining the role of gene-related diseases in the human population, and in combating these diseases. A federation of 6000 autonomous networks, called the Internet, and wireless communications are on their way to providing anywhere, anytime multimedia communication. Interconnected power networks linking independent power producers with consumers in an environment with diminishing regulation raise many new questions regarding both the physical and economic operation of the electric power system. In few of these systems is there centralized control. How can we ensure that they work properly? In finance, the sequence of events triggered by the default of Russian government bonds in August 1998 demonstrates that the global financial system is an extraordinarily complex network of relations involving broker/dealers, banks, institutional investors, and other counterparties. The global volatility triggered by the default is a wake-up call to society on the importance of a deeper understanding and control of financial systems. In all these areas, issues such as network topology, the "degree of connectedness", computational complexity, and the probability of systemic failure are relevant, as is the capacity to sample a system and store large amounts of data. Furthermore, system constraints create complex dependencies amongst elements of the sampled data. For example, coordinated gene expression causes DNA chip measurements to exhibit strong positive dependence amongst genes in common biochemical pathways, communication traffic with many sources and destinations shares common bottleneck links, and serial dependence is clearly present in financial time-series data.

The mathematical sciences, and particularly probabilistic and statistical methods, are key to understanding the dependencies of these systems. Interacting stochastic systems and cellular automata, as well as dynamical systems and partial differential equations, are examples of mathematical structures directed at understanding how one part of a system influences other parts and how those influences propagate. Historically, limited computational power restricted the size and complexity of the systems that could be usefully modeled and, in many settings, limited data made it difficult or impossible to evaluate the appropriateness and accuracy of proposed models. An explosion in computational power and other technical advances that support collection of large amounts of data have radically altered this situation. Simultaneous with and in part because of these technological advances, new areas of application have emerged that require the understanding of systems whose size and complexity test the limits of even the most recent computational, mathematical, and statistical methodologies.

Understanding in the diverse areas of genomics, communication networks, and financial engineering will benefit from the broad view which we propose to adopt in the one-year IMA program -- a view based on the development and analysis of stochastic models and the implementation of the essential statistical analysis using advanced computational methods. To avoid stochastic modeling is to proceed at some peril. Pieces of a system might be considered in isolation using rudimentary data analysis, but this isolated approach may not provide the most efficient analysis and, more importantly, may not allow certain critical questions to be addressed. It is a basic premise of stochastic modeling that data are viewed as the realization of a stochastic process. An appropriate modeling framework allows inferences about unknown system elements or decisions about how to manage the system to be expressed in terms of the stochastic process. Indeed, if we can work out properties of the underlying stochastic process, we may come to a better understanding of the entire system. To do so requires sophisticated mathematical techniques. Computational algorithms are crucial for implementing the calculations suggested by the stochastic models. Advanced computer systems not only allow us to collect more data, but they also allow us to run much more sophisticated analyses than have been possible previously. Furthermore, statistical methods provide us not only with estimates or predictions of unknown quantities, but also with precise statements about our corresponding uncertainty.
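
As a deliberately tiny illustration of this premise, the sketch below treats observed event counts as one realization of a Poisson process with an unknown rate, estimates the rate, and reports an approximate confidence interval alongside it. The model, the data, and the rate are all invented for illustration and are not drawn from the program.

```python
# Minimal sketch (illustrative only): data viewed as one realization of a
# stochastic process.  Event counts are modeled as Poisson with unknown rate;
# we estimate the rate by maximum likelihood and attach an approximate 95%
# confidence interval -- an estimate together with a statement of uncertainty.
import numpy as np

rng = np.random.default_rng(0)
true_rate = 3.2                               # events per unit time (unknown in practice)
T = 50                                        # number of unit-time observation intervals
counts = rng.poisson(true_rate, size=T)       # one count per interval

rate_hat = counts.mean()                      # maximum likelihood estimate of the rate
se = np.sqrt(rate_hat / T)                    # approximate standard error
ci = (rate_hat - 1.96 * se, rate_hat + 1.96 * se)
print(f"estimated rate = {rate_hat:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```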

Participating postdocs will require and will acquire skills in probability and the mathematical analysis of stochastic models, in the development of appropriate stochastic models by careful study of subject matter, in statistical inference, and in computational methods such as optimization and Monte Carlo.

Fall Quarter (September-December 2003)

Mathematical & Statistical Problems in Genome Sciences

A working draft of the human genome was made publicly available in summer 2000, with a final sequence, erring less often than once per 10,000 bases, to follow within two years. More than twenty microbial genomes are already complete, and the sequencing of both plant and animal model organisms is well underway. Coupled with the availability of sequence data are technologies that enable us to measure the simultaneous gene expression pattern in a cell. Obtaining such a mass of data will mark the beginning of a period of exceptional knowledge discovery in biology. Eric Lander of the Whitehead Institute has likened the effect on biology of these new resources to the effect on chemistry of the periodic table. Having a global view, knowing all the genes, their function, their common alleles, and the biochemical pathways in which they participate will have a profound effect on science and medicine. Mathematics and statistics have the potential to have a larger impact in the processing and analysis of genome data than the clearly substantial effect they have had in the fields of molecular biology and genetics to date. (See Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology, Eric S. Lander and Michael S. Waterman, Editors; Committee on the Mathematical Sciences in Genome and Protein Structure Research, National Research Council, 1995.) The function of most genes is unknown, and stochastic modeling may improve the way inference from expression profiles works in the area of functional genomics. Stochastic models have long been used in evolutionary modeling, and new models and computational methods are needed to cope with whole genome comparisons in comparative genomics. Statistical methods are being refined in genetic mapping studies that now can in principle consider hundreds of thousands of markers in an attempt to find genes affecting complex diseases. The problem of inferring protein structure is long-standing and continues to demand the most sophisticated mathematical, statistical, and computational approaches. There are undoubtedly many more mathematical and statistical problems that will arise from the genome sciences.

The purpose of the term is to uncover emerging problems in computational molecular biology. CMB has a long history that includes techniques such as sequence alignment, sequencing, physical mapping, and so on. Each of these has a well-developed set of methods for its analysis (such as BLAST). Here we address the next generation of problems. Two recent examples should serve to illustrate the possibilities: SNP (single nucleotide polymorphism) detection, and DNA microarrays. SNPs are locations in DNA at which individuals vary greatly. There are now several molecular technologies for high-throughput SNP detection. SNPs are used as markers for disease gene mapping, and currently play a central role in drug design in pharmacogenomics. Analyzing these data has provided a number of challenging statistical problems, in part because SNPs are not usually a random survey of molecular variation in the genome. DNA microarrays provide a way to study the relative expression levels of genes in different biological backgrounds (e.g., cell cycle data, tumor presence/absence). Parallel assays of expression levels for many thousands of genes simultaneously result in high-dimensional, noisy data. Problems involving image analysis, clustering, and modeling expression profiles are central to many varied and important uses of arrays in human genetics and molecular biology.
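
The following sketch illustrates, on simulated data only (no real expression measurements are implied), the kind of clustering problem microarrays raise: gene profiles are generated from two hypothetical "pathways" plus noise, and hierarchical clustering on a correlation-based distance is used to recover the groups.

```python
# Minimal sketch (hypothetical data): cluster simulated microarray-style
# expression profiles.  Genes in the same latent pathway share a correlated
# expression pattern across conditions; hierarchical clustering on a
# correlation-based distance recovers the groups despite measurement noise.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
n_conditions = 12
pathway_signal = rng.normal(size=(2, n_conditions))           # two latent pathways
genes = np.vstack([pathway_signal[g % 2] + 0.5 * rng.normal(size=n_conditions)
                   for g in range(40)])                        # 40 noisy gene profiles

Z = linkage(genes, method="average", metric="correlation")     # hierarchical clustering
labels = fcluster(Z, t=2, criterion="maxclust")                # cut tree into 2 clusters
print("cluster sizes:", np.bincount(labels)[1:])
```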

We have outlined four possible workshops below. By the nature of the technologies involved, there are a number of overlapping topics. As in any other emerging science, the problems that are now of interest will probably be replaced by others by 2003, so these topics should be treated as illustrative.

Opening Tutorials/Kickoff

The program will open with a one-week tutorial on "Tools for Model and Data Integration in the Genome Sciences" (including a Statistics Tutorial "Refresher in S-Plus" that will provide useful background for the entire year program), followed by a brief minisymposium on "Information integration technologies for complex systems." The tutorial will be aimed at the postdocs, and others with a probability/statistics background. The purpose of the tutorial is 1) to prepare the IMA postdocs and other IMA participants for the Genomics program, 2) to provide graduate students and faculty from universities everywhere (particularly the IMA participating Institutions) with an entrée into modeling/data integration problems in the Genome Sciences and 3) to publicize to the wider community the importance and intellectual excitement involved in the understanding of these complex systems.

Mathematical Topics in the Genomics Program:
Exploratory multivariate analysis (e.g. clustering; model based methods), log-linear models, hidden Markov models, Markov chain Monte Carlo, graphical models for networks, likelihoods and optimization. Image analysis. Branching process models of cell growth and replication. Stochastic models on trees. Inference for dependent data generated by networks.
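
To make one of the listed topics concrete, here is a minimal sketch of the forward algorithm for a toy two-state hidden Markov model over a DNA sequence; the states, transition matrix, and emission probabilities are invented for illustration and are not taken from the program materials.

```python
# Minimal sketch of one listed topic (hidden Markov models): the forward
# algorithm, with per-step scaling, for a toy two-state HMM over DNA.
import numpy as np

trans = np.array([[0.95, 0.05],
                  [0.10, 0.90]])                 # state transition probabilities
emit = np.array([[0.30, 0.20, 0.20, 0.30],       # P(A, C, G, T | background)
                 [0.15, 0.35, 0.35, 0.15]])      # P(A, C, G, T | GC-rich)
start = np.array([0.5, 0.5])
symbol = {"A": 0, "C": 1, "G": 2, "T": 3}

def log_likelihood(seq):
    """Log-likelihood of a DNA string under the toy HMM (forward algorithm)."""
    obs = [symbol[b] for b in seq]
    alpha = start * emit[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]     # forward recursion
        log_lik += np.log(alpha.sum())
        alpha /= alpha.sum()                     # rescale to avoid underflow
    return log_lik

print(log_likelihood("ACGCGGCGCGTATATAACGC"))
```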

Winter Quarter (January - March 2004)

Communication Networks

The Internet and other communication networks are growing and changing in such a way that they present a rapidly moving target for modeling and data collection and analysis. The problems associated with designing, engineering, and managing such rapidly moving and constantly evolving systems have shaped much of networking research in the past and are likely to play an even more important role in the future as the problems acquire a central element of scale, extending well beyond what has previously been considered. For example, there has been a great deal of interest and progress in the past decade in measuring, modeling, and understanding the properties and performance implications of actual traffic flows as they traverse individual links or routers within the network. However, with the imminent deployment of novel scalable network measurement infrastructures and newly designed large-scale network simulators, we will have access to a new generation of data sets of highest-quality network measurements that are of unprecedented volume, are simultaneously collected from a very large number of points within the network, and have an extraordinarily high semantic context. This transition from the traditional single link/router-centered view to a more global or network-wide perspective will have profound implications for trying to describe and understand the dynamic nature of large-scale, complex internetworks such as the global Internet, where the interesting problems are those of interactions, correlations, and heterogeneities in time, space, and across the different networking layers. While these next-generation data sets can be fully expected to continue to reveal tantalizing variability, intriguing fluctuations, and unexpected behaviors, they will also raise many new data analysis and modeling issues and challenge the use of established and well-understood techniques. In particular, the problems of explaining why and how some of the observed phenomena occur, of predicting the stability and performance of truly large-scale networks under alternative future scenarios, and of recommending long-term control strategies are certain to generate new research activities in the mathematical and physical sciences and will remain with us for the foreseeable future. Of course, by 2003, the important questions may look very different from the important questions today, but the characteristics of complex models and massive amounts of data will almost certainly remain for the foreseeable future.
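
The sketch below is purely illustrative of one classical way such traffic data have been examined (it is not the methodology of the workshops): it superposes many ON/OFF sources with heavy-tailed ON and OFF periods and tabulates how the variance of the aggregated traffic decays under time aggregation, the "variance-time" signature often associated with long-range dependence in measured traffic.

```python
# Minimal sketch (illustrative only): aggregate many ON/OFF sources with
# Pareto-distributed ON/OFF durations and compute the variance of the
# aggregated traffic at several aggregation levels.  Slow decay of this
# variance with the aggregation level is one classic indicator of
# long-range dependence.
import numpy as np

rng = np.random.default_rng(2)

def on_off_source(n_slots, alpha=1.5):
    """0/1 activity of a single source with heavy-tailed ON and OFF periods."""
    x = np.zeros(n_slots)
    t, on = 0, True
    while t < n_slots:
        dur = int(np.ceil(rng.pareto(alpha) + 1))   # heavy-tailed duration (slots)
        if on:
            x[t:t + dur] = 1.0
        t += dur
        on = not on
    return x

n_slots, n_sources = 20000, 50
traffic = sum(on_off_source(n_slots) for _ in range(n_sources))

for m in (1, 10, 100, 1000):                         # aggregation levels
    blocks = traffic[: n_slots // m * m].reshape(-1, m).mean(axis=1)
    print(f"aggregation m={m:4d}  variance of aggregated traffic = {blocks.var():.3f}")
```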

The winter program will open with a short course and tutorial on  "The Internet for Mathematicians"   and   "Measurement, Modeling and Analysis of the Internet."   The purpose of the short course and tutorial is 1) to prepare the IMA postdocs and other IMA participants for the Communication Networks program, 2) to provide graduate students and faculty from universities everywhere (particularly the IMA participating Institutions) with an entrée into modeling/data integration problems in Communications Networks and 3) to publicize to the wider community the importance and intellectual excitement involved in the understanding of these complex systems.

Spring Quarter (April-June 2004)

Quantitative Modeling in Finance and Econometrics

This component of the program is concerned with advanced mathematical, statistical, and computational methods in finance and econometrics. Finance has been profoundly influenced by relatively new ideas on how to measure the risk and return of investments. Real progress has come by combining new concepts in finance and risk management with advanced mathematical modeling and an exponential increase in computing power. A more recent development has been the lowering of the cost of acquiring data and information via the Internet. Improved data access allows modelers to implement sophisticated systems, which can be used to make real-time decisions about investing, managing risk, or allocating capital.
 

Mathematical Challenges--Inverse Problems in Asset Pricing Theory. Financial Economics has a very elegant way of characterizing a system of prices which is consistent with no-arbitrage (no free lunch): namely, the existence of a probability measure on future market scenarios such that any contingent claim can be priced as the expected value of its future cash-flows.  This result is due to K. Arrow and G. Debreu. In modern finance, the Arrow-Debreu paradigm is used for pricing and hedging instruments that share the same underlying risks.  These include, most prominently, derivative securities.  Derivatives always exist in a universe in which the underlying asset or assets are present.  The power of the Arrow-Debreu measure is that it allows us (i) to price derivatives in relation to the underlying security and (ii) to make sure that these prices are not subject to arbitrages, i.e. that we are not systematically losing money by trading at certain levels.  The other remarkable feature of the Arrow-Debreu measures is that they form an "interpolation" between the prices of liquidly traded assets and less liquid assets for which price discovery is more difficult.   In particular, the perturbation of the probability that characterizes the equilibrium gives useful information about the market risk of trading positions.  The problem of interest is:

Construct Arrow-Debreu probabilities that are consistent with concrete market situations involving several traded assets and multiple trading dates.
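
A minimal sketch of a discrete version of this inverse problem, with every number invented for illustration (a single maturity, zero interest rate, a handful of synthetic call prices): nonnegative Arrow-Debreu state probabilities on a grid of terminal asset prices are recovered from the pricing equations by nonnegative least squares.

```python
# Minimal, hypothetical sketch: recover Arrow-Debreu state probabilities on a
# price grid from a few observed call prices (synthetic here, zero interest
# rate for simplicity).  The strikes, prices, and grid are invented.
import numpy as np
from scipy.optimize import nnls

grid = np.linspace(50.0, 150.0, 101)                      # terminal asset prices
strikes = np.array([80.0, 90.0, 100.0, 110.0, 120.0])

# Synthetic "market": call prices generated from a lognormal reference measure.
ref = np.exp(-0.5 * (np.log(grid / 100.0) / 0.2) ** 2)
ref /= ref.sum()
call_prices = np.array([(np.maximum(grid - K, 0.0) * ref).sum() for K in strikes])

# Pricing equations: payoff matrix times probabilities = observed prices,
# plus a heavily weighted row enforcing that the probabilities sum to one.
A = np.vstack([np.maximum(grid - strikes[:, None], 0.0), 100.0 * np.ones(len(grid))])
b = np.concatenate([call_prices, [100.0]])
q, _ = nnls(A, b)                                         # nonnegative solution
print("recovered probabilities sum to", q.sum().round(4))
print("repriced calls:", (np.maximum(grid - strikes[:, None], 0.0) @ q).round(3))
```

Because far fewer prices than states are observed, many probability vectors reprice the calls equally well; the nonnegative least-squares routine simply returns one of them, which is exactly the selection issue discussed below.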

So far, only small-dimensional systems have been implemented. It is only in the last few years that we have enough computing power and theoretical understanding to begin to implement large-scale systems. The development of mathematical and computational tools to solve this problem is very important since it is at the crossroads between Asset-Pricing Theory and financial applications of the theory. Its solution should drive the development of new modeling, statistical and computational methodology, and since similar inverse problems arise in other areas, methods developed here should find broader application. Development of new methods in the context of finance has the added benefit of ensuring that their validity will be thoroughly tested through implementation in the markets.

In the simplest models, prices are given as expectations of functions of a diffusion process. The problem then becomes to find a diffusion process satisfying several moment-type constraints. For example, one may be given m option prices and the characteristics of these contracts. The goal is to find a diffusion measure that is consistent with the observed prices (in the Arrow-Debreu sense). This is a mathematically ill-posed problem that is isomorphic to finding a probability measure from a few of its moments. Either no solution exists or there are many possible solutions. Continuous dependence on the data can be problematic.

Since the early 1990s, several solutions have been suggested. Some are parametric in nature and exploit the structure of the equation in clever ways. Unfortunately, these approaches are restricted for the most part to model problems in one dimension and to parametric families of distributions that are not suitable for realistic problems. In reality, the models used by large broker-dealers in financial derivatives make use of multiple risk factors, so we are dealing with multidimensional diffusions and with complex constraints. The question then becomes:

Design stable numerical algorithms for selecting and calibrating financial models to market data that can be applied in the presence of multiple risk factors and many market constraints.
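
The stability issue can be illustrated on a stylized linear calibration problem (the sensitivity matrix, noise level, and regularization weight below are all invented): plain least squares through an ill-conditioned pricing map amplifies small quote errors, while a Tikhonov (ridge) penalty toward zero, a trivial prior, yields a far more stable parameter estimate.

```python
# Minimal sketch (illustrative only): calibrating parameters to noisy prices
# through an ill-conditioned linear pricing map.  Plain least squares amplifies
# the noise; a small Tikhonov (ridge) penalty stabilizes the solution.
import numpy as np

rng = np.random.default_rng(3)
n_prices, n_params = 30, 30

# Build an ill-conditioned sensitivity matrix with rapidly decaying singular values.
U, _ = np.linalg.qr(rng.normal(size=(n_prices, n_prices)))
V, _ = np.linalg.qr(rng.normal(size=(n_params, n_params)))
S = np.diag(np.logspace(0.0, -8.0, n_params))
A = U @ S @ V.T

theta_true = rng.normal(size=n_params)
prices = A @ theta_true + 1e-4 * rng.normal(size=n_prices)    # noisy market quotes

theta_ls = np.linalg.lstsq(A, prices, rcond=None)[0]          # unregularized fit
lam = 1e-3                                                    # regularization weight
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n_params)])
b_aug = np.concatenate([prices, np.zeros(n_params)])
theta_ridge = np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]    # Tikhonov solution

print("error, plain least squares:", np.linalg.norm(theta_ls - theta_true).round(2))
print("error, regularized        :", np.linalg.norm(theta_ridge - theta_true).round(2))
```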

Scientific Interest.  The scientific issues that arise in inverse problems in finance are not merely algorithmic.  They touch upon the foundations of the field of financial economics and serve to validate or to invalidate ideas that remain untested in the markets.  Here there is a big difference from physics and engineering.  Whereas in the latter it is possible to repeat experiments under similar conditions and the models are basically mathematizations of physical laws, we know that no experiment in finance or economics can be reproduced exactly as in the past.  We do not even know the relevant state variables, and consequently, the modeling of pricing probabilities and the selection problem become much more challenging and important than in most inverse problems in physics. 
 

Mathematical Challenges--Monte Carlo Simulation: Asset Pricing, Risk-Management and Asset Allocation in High-Dimensional Systems. This second area of problems has been exploding for several years, and recently there have been several very important developments.

Longstaff-Schwartz-Carriere algorithm for solving free-boundary problems in MC simulation. This breakthrough was long awaited by practitioners. Monte Carlo (MC) simulation is designed for linear problems (evaluation of high-dimensional integrals). The use of MC for American-style options requires a new idea, known as Least Squares Monte Carlo, which essentially performs dynamic programming on a set of non-recombining paths with high accuracy. The mathematical theory and the study of the bias in these clever algorithms have been going full speed ahead since the original paper of Longstaff and Schwartz (1998) came out. The main issue is:

Develop a coherent analysis of Least Squares Monte Carlo algorithms for American options in high dimensional economics.  Develop a theory for understanding numerical errors and statistical biases that arise from dynamic estimation of conditional expectations, early-exercise dates, etc.
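
As a minimal sketch in the spirit of the Longstaff-Schwartz-Carriere approach (not their exact implementation), the code below prices a Bermudan put on a single lognormal asset by regressing discounted future cash flows on a quadratic polynomial basis at each exercise date; all market parameters are illustrative.

```python
# Minimal Least Squares Monte Carlo sketch for a Bermudan put on one
# lognormal asset.  Parameters are illustrative; a production pricer would add
# variance reduction, richer basis functions, and bias diagnostics.
import numpy as np

rng = np.random.default_rng(4)
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0
n_steps, n_paths = 50, 20000
dt = T / n_steps
disc = np.exp(-r * dt)

# Simulate geometric Brownian motion paths (columns are exercise dates dt..T).
z = rng.normal(size=(n_paths, n_steps))
S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z, axis=1))

cash = np.maximum(K - S[:, -1], 0.0)                  # payoff if held to maturity
for t in range(n_steps - 2, -1, -1):
    cash *= disc                                      # discount one step back
    payoff = np.maximum(K - S[:, t], 0.0)
    itm = payoff > 0                                  # regress on in-the-money paths only
    if itm.sum() > 0:
        X = np.vander(S[itm, t], 3)                   # quadratic polynomial basis
        beta, *_ = np.linalg.lstsq(X, cash[itm], rcond=None)
        continuation = X @ beta                       # estimated continuation value
        exercise = payoff[itm] > continuation         # early-exercise decision
        cash[np.where(itm)[0][exercise]] = payoff[itm][exercise]

price = disc * cash.mean()
print(f"Bermudan put (LSMC) ~ {price:.3f}")
```

The in-the-money restriction and the choice of basis are exactly the kinds of modeling decisions whose biases the questions above ask us to understand.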

Large-Scale Dynamic Asset Allocation Models. Since the intertemporal CAPM of Merton and Sharpe, people have been trying to apply dynamic programming ideas to solve allocation problems under different investment horizons and budget constraints. The theory seems sound, but several elements indicate that there will be much more activity here. First, the academic papers assume that there are only one or two assets, that strategies are self-financing, and that utilities are homogeneous. All these assumptions are highly unrealistic. Despite the fact that the papers have been written and the (Nobel) prizes handed out, we expect that computers will finally allow us to actually run investment strategies which are diversified among dozens of assets with reasonable, complex scenarios and intertemporal reallocation according to real-life events. The goal is:

Develop platforms for large-scale asset allocation models (~20 to 50 variables) that produce verifiable results. Include in these models the possibility of decision making by investors and state-contingent optimization. Include non-self-financing portfolios.
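
A minimal, single-period sketch in this direction, with entirely synthetic inputs (expected returns, covariance, and risk aversion are invented): scenario returns for 20 assets are simulated and a long-only, fully invested portfolio is chosen by maximizing a mean-variance utility over the scenarios. A genuine platform of the kind called for above would add multiple periods, state-contingent decisions, and non-self-financing portfolios.

```python
# Minimal, hypothetical sketch of scenario-based allocation across 20 assets
# (single period, long-only, mean-variance objective).  All inputs are synthetic.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n_assets, n_scenarios = 20, 5000
mu = rng.uniform(0.02, 0.10, n_assets)                    # illustrative expected returns
L = 0.02 * rng.normal(size=(n_assets, n_assets))
cov = L @ L.T + 0.01 * np.eye(n_assets)                   # positive-definite covariance
scenarios = rng.multivariate_normal(mu, cov, size=n_scenarios)

risk_aversion = 4.0

def objective(w):
    port = scenarios @ w                                  # portfolio return per scenario
    return -(port.mean() - risk_aversion * port.var())    # maximize mean-variance utility

cons = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]   # fully invested
bounds = [(0.0, 1.0)] * n_assets                          # long-only
w0 = np.full(n_assets, 1.0 / n_assets)
res = minimize(objective, w0, bounds=bounds, constraints=cons)
print("largest portfolio weights:", np.sort(res.x)[-5:].round(3))
```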

 
The spring program will open with a one-week tutorial.   The purpose of the tutorial is 1) to prepare the IMA postdocs and other IMA participants for the Financial Engineering program, 2) to provide graduate students and faculty from universities everywhere (particularly the IMA participating Institutions) with an entrée into modeling and data integration problems in Financial Engineering and 3) to publicize to the wider community the importance and intellectual excitement involved in the understanding of these complex systems.


Long Term Visitors

The following scientists are confirmed or highly likely as long-term visitors during the program. Other long-term visitors are currently being arranged.

