Department of Biomathematical Sciences
Mount Sinai School of Medicine
New York, NY 10029-6574
The ultimate goal of the Human Genome Project is to understand the functioning of living organisms at the molecular, cellular and higher levels. Such understanding holds enormous promise for early detection and treatment of disease. The first step in the genome project has been to sequence the DNA of a variety of organisms, thereby generating an immense quantity of data. Discovering the function of this DNA will depend in large part on computational and mathematical analysis. One particularly informative type of analysis is the search for repetitive patterns in DNA.
DNA is subject to a variety of mutational mechanisms, some of which have the effect of copying part of the DNA from one location in the sequence into another location. Over time, these originally identical copies diverge because of additional mutations. Evolution, which is ever opportunistic, has used these duplication and mutation events to create families of duplicated genes, modify existing genes, create new genes, and extend and adapt regulatory control structures. Recognizing these duplicated pieces has in many cases simplified the functional analysis of DNA.
One of the less well understood mutational mechanisms is tandem duplication. In this process, a stretch of nucleotides is duplicated to produce two or more adjacent copies, resulting in a tandem repeat. Over time, the copies undergo additional mutations, so that typically multiple approximate tandem copies are present. Tandem repeats occur frequently in the human genome, including in the centromeres and telomeres, which are important chromosomal components. They have been shown to cause inherited human diseases, they may play a variety of regulatory and evolutionary roles, and, because of their polymorphic character, they are important laboratory tools for linkage analysis and DNA fingerprinting. In this talk I will discuss an efficient algorithm for detecting tandem repeats in genomic sequence data. Detection is based on k-tuple matching and a collection of statistical filtering criteria.
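To make the idea of k-tuple matching concrete, here is a minimal Python sketch. It is not the algorithm presented in the talk: the function names, the parameters `k`, `max_period` and `min_matches`, and the use of a plain match-count threshold in place of the statistical filtering criteria are all assumptions made for illustration. The sketch records where each k-tuple was last seen and tallies the distance to its next occurrence; a distance that collects many matches is a candidate repeat period.

```python
from collections import defaultdict


def ktuple_distance_counts(seq, k=3, max_period=50):
    """For each distance d, count k-tuples whose previous occurrence lies d positions back."""
    last_seen = {}              # k-tuple -> most recent start position
    counts = defaultdict(int)   # distance -> number of matching k-tuples
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        j = last_seen.get(word)
        if j is not None and i - j <= max_period:
            counts[i - j] += 1
        last_seen[word] = i
    return dict(counts)


def candidate_periods(seq, k=3, max_period=50, min_matches=5):
    """Return distances supported by at least min_matches k-tuple matches.

    The fixed threshold is a hypothetical stand-in for statistical filtering.
    """
    counts = ktuple_distance_counts(seq, k, max_period)
    return sorted(d for d, c in counts.items() if c >= min_matches)


if __name__ == "__main__":
    # Six exact tandem copies of a 5-nucleotide unit.
    print(candidate_periods("ACGTT" * 6))
```

Because only the most recent occurrence of each k-tuple is kept, the scan is a single pass over the sequence. Approximate copies still produce matches wherever k consecutive nucleotides have escaped mutation, which is why a count over many k-tuples, rather than one exact match, is used as evidence.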
An interesting feature of tandem repeats is that the duplicated copies are preserved together, making it possible to do "phylogenetic analysis" on a single sequence. This involves using the pattern of mutations among the copies to determine a minimal or a most likely history for the repeat. A history tries to describe the interwoven pattern of duplication and mutation events in such a way as to minimize the number of identical mutations that arise independently. In this talk I will also describe approaches to algorithmic reconstruction of a tandem repeat history.
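As a rough illustration of what a reconstructed history might look like, the following Python sketch applies a simple greedy heuristic. It is an assumption made for illustration, not one of the approaches described in the talk: it takes copies that are already aligned and of equal length, repeatedly joins the two adjacent copies that differ at the fewest positions, and treats each join as undoing one duplication event. The names `hamming` and `greedy_history`, and the choice to let the left copy stand in for the ancestor, are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length copies differ."""
    if len(a) != len(b):
        raise ValueError("copies must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def greedy_history(copies):
    """Greedily merge the most similar adjacent copies.

    Returns a list of (left_members, right_members, distance) events,
    ordered from the presumed most recent duplication to the oldest.
    Assumption: the left copy of each merged pair represents the ancestor.
    """
    groups = [([i], c) for i, c in enumerate(copies)]
    events = []
    while len(groups) > 1:
        # Only adjacent groups are compared, since tandem duplication
        # produces neighboring copies.
        best = min(
            range(len(groups) - 1),
            key=lambda i: hamming(groups[i][1], groups[i + 1][1]),
        )
        left, right = groups[best], groups[best + 1]
        events.append((left[0], right[0], hamming(left[1], right[1])))
        groups[best:best + 2] = [(left[0] + right[0], left[1])]
    return events


if __name__ == "__main__":
    for event in greedy_history(["ACGTA", "ACGTA", "ACCTA", "TCCTA"]):
        print(event)
```

A greedy merge of this kind does not guarantee a history with the fewest independently arising identical mutations; it only shows the kind of input (aligned copies) and output (an ordered list of duplication events) that a reconstruction works with.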