# Evaluating a Class of Length-Sensitive Algorithms for Sorting by Reversal

Thursday, October 23, 2003 - 9:15am - 9:35am

Keller 3-180

Ron Pinter (Technion-Israel Institute of Technology)

Sorting by reversal (SBR) has been used extensively in comparative genomic studies [3]. Traditionally, bioinformaticians have been trying to minimize the number of reversals and they evaluate results by looking at the trace generated by the algorithm and asking whether it makes biological sense. We have introduced a length sensitive cost measure in an attempt to model the likelihood of reversals based on their length. In this model the cost f(x) of each reversal depends on the length, x, of the reversed sequence; the overall cost of the SBR process is the total of the individual reversals costs.

Initially [4] we looked at f(x)=x, offering a QuickSort-like algorithm which guarantees a provably good approximation of the minimal SBR cost (finding the minimal cost is NP-hard). In response, several biologists suggested we look at the family of functions f(x)=x**alpha. We have developed a class of algorithms [1] that find an approximate cost for any positive value of the exponent alpha, but the question of which value of alpha is best is of great interest.

We decided to make this evaluation by using the cost of sorting one genome to another as a distance between the genomes that is fed to a tool that builds phylogenetic trees, and then compare the results to evolutionary trees found using other methods. This gives rise to numerous methodical and algorithmic issues, such as:

- How many common genes are necessary to draw meaningful conclusions?

- How do we deal with duplicate genes?

- If the number of common genes for the whole dataset under study is too low

- how do we put together partial results (i.e. combining trees that were built on subsets of the sample) and how small can the subsets be?

- Do we really need to rebuild the whole tree or can we accumulate the scores of matches of the partial trees with the reference tree?

- What similarity score between trees is appropriate for this study?

- How do we cope with the fact that our algorithms produce only approximate costs?

But the ultimate question is - how do we scan for the best value of alpha?

The poster will describe the method and the results on two datasets, including the one from [2] which includes 15 genomes, and discuss their merits.

References

[1] Michael A. Bender, Dongdong Ge, Simai He, Haodong Hu, Ron Y. Pinter, Steven Skiena, and Firas Swidan. Improved Bounds on Sorting with Length-Weighted Reversals. To appear in the Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA'04), January 2004.

[2] William Martin, Tamas Rujan, Erik Richly, Andrea Hansen, Sabine Cornelsen, Thomas Lins, Dario Leister, Bettina Stoebe, Masami Hasegawa, and David Penny. Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus. Proc. Natl. Acad. Sci. USA. September 17, 2002; 99 (19): 12246^Ö12251.

[3] Pavel A. Pevzner: Computational Molecular Biology - an Algorithmic Approach, MIT Press, 2000.

[4] Ron Y. Pinter and Steven Skiena. Genomic Sorting with Length-Weighted Reversals. Genome Informatics 13: 103-111 (2002).

Joint work with Michael A. Bender*, Yaniv Berliner**, Dongdong Ge*, Simai He*, Haodong Hu*, Michael Shmoish**, Meir Shoham**, Steven Skiena*, and Firas Swidan**.

* Dept. of Computer Science, SUNY Stony Brook, NY 11794-4400.

** Dept. of Computer Science, Technion, Israel Institute of Technology, Haifa 32000, Israel

Initially [4] we looked at f(x)=x, offering a QuickSort-like algorithm which guarantees a provably good approximation of the minimal SBR cost (finding the minimal cost is NP-hard). In response, several biologists suggested we look at the family of functions f(x)=x**alpha. We have developed a class of algorithms [1] that find an approximate cost for any positive value of the exponent alpha, but the question of which value of alpha is best is of great interest.

We decided to make this evaluation by using the cost of sorting one genome to another as a distance between the genomes that is fed to a tool that builds phylogenetic trees, and then compare the results to evolutionary trees found using other methods. This gives rise to numerous methodical and algorithmic issues, such as:

- How many common genes are necessary to draw meaningful conclusions?

- How do we deal with duplicate genes?

- If the number of common genes for the whole dataset under study is too low

- how do we put together partial results (i.e. combining trees that were built on subsets of the sample) and how small can the subsets be?

- Do we really need to rebuild the whole tree or can we accumulate the scores of matches of the partial trees with the reference tree?

- What similarity score between trees is appropriate for this study?

- How do we cope with the fact that our algorithms produce only approximate costs?

But the ultimate question is - how do we scan for the best value of alpha?

The poster will describe the method and the results on two datasets, including the one from [2] which includes 15 genomes, and discuss their merits.

References

[1] Michael A. Bender, Dongdong Ge, Simai He, Haodong Hu, Ron Y. Pinter, Steven Skiena, and Firas Swidan. Improved Bounds on Sorting with Length-Weighted Reversals. To appear in the Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA'04), January 2004.

[2] William Martin, Tamas Rujan, Erik Richly, Andrea Hansen, Sabine Cornelsen, Thomas Lins, Dario Leister, Bettina Stoebe, Masami Hasegawa, and David Penny. Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus. Proc. Natl. Acad. Sci. USA. September 17, 2002; 99 (19): 12246^Ö12251.

[3] Pavel A. Pevzner: Computational Molecular Biology - an Algorithmic Approach, MIT Press, 2000.

[4] Ron Y. Pinter and Steven Skiena. Genomic Sorting with Length-Weighted Reversals. Genome Informatics 13: 103-111 (2002).

Joint work with Michael A. Bender*, Yaniv Berliner**, Dongdong Ge*, Simai He*, Haodong Hu*, Michael Shmoish**, Meir Shoham**, Steven Skiena*, and Firas Swidan**.

* Dept. of Computer Science, SUNY Stony Brook, NY 11794-4400.

** Dept. of Computer Science, Technion, Israel Institute of Technology, Haifa 32000, Israel