HOME    »    SCIENTIFIC RESOURCES    »    Volumes
SCIENTIFIC RESOURCES

Abstracts and Talk Materials
Group Testing Designs, Algorithms, and Applications to Biology
February 13 - 17, 2012

Keywords of the presentation: Sparse covlutions, polynomials, finite fields, length reduction.

Efficient handling of sparse data is a key challenge in Computer Science. Binary convolutions, such as the Fast Fourier Transform or theWalsh Transform are a useful tool in many applications and are efficiently solved. In the last decade, several problems required efficient solution of sparse binary convolutions.

Both randomized and deterministic algorithms were developed for efficiently computing the sparse FFT. The key operation in all these algorithms was length reduction. The sparse data is mapped into small vectors that preserve the convolution result. The reduction method used to-date was the modulo function since it preserves location (of the ”1” bits) up to cyclic shift.

In this paper we present a new method for length reduction - polynomials. We show that this method allows a faster deterministic computation of the sparse FFT than currently known in the literature. It also enables the development of an efficient algorithm for computing the binary sparse Walsh Transform. To our knowledge, this is the first such algorithm in the literature.

(Joint work with Oren Kappah, Ely Porat, and Amir Rothschild)

Keywords of the presentation: RNA structure characterization, high-throughput sequencing, maximum likelihood estimation

New regulatory roles continue to emerge for both natural and engineered noncoding RNAs, many of which have specific secondary and tertiary structures essential to their function. This highlights a growing need to develop technologies that enable rapid and accurate characterization of structural features within complex RNA populations. Yet, available structure characterization techniques that are reliable are also vastly limited by technological constraints, while the accuracy of popular computational methods is generally poor. These limitations thus pose a major barrier to the comprehensive determination of structure from sequence and thereby to the development of mechanistic understanding of transcriptome dynamics. To address this need, we have recently developed a high-throughput structure characterization technique, called SHAPE-Seq, which simultaneously measures quantitative, single nucleotide-resolution, secondary and tertiary structural information for hundreds of RNA molecules of arbitrary sequence. SHAPE-Seq combines selective 2’-hydroxyl acylation analyzed by primer extension (SHAPE) chemical mapping with multiplexed paired-end deep sequencing of primer extension products. This generates millions of sequencing reads, which are then analyzed using a fully automated data analysis pipeline. Previous bioinformatics methods, in contrast, are laborious, heuristic, and expert-based, and thus prohibit high-throughput chemical mapping.

In this talk, I will review recent developments in experimental RNA structure characterization as well as advances in sequencing technologies. I will then describe the SHAPE-Seq technique, focusing on its automated data analysis method, which relies on a novel probabilistic model of a SHAPE-Seq experiment, adjoined by a rigorous maximum likelihood estimation framework. I will demonstrate the accuracy and simplicity of our approach as well as its applicability to a general class of chemical mapping techniques and to more traditional SHAPE experiments that use capillary electrophoresis to identify and quantify primer extension products.

This is joint work with Lior Pachter, Julius Lucks, Stefanie Mortimer, Shujun Luo, Cole Trapnell, Gary Schroth, Jennifer Doudna and Adam Arkin.

Keywords of the presentation: Group testing, Explicit constructions

The basic goal in combinatorial group testing is to identify a set of up to d defective items within a large population of size n >> d using a pooling strategy. Namely, the items can be grouped together in pools, and a single measurement would reveal whether there are one or more defectives in the pool. The threshold model is a generalization of this idea where a measurement returns positive if the number of defectives in the pool exceeds a fixed threshold u, negative if this number is below a fixed lower threshold L 0.

Keywords of the presentation: function evaluation, competitive analysis, Boolean functions, decision trees, adaptive testing, non-uniform costs

The problem of evaluating a function by sequentially testing a subset of variables whose values uniquely identify the function’s value arises in several domains of computer science. A typical example is from the field of automatic diagnosis in Artificial Intelligence. Automatic diagnosis systems employ a sequence of tests and based on their outcomes, they provide a diagnosis (e.g., a patient has cancer or not, a component of a system is failing or not). In general, it is not necessary to perform all the available tests in order to determine the correct diagnosis. Since different tests might incur also very different costs, it is reasonable to look for testing procedures that minimize the cost incurred to produce the diagnosis. Another example is query optimization, a major issue in the area of databases. Query optimization refers to the problem of defining strategies to evaluate queries in order to minimize the user response time. In a typical database query, thousands of tuples are scanned to determine which of them match the query. By carefully defining a strategy to evaluate the attributes, one can save a considerable amount of computational time. In general, it is not necessary to evaluate all attributes of a tuple in order to determine whether it matches the query or not and a smart choice of the attributes to evaluate first can avoid very expensive and/or redundant attribute evaluations. This talk will focus on the function evaluation problem where tests have non-uniform costs and competitive analysis is employed for measuring the algorithm's performance. For monotone Boolean functions we will present an optimal algorithm, i.e., with the best possible competitiveness. We will also discuss the case of some classes of non-monotone functions.

A Group Testing Approach to Corruption Localizing Hashing
February 16, 2012

Keywords of the presentation: Algorithms, Cryptography, Corruption-Localizing Hashing, Group Testing, Superimposed Codes.

Efficient detection of integrity violations is crucial for the reliability of both data at rest and data in transit. In addition to detecting corruptions, it is often desirable to have the capability of obtaining information about the location of corrupted data blocks. While ideally one would want to always find all changes in the corrupted data, in practice this capability may be expensive, and one may be content with localizing or finding a superset of any changes. Corruption-localizing hashing is a cryptographic primitive that enhances collision-intractable hash functions thus improving the detection property of these functions into a suitable localization property. Besides allowing detection of changes in the input data, corruption-localizing hash schemes also obtain some superset of the changes location, where the accuracy of this superset with respect to the actual changes is a metric of special interest, called localization factor. In this talk we will address the problem of designing corruption-localizing hash schemes with reduced localization factor and with small time and storage complexity. We will show that this problem corresponds to a particular variant of nonadaptive group testing, and illustrate a construction technique based on superimposed codes. This group testing approach allowed to obtain corruption-localizing hash schemes that greatly improve on previous constructions. In particular, we will present a corruption-localizing hash scheme that achieves an arbitrarily small localization factor while only incurring a sublinear time and storage complexity.

Poster - Finding one of m defective elements
December 31, 1969

In contrast to the classical goal of group testing we want to find onw defective elements of $D$ defective elements. We examine four different test functions. We give adaptive strategies and lower bounds for the number of tests. We treat the cases if the number of defectives are known and if the number of defectives are bounded.

Poster - Approximation for Nonunique Probe Selection
December 31, 1969

Given a binary matrix $M$, find submatrix with the same number of columns and minimum number of raws such that all boolean sums of at most $d$ columns are distinct. We show that for any fixed $d$, this problem has no polynomial-time $o(log n)$-approximation unless NP-P. Meanwhile, some approximation algorithms are also presented.

Modern non-invasive brain imaging technologies, such as diffusion weighted magnetic resonance imaging (DWI), enable the mapping of neural fiber tracts in the white matter, providing a basis to reconstruct a detailed map of brain structural connectivity networks. Brain connectivity networks differ from random networks in their topology, which can be measured using small worldness, modularity, and high-degree nodes (hubs). Still, little is known about how individual differences in structural brain network properties relate to age, sex, or genetic differences. Recently, some groups have reported brain network biomarkers that enable differentiation among individuals, pairs of individuals, and groups of individuals. In addition to studying new topological features, here we provide a unifying general method to investigate topological brain networks and connectivity differences between individuals, pairs of individuals, and groups of individuals at several levels of the data hierarchy, while appropriately controlling false discovery rate (FDR) errors. We apply our new method to a large dataset of high quality brain connectivity networks obtained from High Angular Resolution Diffusion Imaging (HARDI) tractography in 303 young adult twins, siblings, and unrelated people. Our proposed approach can accurately classify brain connectivity networks based on sex (93% accuracy) and kinship (88.5% accuracy). We find statistically significant differences associated with sex and kinship both in the brain connectivity networks and in derived topological metrics, such as the clustering coefficient and the communicability matrix.

In the past few years, we have experienced a paradigm shift in human genetics. Accumulating lines of evidence have highlighted the pivotal role of rare genetic variations in a wide variety of traits and diseases. Studying rare variations is a needle in a haystack problem, as large cohorts have to be assayed in order to trap the variations and gain statistical power. The performance of DNA sequencing is exponentially growing, providing sufficient capacity to profile an extensive number of specimens. However, sample preparation schemes do not scale as sequencing capacity. A brute force approach of preparing hundredths to thousands of specimens for sequencing is cumbersome and cost-prohibited. The next challenge, therefore, is to develop a scalable technique that circumvents the bottleneck in sample preparation.

My tutorial will provide background on rare genetic variations and DNA sequencing. I will present our sample prep strategy, called DNA Sudoku, that utilizes combinatorial pooling/compressed sensing approach to find rare genetic variations. More importantly, I will discuss several major distinction from the classical combinatorial due to sequencing specific constraints.

Keywords of the presentation: high-throughput sequencing, combinatorial pooling, liquid-handling, compressed sensing

Finding rare genetic variations in large cohorts requires tedious preparation of large numbers of specimens for sequencing. We are developing a solution, called DNA Sudoku, to reduce prep time and increase the throughput of samples. By using a combinatorial pooling approach, we multiplex specimens and then barcode the pools, rather than individuals, for sequencing in a single lane on the Illumina platform.

We have developed a protocol for quantifying, calibrating, and pooling DNA samples using a liquid-handling robot, which has required a significant amount of testing in order to reduce volume variation. I will discuss our protocol and the steps we have taken to reduce CV. For accurate decoding and to reduce the possibility of specimen dropout, it is important that the DNA samples are accurately quantified and calibrated so that equal amounts can be pooled and sequenced. We can determine the number of carriers in each pool from sequencing output and reconstruct the original identity of individual specimens based on the pooling design, allowing us to identify a small number of carriers in a large cohort.

Superimposed codes
February 14, 2012

There are many instances in Coding Theory when codewords must be restored from partial information, like defected data (error correcting codes), or some superposition of the strings. These lead to superimposed codes, a close relative of group testing problems.

There are lots of versions and related problems, like Sidon sets, sum-free sets, unionfree families, locally thin families, cover-free codes and families, etc. We discuss two cases cancellative and union-free codes.

A family of sets Ƒ (and the corresponding code of 0-1 vectors) is called union-free if A ∪ B≠ C ∪ D and A,B,C,D ∈ F imply {A,B} = {C,D}. Ƒ is called t-cancellative if for all distict t + 2 members A1, … ,At and B,C ∈ Ƒ

A1 ∪ ... ∪ At ∪ B ≠ A1 ∪ … At ∪C:

Let ct(n) be the size of the largest t-cancellative code on n elements. We significantly improve the previous upper bounds of Körner and Sinaimeri, e.g., we show c2(n) ≤ 20:322n (for n > n0).

Tutorial: Locally decodable codes
February 17, 2012

Keywords of the presentation: error correcting codes, local decoding

Tutorial: Locally decodable codes

Locally decodable codes are error correcting codes with the extra property that, in order to retrieve the correct value of any one position of the input with high probability, it is sufficient to read a small number of positions of the corresponding, possibly corrupted codeword. In applications involving large data sets it is desirable to have decoding procedures that work in sublinear time, that is, without having to access all of the encoded data. Locally decodable codes allow very efficient decoding procedures for recovering individual parts of the original data: not only with sublinear, but sometimes even small constant running times. I will survey constructions of locally decodable codes and the currently known tradeoffs between parameters of such codes.

Tutorial: Sparse signal recovery
February 13, 2012

Keywords of the presentation: streaming algorithms, group testing, sparse signal recovery, introduction.

My talk will be a tutorial about sparse signal recovery but, more importantly, I will provide an overview of what the research problems are at the intersection of biological applications of group testing, streaming algorithms, sparse signal recovery, and coding theory. The talk should help set the stage for the rest of the workshop.

Despite the rapid decrease in sequencing costs, sequencing a large group of individuals is still prohibitively expensive. Recently, several sophisticated pooling designs were suggested that can identify carriers of rare alleles in large cohorts with a significantly smaller number of lanes, thus dramatically reducing the cost of such large scale genotyping projects. These approaches all use combinatorial pooling designs where each individual is either present in a pool or absent from it. One can then infer the number of carriers in a pool, and by combining information across pools, reconstruct the identity of the carriers.

We show that one can gain further efficiency and cost reduction by using “weighted” designs, in which different individuals donate different amounts of DNA to the pools. Intuitively, in this situation the number of mutant reads in a pool does not only indicate the number of carriers, but also of the identity of the carriers.

We describe and study a simple but powerful example of such weighted designs, with non-overlapping pools. We demonstrate that even this naive approach is not only easier to implement and analyze but is also competitive in terms of accuracy with combinatorial designs when identifying very rare variants, and is superior to the combinatorial designs when genotyping more common variants.

We then discuss how weighting can be further incorporated into existing designs to increase their accuracy and demonstrate the resulting improvement in reconstruction efficiency using simulations. Finally, we argue that these weighted designs have enough power to facilitate detection of common alleles, so they can be used as a cornerstone of whole-exome or even whole-genome sequencing projects.

Keywords of the presentation: combinatorial pooling, DNA sequencing and assembly, genomics, next-generation sequencing,

We propose a new sequencing protocol that combines recent advances in combinatorial pooling design and second-generation sequencing technology to efficiently approach de novo selective genome sequencing. We show that combinatorial pooling is a cost-effective and practical alternative to exhaustive DNA barcoding when dealing with hundreds or thousands of DNA samples, such as genome-tiling gene-rich BAC clones. The novelty of the protocol hinges on the computational ability to efficiently compare hundreds of million of short reads and assign them to the correct BAC clones so that the assembly can be carried out clone-by-clone. Experimental results on simulated data for the rice genome show that the deconvolution is extremely accurate (99.57% of the deconvoluted reads are assigned to the correct BAC), and the resulting BAC assemblies have very high quality (BACs are covered by contigs over about 77% of their length, on average). Experimental results on real data for a gene-rich subset of the barley genome confirm that the deconvolution is accurate (almost 70% of left/right pairs in paired-end reads are assigned to the same BAC, despite being processed independently) and the BAC assemblies have good quality (the average sum of all assembled contigs is about 88% of the estimated BAC length).

Joint work with D. Duma (UCR), M. Alpert (UCR), F. Cordero (U of Torino), M. Beccuti (U of Torino), P. R. Bhat (UCR and Monsanto), Y. Wu (UCR and Google), G. Ciardo (UCR), B. Alsaihati (UCR), Y. Ma (UCR), S. Wanamaker (UCR), J. Resnik (UCR), and T. J. Close (UCR).

Preprint available at http://arxiv.org/abs/1112.4438

Poster - Upgraded Separate Testing of Inputs in Compressive Sensing
December 31, 1969

Screening experiments (SE) deal with finding a small number s Active Inputs (AIs) out of a vast total amount t of inputs in a regressionmodel. Of special interest in the SE theory is finding the so-called maximal rate (capacity). For a set of t elements, denote the set of its s-subsets by (s, t). Introduce a noisy system with t binary inputs and one output y. Suppose that only s 0. The f-capacity Cf (s) = limt!1(log t/Nf (s, t, )) for any > 0 is the ‘limit for the per- formance of the f-analysis’. We obtained tight capacity bounds [4], [5] by formalizing CSS as a special case of Multi-Access Communication Channels (MAC) of information transmission capacity region construction developed by R. Ahlswede in [1] and comparing CSS’ maximal rate (capacity) with small error for two practical methods of outputs’ analysis under the optimal CSS design motivated by applications like [2] . Recovering Active Inputs with small error probability and accurate parameter estimation are both possible with rates less than capacity and impossible with larger rates. A staggering amount of attention was recently devoted to the study of compressive sensing and related areas using sparse priors in over parameterized linear models which may be viewed as a special case of our models with continuous input levels. The threshold phenomenon was empirically observed in early papers [3], [6] : as the dimension of a randominstance of a problem grows there is a sharp transition from successful recovery to failure as a function of the number of observations versus the dimension and sparsity of the unknown signal. Finding this threshold is closely related to our capacity evaluation. Some threshold bounds for the compressive sensing were made using standard information-theoretic tools, e.g. in [15]

A staggering amount of attention was recently devoted to the study of compressed sensing and related areas using sparse priors in over parameterized linear models under versions of linear programming (LP) analysis. More recently popularity of the sparsity approach for the classical model of group testing also increased. The threshold phenomenon was empirically observed in early papers of Donoho et al : as the dimension of a random instance of a problem grows there is a sharp transition from successful recovery to failure as a function of the number of observations versus the dimension and sparsity of the unknown signal. There are strong parallels between this threshold behavior and earlier Shannon's justification for similar phenomenon in terms of the capacity of the information transmission. We outline sharp asymptotical bounds for general models obtained by 1979 that were inspired by the Multi-Access Capacity Region construction that can greatly enhance the current understanding of recovery of active inputs in sparse compressive sensing. These bounds were obtained for asymptotically optimal designs which include random designs and thus imply upper bounds for performance of recovery under arbitrary designs. These optimal designs are a natural choice in settings where the experiment can be predesigned, providing in addition a competitive performance of a simplified analysis method: separate testing of inputs (STI). STI originated in 1959 and can be further traced back to Fisher's idea of randomization in estimation. The STI capacity exceeds that of LP relaxation for randomized designs in a wide range of models and admits several straightforward generalizations including significantly more powerful and still computationally feasible greedy version. This will be demonstrated in our simulations.

Keywords of the presentation: group testing, MAC channel, quantization, graphical models

We consider a novel group testing framework where defectives obey a probabilistic model in which the number and set of defectives is governed by a graphical model. We furthermore assume that the defectives have importance factors that influence their strength on the test outcomes and the accuracy with which the outcomes are read. This kind of scenario arises, for example, in MAC channels with networked users using different communication powers, or in DNA pooling schemes which involve large families.

Keywords of the presentation: dimensionality reduction, johnson-lindenstrauss, numerical linear algebra, massive data

The Johnson-Lindenstrauss (JL) lemma (1984) states that any n points in d-dimensional Euclidean space can be embedded into k = O((log n)/eps^2) dimensions so that all pairwise distances are preserved up to 1+/-eps. Furthermore, this embedding can be achieved via a linear mapping. The JL lemma is a useful tool for speeding up solutions to several high-dimensional problems: closest pair, nearest neighbor, diameter, minimum spanning tree, etc. It also speeds up some clustering and string processing algorithms, reduces the amount of storage required to store a dataset, and can be used to reduce memory required for numerical linear algebra problems such as linear regression and low rank approximation.

The original proofs of the JL lemma let the linear mapping be specified by a random dense k x d matrix (e.g. i.i.d. Gaussian entries). Thus, performing an embedding requires dense matrix-vector multiplication. We give the first construction of linear mappings for JL in which only a subconstant fraction of the embedding matrix is non-zero, regardless of how eps and n are related, thus always speeding up the embedding time. Previous constructions only achieved sparse embedding matrices for 1/eps >> log n.

This is joint work with Daniel Kane (Stanford).

Keywords of the presentation: biological networks, graph algorithms, network alignment

Sequence-based computational approaches have revolutionized biological understanding. However, they can fail to explain some biological phenomena. Since proteins aggregate to perform a function instead of acting in isolation, the connectivity of a protein interaction network (PIN) will provide additional insight into the inner working on the cell, over and above sequences of individual proteins. We argue that sequence and network topology give insights into complementary slices of biological information, which sometimes corroborate each other, but sometimes do not. Hence, the advancement depends on the development of sophisticated graph-theoretic methods for extracting biological knowledge purely from network topology before being integrated with other types of biological data (e.g., sequence). However, dealing with large networks is non-trivial, since many graph-theoretic problems are computationally intractable, so heuristic algorithms are sought.

Analogous to sequence alignments, alignments of biological networks will likely impact biomedical understanding. We introduce a family of topology-based network alignment (NA) algorithms, (that we call GRAAL algorithms), that produces by far the most complete alignments of biological networks to date: our alignment of yeast and human PINs demonstrates that even distant species share a surprising amount of PIN topology. We show that both species phylogeny and protein function can be extracted from our topological NA. Furtermore, we demonstrate that the NA quality improves with integration of additional data sources (including sequence) into the alignment algorithm: surprisingly, 77.7% of proteins in the baker’s yeast PIN participate in a connected subnetwork that is fully contained in the human PIN suggesting broad similarities in internal cellular wiring across all life on Earth. Also, we demonstrate that topology around cancer and non-cancer genes is different and when integrated with functional genomics data, it successfully predicts new cancer genes in melanogenesis-related pathways.

Keywords of the presentation: protein interaction networks, kinetics, affinity, protein arrays, surface plasmon resonance

Angiogenesis, consisting in the formation of blood vessels from preexisting ones, is of crucial importance in pathological situations such as cancer and diabetes and is a therapeutic target for these two diseases. We have developed protein and sugar arrays probed by surface plasmon resonance (SPR) imaging to identify binding partners of proteins, polysaccharides and receptors regulating angiogenesis. Interactions collected from our own experiments and from literature curation have been used to build a network comprising protein-protein and protein-polysaccharide interactions. To switch from a static to a dynamic and quantitative interaction network, we have measured kinetics and affinity of interactions by SPR to discriminate transient from stable interactions and to prioritize interactions in the network. We have also identified protein interaction sites either experimentally or by molecular modeling to discriminate interactions occurring simultaneously from those taking place sequentially. The ultimate step is to integrate, using bioinformatics tools, all these parameters in the interaction network together with inhibitors of these interactions and with gene and protein expression data available from Array Express or Gene Expression Omnibus and from the Human Protein Atlas. This dynamic network will allow us to understand how angiogenesis is regulated in a concerted fashion via several receptors and signaling pathways, to identify crucial interactions for the integrity of the network that are potential therapeutic targets, and to predict the side effects of anti-angiogenic treatments.

Keywords of the presentation: Code Concatenation, Coding theory, Group Testing, List Decoding, List Recovery

Group testing was formalized by Dorfman in his 1943 paper and was originally used in WW-II to identify soldiers with syphilis. The main insight in this application is that blood samples from different soldiers can be combined to check if at least one of soldiers in the pool has the disease. Since then group testing has found numerous applications in many areas such as (computational) biology, combinatorics and (theoretical) computer science.

Theory of error-correcting codes, or coding theory, was born in the works of Shannon in 1948 and Hamming in 1950. Codes are ubiquitous in our daily life and have also found numerous applications in theoretical computer science in general and computational complexity in particular.

Kautz and Singleton connected these two areas in their 1964 paper by using "code concatenation" to design good group testing schemes. All of the (asymptotically) best know explicit constructions of group testing schemes use the code concatenation paradigm. In this talk, we will focus on the "decoding" problem for group testing: i.e. given the outcomes of the tests on the pools, identify the infected soldiers. Recent applications of group testing in data stream algorithm require sub-linear time decoding, which is not guaranteed by the traditional constructions.

The talk will first survey the Kautz-Singleton construction and then will will show how recent developments in list decoding of codes lead in a modular way to sub-linear time decodable group testing schemes.

Keywords of the presentation: superimposed codes, screening designs, rate of code, threshold designs and codes

We will discuss superimposed codes and non-adaptive group testing designs arising from the potentialities of compressed genotyping models in molecular biology. The given survey is also motivated by the 30th anniversary of our recurrent upper bound on the rate of superimposed codes published in 1982.

Testing Boolean functions
February 14, 2012

I will talk about property testing of Boolean functions, concentrating on two topics: testing monotonicity (an example of a combinatorial property) and testing the property of being a low-degree polynomial (an algebraic property).

Keywords of the presentation: Genomewide association studies, Privacy, Pooled designs, Hypothesis testing, Local Asymptotic normality

Statistical power to detect associations in genome-wide association studies can be enhanced by combining data across studies in meta-analysis or replication studies. Such methods require data to flow freely in the scientific community, however, and this raises privacy concerns.

Till recently, many studies pooled individuals together, making only the allele frequencies of each SNP in the pool publicly available. However a technique that could be used to detect the presence of individual genotypes from such data prompted organizations such as the NIH to restrict public access to summary data . To again allow public access to data from association studies, we need to determine which set of SNPs can be safely exposed while preserving an acceptable level of privacy.

To answer this question, we provide an upper bound on the power achievable by any detection method as a function of factors such as the number and the allele frequencies of exposed SNPs, the number of individuals in the pool, and the false positive rate of the method. Our approach is based on casting the problem in a statistical hypothesis testing framework for which the likelihood ratio test (LR-test) attains the maximal power achievable.

Our analysis provides quantitative guidelines for researchers to make SNPs public without compromising privacy. We recommend, based on our analysis, that only common independent SNPs be exposed. The final decision regarding the exposed SNPs should be based on the analytical bound in conjunction with empirical estimates of the power of the LR test. To this end, we have implemented a tool, SecureGenome, that determines the set of SNPs that can be safely exposed for a given dataset.

From screening clone libraries to detecting biological agents
February 17, 2012

Keywords of the presentation: (generalized) group testing, screening clone libraries, DNA microarrays

Group testing has made many contributions to modern molecular biology. In the Human Genome Project large clone libraries where created to amplify DNA. Widely used group testing schemes vastly accelerated the detection of overlaps between the individual clones in these libraries through experiments, realizing savings both in effort and materials.

Modern molecular biology also contributed to group testing. The problem of generalized group testing (in the combinatorial sense) arises naturally, when one uses oligonucleotide probes to identify biological agents present in a sample. In this setting a group testing design cannot be chosen arbitrarily. The possible columns of a group testing design matrix are prescribed by the biology, namely by the hybridization reactions between target DNA and probes

Keywords of the presentation: compressed sensing, group testing, genetics, rare alleles

Identification of rare variants by resequencing is important both for detecting novel variations and for screening individuals for known disease alleles. New technologies enable low-cost resequencing of target regions, although it is still prohibitive to test more than a few individuals. We propose a novel pooling design that enables the recovery of novel or known rare alleles and their carriers in groups of individuals. The method is based on combining next-generation sequencing technology with a Compressed Sensing (CS) approach. The approach is general, simple and efficient, allowing for simultaneous identification of multiple variants and their carriers. It reduces experimental costs, i.e., both sample preparation related costs and direct sequencing costs, by up to 70 fold, and thus allowing to scan much larger cohorts. We demonstrate the performance of our approach over several publicly available data sets, including the 1000 Genomes Pilot 3 study. We believe our approach may significantly improve cost effectiveness of future association studies, and in screening large DNA cohorts for specific risk alleles.

We will present initial results of two projects that were initiated following publication. The first project concerns identification of de novo SNPs in genetic disorders common among Ashkenazi Jews, based on sequencing 3000 DNA samples. The second project in plant genetics involves identifying SNPs related to water and silica homeostasis in Sorghum bicolor, based on sequencing 3000 DNA samples using 1-2 Illumina lanes.

Joint work with Amnon Amir from the Weizmann Institute of Science, and Or Zuk from the Broad Institute of MIT and Harvard

Keywords of the presentation: combinatorial group testing, smart-pooling, interactome mapping

Group testing, also know as smart-pooling, is a promising strategy for achieving high efficiency, sensitivity, and specificity in systems-level projects. It consists in assaying well-chosen pools of probes, such that each probe is present in several pools, hence tested several times. The goal is to construct the pools so that the positive probes can usually be directly identified from the pattern of positive pools, despite the occurrence of false positives and false negatives. While striving for this goal, two interesting mathematical or computational problems emerge: the pooling problem (how should the pools be designed?), and the decoding problem (how to interpret the outcomes?). In this talk I will discuss these questions and the solutions we have proposed: a flexible and powerful combinatorial construction for designing smart-pools (the Shifted Transveral Design, STD), and an efficient exact algorithm for interpreting results (interpool). I will then present the results of validation experiments that we have performed in the context of yeast two-hybrid interactome mapping.

Tutorial: The Data Stream Model
February 16, 2012

The data stream model has emerged as a way of analyzing algorithmic efficiency in the presence of massive data sets. Typically the algorithm is allowed a few (usually one) passes over the input, and must use limited memory and have very fast per input item processing time. I will give a survey of algorithms and lower bounds in this area, with an emphasis on problems such as norm estimation, numerical linear algebra, empirical entropy, l_p-sampling, and heavy hitters. Time-permitting I'll also discuss the extension to functional monitoring, in which there are multiple sites each with a data stream and a central coordinator should continuously maintain a statistic over the union of the data streams.

Keywords of the presentation: metagenomics, sparse representation, compressed seqeuncing

Determining the identities and frequencies of species present in a sample is a central problem in metagenomics, with scientific, environmental and clinical implications.

A popular approach to the problem is sequencing the Ribosomal 16s RNA gene in the sample using universal primers, and using variation in the gene's sequence between different species to identify the species present in the sample. We present a novel framework for community reconstruction, based on sparse representation; while millions of microorganisms are present on earth, with known 16s sequences stored in a database, only a small minority (typically a few hundreds) are likely to be present in any given sample,

We discuss the statistical framework, algorithms used and results in terms of accuracy and species resolution.

 Connect With Us: Go
 © 2015 Regents of the University of Minnesota. All rights reserved. The University of Minnesota is an equal opportunity educator and employer Last modified on October 06, 2011 Twin Cities Campus:   Parking & Transportation   Maps & Directions Directories   Contact U of M   Privacy