February 27 - March 2, 2012
Network dynamics may be viewed as a process of change in the edge
structure of a network, in the vertex set on which edges are defined,
or in both simultaneously. Though early studies of such processes were
primarily descriptive, recent work on this topic has increasingly
turned to formal statistical models. While showing great promise, many
of these modern dynamic models are computationally intensive and scale
very poorly in the size of the network under study and/or the number
of time points considered. Likewise, currently employed models focus
on edge dynamics, with little support for endogenously changing vertex
sets. Here, we show how an existing approach based on logistic
network regression can be extended to serve as a highly scalable
framework for modeling large networks with dynamic vertex sets. We
place this approach within a general dynamic exponential family (ERGM)
context, clarifying the assumptions underlying the framework (and
providing a clear path for extensions), and show how model assessment
methods for cross-sectional networks can be extended to the dynamic
case. To illustrate this method and accompanying adequacy assessment
we apply dynamic logistic network regression on first a classic data
set involving interactions among windsurfers on a California beach,
and second on a large network of organizational collaboration
occurring during the 2005 Hurricane Katrina disaster. The
organizational collaboration network allows us to demonstrate that
network vital dynamics are a key component of network evolution, and
larger, system-level factors (e.g., geography) are critical for
accurate prediction of network structure. Implications of these
findings on the greater network literature and future of data
collection are also discussed.
Keywords of the presentation: molecular networks, process models, biological pathways, disease mechanisms
Human genetics research traditionally identifies genes underlying a phenotype using association (correlation) analysis. A gene mutation within a population is associated with a phenotype, such as a disease. Biological pathways represent knowledge about molecules, processes and their interactions. Maps of such pathways are used to design and analyze experiments, and for predicting the behavior of biological systems. Pathway information can help explain how a genetic perturbation leads to an altered phenotype, and thus helps move from association (correlation) at the genetic variant or gene levels to mechanism (causality) at the cellular and physiological levels. Pathways are represented as networks, but capture process information (one step leads to another). The increasing amount of pathway information available in public databases represents a major opportunity for network research, as network analysis algorithms must be adapted to pathway information. We are building a number of database and software resources to help access, visualize and analyze pathway information, including Pathway Commons as a convenient single point of access to diverse biological pathway information translated to a common data language (BioPAX), GeneMANIA and Cytoscape. These collaborative projects are important steps towards the development of a complete and integrated computable map of the cell across all species and developmental stages.Read More...
Keywords of the presentation: bioinformatics, stochastic block model, clustering, community detection
Biological networks are dynamic, driven by processes such as development and disease that change the expressed genes and proteins and modify interactions. Understanding how networks remodel could reveal the etiology of developmental disorders and suggest new drug targets for infectious disease. We present new methods that predict how networks change over time through joint analysis of time-domain and static data. For each single network snapshot, we infer a hierarchical stochastic block model that describes patterns of heterogeneous edges connecting inferred groups of vertices. Across snapshots we introduce a time-evolution operator similar to a path integral or hidden Markov model for group membership. The method is fully Bayesian, with no adjustable parameters. Our methods provide superior performance for predicting new interactions and protein complex co-membership. Applied to dynamic data, the method reveals new regulatory mechanisms for controlling the activity of multi-protein complexes. These methods may have broad applicability to social networks and other dynamic networks.Read More...
Previous related work:
GeneMANIA (http://www.genemania.org) is a flexible tool providing researchers with a single point of access for generating hypotheses about gene function by querying large‐scale publicly available genomics and proteomics datasets. GeneMANIA finds other genes that are related to a set of input genes, using a very large set of functional association data. Association data include protein and genetic interactions, pathways, co-expression, co-localization and protein domain similarity. You can use GeneMANIA to find new members of a pathway or complex, find additional genes you may have missed in your screen or find new genes with a specific function, such as protein kinases. Your question is defined by the set of genes you input. If members of your gene list make up a protein complex, GeneMANIA will return more potential members of the protein complex. If you enter a gene list, GeneMANIA will return connections between your genes, within the selected datasets.
Genetic interactions provide a powerful perspective into biological processes that is fundamentally different from other high-throughput genome-wide studies. Recently, Synthetic Genetic Array technology was used to produce the first genome scale map of digenic genetic interactions, which covered ~5.4 million genetic interactions or about ~30% of all possible gene pairs in yeast. This provides a first opportunity for a global, unbiased assessment of the structure of the genetic interaction network and the relationship between this structure and individual gene function. We developed a data mining approach based on association rule learning to exhaustively discover all block structures within the yeast genetic interaction network, producing a complete modular decomposition of the network. We find that genetic interaction hubs can be clearly differentiated into distinct classes of hubs based on their modular structure. Moreover, module membership provides a specific and unbiased assessment of the prevalence of multi-functionality among genes: we find that genes participate in far more functions or contexts than was previously appreciated. In addition, we find that genetic interactions contained within structured modules exhibit strikingly different functional properties relative to isolated interactions, providing insight into the evolution and functional divergence of duplicate genes. Finally, we used module membership to explore the evolution of module redundancy within the cell and find that while there is often age coherency within modules, modules that buffer each other often arise at different times indicating that much cellular redundancy is a product of subsequent diversification.
Keywords of the presentation: dynamic social networks, temporal networks, communities, stream discretization, periodic subgraphs
Computation has fundamentally changed the way we study nature. Recent
breakthroughs in data collection technology, such as GPS and other
mobile sensors, high definition cameras, satellite images, and genotyping, are
giving biologists access to data about wild populations, from genetic to
social interactions, which are orders of magnitude richer than any
previously collected. Such data offer the promise of answering some of
the big questions in population biology: Why do animals form social
groups and how do genetic ties affect this process? Which individuals
are leaders and to what degree do they control the behavior of others?
How do social interactions affect the survival of a species?Read More...
Unfortunately, in this domain, our ability to analyze data lags
substantially behind our ability to collect it. In this talk I will show
how computational approaches can be part of every stage of the
scientific process, from data collection (identifying individual zebras
from photographs) to hypothesis formulation (by designing a novel
computational framework for analysis of dynamic social networks).
Keywords of the presentation: Influence Maximization, Viral Marketing, Diffusion
One of the key objectives of viral marketing is to identify a small set of users in a social network, who when convinced to adopt a product will influence others leading to a large number of adoptions in an expected sense. The seminal work of Kempe et al. approaches this as the problem of influence maximization that tacitly assumes that a user who is influenced (or, informed) about a product necessarily adopts it and encourages her friends to do the same. However, an influenced user may not adopt the product herself, and yet form an opinion based on the experiences of her friends, and share this opinion with others. Furthermore, a user who adopts the product may not like the product and hence not encourage her friends to adopt it. Previous works do not account for these phenomena.
We argue that it is important to distinguish product adoption from influence. We propose a model that factors in a user’s experience (or projected experience) with a product. We adapt the classical Linear Threshold (LT) propagation model by defining an objective function that explicitly captures product adoption, as opposed to influence. We show that under our model, adoption maximization is NP-hard and the objective function is monotone and submodular, thus admitting an approximation algorithm. Our empirical results on three datasets show that our model is able to distinguish between influence and adoption, and predict product adoption much more accurately than the classical LT model.
This is joint work with Amit Goyal and Laks V. S. Lakshmanan.
Understanding how information propagates in a social network has been an active area of research over the last few decades. Early mathematical models were developed to understand how consensus is reached in simple networks. With the advent of social media which can connect millions of users, there has been resurgent interest in the problem. For instance, Frongillo et al., have examined how individuals in a social network can optimally exchange information about a dynamically changing parameter.
It is frequently assumed that discussion improves the decisions of individuals. However, such exchanges can correlate the beliefs of different individuals in a social network, and negatively impact a collective decision. Indeed, even individual decisions may suffer after an exchange of information. Bahrami group  observed that discussion can have a positive impact on decision making only if the observers have approximately equal competence. Redundancies are likely to be present in social networks, specifically in large networks with many edges. In such situations, how should observers integrate incoming information to achieve the best possible estimate and reach the best decisions?
We study a simple graphical Bayesian model of interacting observers who try to estimate a value in the outside world with their own observations and their local neighborhood information. We study in detail feed forward and recurrent network and establish a general result on how the individuals can achieve optimal Bayesian inference in the presence of redundancies. Such redundancies impact the performance of even optimal observers. We also consider the propagation of individuals biases and priors in making decisions and show how certain network structures impact an individuals’ maximum-likelihood (ML) estimate.
 B. Bahrami, K. Olsen, P. E.Latham, A. Roepstorff, G. Rees, C.D.Frith, Optimally Inter- acting Minds. Science Vol 329, 2010
 R.M.Frongillo, G.S.Schoenebeck, O.Tamuz. Social Learning in a Changing World arXiv:1109.5482v1, cs.SI, 2011
A popular approach to map protein-protein interaction networks is called affinity-purification followed by mass spectrometry (AP-MS), where interactors of a given protein of interest (bait) are identifying by pulling down a tagged version of the bait, together with any protein attached to it, and identifying the extracted proteins by mass spectrometry. In this talk, I will discuss statistical and algorithmic problems related to the inference and analysis of interaction networks obtained by AP-MS. I will first discuss our Bayesian approach to separate true interactors from contaminants. AP-MS are subject to identifying direct as well as indirect interactions; separating the two translates in a challenging algorithmic problem I will discuss. Finally, I will talk about our efforts to assign putative function to specific subnetworks by seeking locally-enriched GO gene annotations and sequence motifs.
This project analyzes networks of actors and statements concerning climate change at two levels, the national and the global. Each level has its own two-mode actor-statement network.
The ways that nations frame climate change (from newspaper content analysis) comprises the global field of climate change discourse. Within each nation, different sets of actors support different climate change framings. This project provides the basis for a proposed global real-time national climate change response monitoring and data bank system.
Keywords of the presentation: Internet, wireless network, social network, economic network, education
A new undergraduate course has just been created at Princeton, called "Networks: Friends, Money, and Bytes". It talks about technology, social, and economic networks, and attracts students from engineering, science, and economics departments. The course content cuts across the boundaries of different types of networks without losing domain-specific functionalities. The course's pedagogical approach follows "Just In Time": orienting the entire course around 20 Questions about networked life and only introduces the mathematical languages as needed for each of the questions. I'll discuss what I learned through teaching this course and implications to networking research and curriculum development.
Keywords of the presentation: genetic interactions, epistasis, BPM network motif, local search, maximal cut
One of the most well-studied and useful network motifs found
in genetic interaction data is the between-pathway model (BPM),
introduced first by Kelley and Ideker. This is a network motif
consisting of a particular pattern of genetic and physical
interactions that is thought to signify two coherent sets of genes
that may be compensatory or adaptive. We generalize the BPM motif from
unweighted to weighted networks, allowing us to uncover new BPMs based
on the new high-throughput genetic interaction data becoming available
through the use of EMAP and SGA screening methods. We present a
randomized algorithm based on local search for maximal cuts that is
mathematically natural, algorithmically simple, and fast in practice
that can discover these generalized BPMs. We discuss biological
findings and implications of our approach.
One of the defining properties of small worlds is the prevalence of
short paths connecting node pairs. Unfortunately, as a result the usual
notion of distance is not particularly helpful in distinguishing
neighborhoods in such graphs.
We describe a motivating problem that requires a finer-grained notion of
distance. The problem is quite simple to state: how can any given
network operator in the Internet determine which paths pass through its
network? Surprisingly, the nature of Internet routing makes this
question rather hard to answer.
To address this problem, we define a new distance metric on graph nodes.
This metric has useful and interesting properties: it is easy to compute
and understand, it can be used to sharply distinguish neighborhoods in
networks, and it remains useful even in small-world networks. We show
how we use this metric to address our motivating problem, and more
generally how it can be used for visualization and dimensionality
reduction of complex networks.
Keywords of the presentation: subnetwork markers, predicting chemotherapy response, protein-protein interaction networks
Although molecular profiles of tumor samples have been widely applied to the classification of clinical outcomes, the lack of reproducibility and generalizability has limited use of the clinical use of markers identified in these studies. In particular, finding robust predictors of treatment response has proved to be especially challenging. Recent studies integrating protein-protein interaction (PPI) networks with gene expression profiles have demonstrated that subnetwork (SN) markers offer clear advantages in classification problems, where these SN markers consist of synergistic genes whose aggregate expression discriminate groups of samples. However, existing methods for constructing SNs do not guarantee an optimal solution and have yet to be applied towards predicting individual patient response to drug treatment.
We developed a novel and efficient randomized algorithm to identify optimally discriminative SNs for classification of samples from different classes. Our algorithm is based on a color coding paradigm, which allows for identifying the optimal discriminative SN markers for any given error probability. When the maximum size of a SN marker is k = O(logn) where n is the size of the network, we have a polynomial time algorithm with a fixed error probability. We demonstrated our method on a large published dataset of drug response comprising two independent cohorts of breast patients treated with a chemotherapy regimen. We compared the classification performance of our optimally discriminative SN markers against SN markers identified using heuristics and as well as numerous single gene (SG) markers. On average, the optimally discriminative SN markers show both the best and most stable classification performance in cross-dataset validation. We also show that our subnetwork method produces predictive markers that are more reproducible across independent cohorts and offer valuable insight into biological processes underlying response to therapy.
Being a primary indicator of network health, link traffic volumes are used in multiple network management and diagnostic tasks. Although link volumes are available using off-the-shelf tools, the corresponding measurement records typically contain errors and missing data. To overcome these challenges, the present paper develops a link traffic prediction algorithm that fills missing entries and removes noise from the observed entries in an online fashion. The algorithm not only exploits topological knowledge of the network, but also learns from the available historical link traffic data. During its operational phase, the novel algorithm relies on a sparse signal representation for the link counts over a data-driven dictionary. Prediction of link counts follows after solving an $ell_1$-regularized least-squares problem. Prior to operation however, a dictionary is trained so that it captures all the necessary information from the historical data, allows for a sparse representation, and is aware of the network topology. This is accomplished through a novel semi-supervised dictionary learning scheme which works even when the training data has missing entries. Numerical tests on data from the Internet2 archive corroborate the proposed algorithms.
Keywords of the presentation: graph identification, entity resolution, link prediction, collective classification
The importance of network analysis is growing across many domains, and is fundamental in understanding online social interactions, biological processes, communication, ecological, financial, transportation networks, and more. In most of these domains, the networks of interest are not directly observed, but must be inferred from noisy and incomplete data, data that was often generated for purposes other than scientific analysis. In this talk, I will introduce the problem of graph identification, the process of inferring the hidden network from noisy observational data. I will describe some of the component steps involved, and then I will describe a collective approach to graph identification, which interleaves the necessary steps in the accurate reconstruction of the network.
Joint work with Galileo Namata and Stanley Kok, University of Maryland.
In the backbone of large-scale networks, traffic flows experience abrupt
unusual changes which can result in congestion, and limit the extent
to which end-user quality of service requirements are met. Diagnosing
such traffic volume anomalies is a crucial task towards engineering
the network traffic. This is challenging however, since the available
data are the superposition of unobservable origin-to-destination (OD)
flows per link. Leveraging the low intrinsic-dimensionality of OD flows
and the sparse nature of anomalies, a convex program is formulated to
unveil anomalies across flows and time. A centralized solver is developed
using the proximal gradient algorithm, which offers provable iteration
complexity guarantees. An equivalent nonconvex but separable criterion
enables in-network processing of link-load measurements, when optimized
via the alternating-direction method of multipliers. The novel distributed
iterations entail reduced-complexity local tasks, and affordable message
passing between neighboring nodes. Interestingly, under mild conditions
the distributed algorithm approaches its centralized counterpart.
Numerical tests with synthetic and real network data corroborate
the effectiveness of the novel scheme.
Time plays an essential role in the diffusion of information,
influence and disease over networks. In many cases we only observe
when a node copies information, makes a decision or becomes infected
-- but the connectivity, transmission rates between nodes and
transmission sources are unknown. Inferring the underlying dynamics is
of outstanding interest since it enables forecasting, influencing and
retarding infections, broadly construed.
To this end, we model diffusion processes as discrete networks of
continuous temporal processes occurring at different rates. Given
cascade data -- observed infection times of nodes -- we infer the
edges of the global diffusion network and estimate the transmission
rates of each edge that best explain the observed data. The
optimization problem is convex. The model naturally (without
heuristics) imposes sparse solutions and requires no parameter tuning.
The problem decouples into a collection of independent smaller
problems, thus scaling easily to networks on the order of hundreds of
thousands of nodes. Experiments on real and synthetic data show that
our algorithm both recovers the edges of diffusion networks and
accurately estimates their transmission rates from cascade data.
Keywords of the presentation: Re-identification, Social Networks, Fraud Detection
Re-identification is the process of matching records or behaviors that belong to the same individual, sometimes when the individual is acting anonymously. The ability to re-identify individuals from their social network behavior-their interactions with others on a social network, has many real-world implications in areas such as fraud detection, online target marketing, and author attribution. We considered statistics for re-identication in social network data from three popular network models: Erdös-Rényi, Small World and Scale Free. Many researchers who have worked on these statistical physics models, while cognizant of the inherent stochasticity of the problem, have inadequately addressed statistical estimation and inference. We view re-identication, in this setting, as hypotheses tests of network similarity modulo a network data model. In this paper, we offer a formal statistical framework for re-identication, using first principles and the algorithmic specification of these models. Using our framework, we illustrate the method and its performance on three network data examples: simulations, the Enron emails, and a telecommunications dataset.
Individual genes bear little significance when isolated from their networks of interaction; hence the importance
of networks in modelling biological behaviour of gene products. We devise sophisticated
network alignment algorithms and apply them to biological networks of different species. Our algorithms uncover
huge topologically and functionally conserved regions even across species as
diverse as baker’s yeast and humans. Also, we successfully use network topology to differentiate between cancer and non-cancer
genes: we apply our topological methods to predict novel regulators of melanogenesis from protein-protein interaction (PPI)
networks of human cells and our predictions are phenotypically validated. Also, we begun to apply topology to detect homology, as well
as to determine how “dominating” aging or disease proteins are in the PPI network. Thus, network biology is an
invaluable source of biological information that undeniably has a potential to impact therapeutics and health care.
Keywords of the presentation: scalable network visualization, cancer informatics, integrative computational biology, protein-protein interactions, microRNA:gene interactions
Network visualization tools offer features enabling a variety of analyses to satisfy diverse requirements. Considering complexity and diversity of data and tasks, there is no single best layout, no single best file format or visualization tool: one size does not fit all. A useful tool needs to support interactive visual data exploration, integration of diverse data, rich annotation, query and analysis algorithms, scalability in memory and speed to large networks, user interface that accommodates multiple user types and workflows. Such integrative network analysis will support not only visualization but also visual data mining.Read More...
In this talk we will discuss how finding dense subgraphs can greatly impact our ability to understand network structure.
A major part of plant defense against pathogens consists of an inducible system, in which recognition mechanisms detect pathogens, multiple signaling pathways transduce information, and plant defense is activated. In the model plant, Arabidopsis thaliana, there are three major input signals that activate four critical pathways to mount a defense to a pathogen. These pathways interact in a sophisticated manner, forming a complex signaling network. Our collaborators in Fumiaki Katagiri’s lab (U. of MN, Plant Biology) collected quantitative measurements of both gene expression at multiple time points as well as plant resistance to two different bacterial pathogens in several well-defined genetic mutants with specific defects in these pathways. In total, these data cover all sixteen possible combinatorial perturbations (gene deletions) of the four pathways of interest with each of the three possible input signals (treatments) for both pathogen infections. Based on these data, we successfully built a predictive model of the plant immune system that accurately reflects systematic responses to genetic perturbation and dynamic expression of its components.
We present an interactive Web-based tool for the exploration of the collaborations of jazz musicians. The website allows users to see who collaborated with a particular musician during any desired time range and allows users to interactively follow chains of collaborations. The display positions collaborators according an influence function that depends both the number of recorded collaborations and the time intervals between them. The user can dynamically change the timescale and range on which this influence is calculated.
Keywords of the presentation: network evolution; biological networks; protein interactions
Many questions about present-day interaction networks could be answered by tracking how the network changed over time. We present a suite of algorithms to uncover an approximate node-by-node and edge-by-edge history of changes of a network when given one or more present-day networks. The algorithms use either a plausible probabilistic growth model or an assumption of parsimony. From the reconstructed history, we model the arrival time of extant and ancestral interactions and find that complexes have significantly re-wired over time and that new edges tend to form within existing complexes. We also hypothesize a distribution of per-protein duplication rates, track the change of the network's clustering coefficient, and predict paralogous relationships between extant proteins that are likely to be complementary to the relationships inferred using sequence alone. Finally, we infer plausible parameters for the model, thereby predicting the relative probability of various evolutionary events.Read More...
Joint work with Saket Navlakha, Rob Patro, Emre Sefer, Justin Malin, and Guillaume Marçais.
Physiological and evolutionary properties of genes have been shown to correlate with gene connectivity in the genetic interaction network of Saccharomyces cerevisiae. We extend this observation by showing that the same predictive relationships exist in Schizosaccharomyces pombe, an evolutionarily distant fission yeast. For each of these two yeast species, we build models that successfully predict genetic interaction degree from the gene properties, allowing the identification of expected highly connected genes. Consistency of rules in distant species allows a model trained on one of the yeast species and applied to the other to make predictions with accuracy as high as that of within-species predictions. We compare predicted genetic interaction degrees of S. pombe genes to the degrees of orthologous S. cerevisiae genes. Our approach demonstrates that knowledge of the widely-studied species S. cerevisiae can be used to direct the investigation of genetic interactions in S. pombe and even other distantly related species.
Keywords of the presentation: Network; Sampling Bias; Prediction; Detection
The set of tools for thinking hard about sampling and measurement-level aspects of scientific studies is among the earliest areas of statistics to
be worked out. However, it arguably is not among the sexier topic areas. In network analysis, perhaps partly as a result of this fact, it is not infrequent that we find ourselves inclined to quickly move past the initial sampling and measurement aspects of the problem (i.e, so-called low-level tasks) in our rush to do things like prediction, detection, and the like (i.e., so-called high-level tasks). In this talk I will present results from work in our group on three problems, one each from social, communication, and biological network analysis. In each case, the goal will be to demonstrate how high-level tasks in network analysis can be fundamentally affected by how faithfully we account for the low-level sampling and/or measurement aspects of the science at hand.
Keywords of the presentation: systems biology, protein interaction networks, information theory
In the study of biological systems, many phenotypes, i.e., observed
differences between the behavior of different organisms, are complex -
that is, they are based on a set of complex interactions between
multiple genetic and environmental factors. Genomic association
studies provide significant information on the genetic bases of
complex phenotypes, however so far have defined a limited set of
phenotypes. Large scale monitoring of gene expression reveals
underlying cellular mechanisms, however, it is limited in capturing the
abundance and activity of functional proteins. Protein expression data
provides more reliable information on function, but with limited
coverage. Protein-protein interactions (PPI) highlight functional
relationships between proteins, but PPI data is highly noisy,
incomplete, and static. In this talk, we investigate how these useful,
yet limited sources of biological data can be integrated through
innovative use of computational approaches to gain insights on the
molecular mechanisms of complex phenotypes. We introduce several
algorithmic problems that stem from these applications, including (i)
network-based prioritization of candidate disease genes, (ii)
integration of protein and gene expression data to identify
dysregulated subnetworks in cancer, and (iii) use of subnetwork
markers in prediction of metastasis. Experimental results on various
public and dedicated datasets show that such integrative approaches
can provide significant insights into the systems biology of complex
More and more social interactions are migrating online. In this talk we
study patterns of such interactions in social applications including
email, newsgroups, instant messaging, and micro-blogging. We address the
data mining, modeling, algorithmic challenges that arise as part of
Keywords of the presentation: network community detection, graph clustering, graph partitioning
Networks are a powerful way to describe and represent social, technological and biological systems. Such systems can all be studied as graphs, where nodes represent entities (e.g., people, web sites) and edges represent interactions (friendships, communication, links). Nodes in networks organize into groups that are commonly referred to as communities, clusters or modules. This is arguably the most important and useful resolution for studying networks as there are many reasons why networks organize into communities. For example, society is organized into groups, families, friendship circles, villages and associations. On the World Wide Web, topically related pages may link more densely among themselves. Therefore, the task of network community detection is to discover such organizational units given an unlabeled network.
Understanding of how network communities are organized in networks has evolved over time. The traditional view of communities was that edges appear with high concentration within the community, and low concentration between these communities. Recently models of overlapping communities, i.e., nodes
can belong to more than one community, have been proposed. The current view of how communities overlap assumes that community overlaps are less densely connected than the groups themselves, i.e., the the red nodes in the overlap are less likely to be connected than pairs of nodes that belong to the same single
Our work directly challenges these assumptions. We study a set of diverse networks where nodes explicitly state their community memberships. We discover that community overlaps are in fact denser than the non-overlapping parts, which is in sharp contrast with the present models of community overlaps. Based
on these observations we develop the model which reliably captures the structure and overlaps of communities in networks. We then develop a model based community detection method and evaluate it on six large networks. Our experiments demonstrate that the proposed community detection method outperforms the
present state of the art methods.
Keywords of the presentation: social networks, information diffusion
Although the continuous circulation of information, news, jokes,
and opinions is ubiquitous in the worldwide social network, the
actual mechanics of how any single piece of information spreads
on a global scale have largely remained mysterious. A major
challenge lies in the difficulty of acquisition of large-scale
data recording the diffusion of any particular single piece of
information. This talk will focus on recent work tracing such
processes through online traces of information spreading. I will
survey a variety of computational (and not so computational)
efforts to trace diffusion through networks. I'll then focus
recent joint work with Jon Kleinberg and and Flavio Chierichetti
on the reconstruction of the propagation of massively circulated
Internet chain letters, like those opposing the start of the Iraq
war, Women Against Sarah Palin, protests of funding cuts for
NPR/PBS, among others. The propagation of these petitions does
not fan out rapidly in the style of a small-world epidemic, but
rather produces a narrow but very deep tree-like pattern. This
observation suggests a new and more complex picture for the
spread of information through a social network.
Joint work with Jon Kleinberg and and Flavio Chierichetti.
Recent research has revealed many remarkably robust structural
properties of social networks: triadic closure, heavy-tailed degree
distributions, and small-world phenomena, to name just a few. Of
course, the ``why'' of these properties has generally been more
elusive. In this talk, I will present some results from a recent
collaboration with evolutionary psychologists and computer scientists
on questions of how people choose friends and prioritize among those
friends. Specifically, I will describe analysis of large sample of
MySpace profiles containing ``Top Friends'' lists, in which an
individual selects a small subset of his or her friends and organizes
them into a ranked order of that individual's choice. Different
classes of behavioral hypotheses give rise to very different
graph-theoretic structures in the best-friend network, and we can use
these ranking data to provide supporting evidence for some of these
Joint work with Peter DeScioli, Robert Kurzban, and Elizabeth Koch.
Ant colonies are known for their complex and efficient social organization that completely lacks hierarchical structure. However, little is known about how workers within a colony organize themselves. In this study, we tracked the position and orientation of all individually marked ants in six colonies for 41 days. We inferred interactions between ants by modeling each ant as a trapezoid and testing whether the head end on one ant was in the trapezoid of the other. For each day and each colony, we built a network by cumulating all interactions. To test whether the network had a community structure, we assigned each ant to a cluster with the infomap algorithm , and then derived the group membership for each ant. Workers were organized in three distinct but interconnected groups that persist over at least 10 days. Each group specialized in a different behavioral task: nursing, nest patrolling and rubbish collection, foraging. Spatial heatmaps further showed that each group occupied a specific area in the nest, and that workers of the same groups interacted preferentially in their group specific area, whereas workers of different groups interacted elsewhere in the nest. Analysis of the evolution of worker group membership showed that workers have a preferred behavioral trajectory, moving from nursing to nest patrolling, and from nest patrolling to foraging. Our results show that workers in ant colonies are organized in functional groups each of which specializes in a given task, which suggests that division of labor is embodied in the social organization of a colony. We further show that groups occupy different spatial positions suggesting that spatial structure is an important component in colony organization. Together these results emphasize that ant colonies rely on strong underlying organizational principles, and that the social structure is a relevant and defining characteristic of ant colonies.
 Rosvall M. & Bergstrom C.T. (2008) Maps of randoms walks on complex networks reveal community structure. PNAS 105: 1118-1123
Keywords of the presentation: Computational and systems biology, complex networks, protein-protein interaction networks, network topology, protein function prediction, disease gene and drug target identification
Prediction of protein function and the role of protein pathways in disease from protein-protein interaction (PPI) networks have received much attention in the post-genomic era. Given a limited functional knowledge about some proteins in a network, the challenge of computational and systems biology is to functionally characterize other proteins based on their network connectivity information. For this, an efficient method for capturing proteins' complex interaction patterns in the network is needed. We devise such a mathematically rigorous method that summarizes complex wirings around a protein in the network and compares the topological similarity of proteins' extended network neighborhoods. We use this method to predict protein function and imply involvement of genes (i.e., their protein products) in cancer. Specifically, using many clustering approaches, we group together "topologically similar" proteins in the human PPI network. The resulting clusters are enriched in biological function: the more topologically similar the extended network neighborhoods of proteins, the more likely the proteins are to perform a same biological function. The resulting clusters are also enriched in cancer genes: cancer and non-cancer genes have different network topological signatures. Also, we address an open debate about whether genes involved in key biological processes exhibit some "topological centrality" compared to the rest of the proteins in the human PPI network, since changes in their activity are likely to affect the activity of many other proteins. Indeed, we find that genes involved in aging, cancer, infectious diseases, or signaling and drug-targeted pathways occupy topologically complex and dense regions of the network and correspond to its "spine" that "dominates" other genes in the network, i.e., connects all other network parts, and can thus pass cellular signals efficiently throughout the network. Hence, network topology is an invaluable source of biological information that can suggest novel drug targets for therapeutics intervention.Read More...
Recent developments in experimental technology have enabled the rapid construction of combinatorial genetic perturbations in a number of different model organism. Interestingly, certain combinations of harmless single perturbations can result in dramatic phenotypes, suggesting built-in network redundancy is a ubiquitous property of cellular systems. While the interactions derived from these large-scale screens have proven to be highly informative for mapping gene function, many challenges remain in the specific interpretation of genetic interactions and their relation to other genomic features.
In this talk, I will describe our recent efforts to understand the results of large-scale genetic interaction screens in the context of the model organism yeast. In collaboration with a yeast genetics lab, we have measured quantitative phenotypes for millions of double deletion mutants. I will address the general question of how we can learn systems-level biology from these experiments and demonstrate their utility for characterizing global cellular function and organization. Finally, I will highlight several open problems in the interpretation of genetic interaction networks and discuss where innovations in machine learning and data mining are particularly relevant.
Recently there has been a growing interest in representing data from complex systems as graphs and analyzing the network structure to understand key patterns/dependencies in the underlying system. This has fueled a large body of research on both models of network structure and algorithms to automatically discover patterns (e.g., communities) in the structure. However, robust statistical models, which can accurately represent distributions over graph populations, are critical to to accurately assess the significance of discovered patterns or to distinguish between alternative models. Specifically, since sampling distributions (either analytical or empirical) can be used to determine the likelihood of a given sample, statistical models facilitate hypothesis testing and anomaly detection (e.g., graphs with low likelihood can be flagged as anomalous). However, unlike metric spaces the space of graphs exhibits a combinatorial structure which poses significant theoretical and prac
tical challenges which need to be overcome for accurate estimation and efficient inference. In this talk, I will discuss recent work in which we investigated the distributional properties of state-of-the art generative models and sampling algorithms for graphs, and discuss the implications of the findings for network hypothesis testing and anomaly detection.
Keywords of the presentation: systems biology, protein interactions, metabolic networks, gene regulation, evolution, phylogenetics, statistical inference
The field of systems biology has emerged from a confluence of technological advances (DNA sequencing, gene expression profiling, proteomics, metabolomics etc.) and a “systems-level” understanding of biological processes, supported by network theory. Network models are essential in the interpretation of experimental results and, increasingly, in predicting the behaviour of cellular systems subject to experimental perturbations.
One of the most challenging aspects of theoretical systems biology lies in the accurate reconstruction of biological networks. Since biological data is strongly affected by noisy measurements and incomplete and biased sampling, it can be difficult to obtain networks of sufficient accuracy to support predictive modelling.
Fortunately, our understanding of the genetic mechanisms responsible for network change during evolution can provide a means to integrate the observed data from multiple species. I am interested in exploring the evolutionary history of these networks and studying the processes that have shaped their structures. In addition, the same inference methods can be used to improve network reconstructions for present-day species, contributing to better models for systems biology applications.
Systems biology is often concerned with the representation of cellular interactions and their associated biological systems as networks. Substantial amounts of high quality interaction data are now available, obtained from large-scale ("high-throughput") experiments or curated from multiple small-scale experiments published in the scientific literature.
Network analyses rely upon accurate data to draw robust and meaningful conclusions: one simple example is the relationship between the degree of a protein and its physical, functional or evolutionary properties. Given the targeted nature of the majority of the available interaction data, it is clearly feasible that biases in these networks may result in erroneous conclusions. Indeed, protein interaction data sets from different sources return seemingly statistically significant but mutually incompatible results for many hypothesis tests.
To compensate for this effect, we have devised a novel method to evaluate and correct for ascertainment biases arising from various sources. Application of this method to multiple protein interaction networks leads to an increase in agreement between data sources and hence more robust biological conclusions. We envisage this approach being of use in a wide range of network analyses where bias is known to be a confounding factor.
The intricate structure of many large-scale networked
systems has attracted the attention of the scientific
community, leading to many results attempting to explain
the relationship between the structure of the network and node dynamics.
A common approach to study the structure is to use synthetic network models in which structural properties of interest, such as degree distributions,
are prescribed. Although very common, this approach can have a major flaw: Synthetic network models implicitly induce many structural properties that are not directly controlled and can be relevant to the network performance. Therefore, it is difficult to draw conclusions about the role of a particular structural property in a real network using synthetic models. In this
talk, I will present an alternative view where instead of prescribed random models, I use algebraic graph theory and convex optimization to study how structural properties constrain those global network invariants that effects dynamics and performance(e.g. eigenvalues of adjacency and Laplacian matrices). Using examples such as dynamics of epidemic spreading on large social networks and synchronization on the power grid as examples, I will discuss examples we will see when "macroscopic" local measurements such as degree distributions and clustering constrain the eigenvalues and when they do not.Read More...
joint work with Victor Preciado
Keywords of the presentation: biological networks, graph algorithms, network alignment
Sequence-based computational approaches have revolutionized biological understanding. However, they can fail to explain some biological phenomena. Since proteins aggregate to perform a function instead of acting in isolation, the connectivity of a protein interaction network (PIN) will provide additional insight into the inner working on the cell, over and above sequences of individual proteins. We argue that sequence and network topology give insights into complementary slices of biological information, which sometimes corroborate each other, but sometimes do not. Hence, the advancement depends on the development of sophisticated graph-theoretic methods for extracting biological knowledge purely from network topology before being integrated with other types of biological data (e.g., sequence). However, dealing with large networks is non-trivial, since many graph-theoretic problems are computationally intractable, so heuristic algorithms are sought.
Analogous to sequence alignments, alignments of biological networks will likely impact biomedical understanding. We introduce a family of topology-based network alignment (NA) algorithms, that we call GRAAL algorithms, which produces by far the most complete alignments of biological networks to date: our alignment of yeast and human PINs demonstrates that even distant species share a surprising amount of PIN topology. We show that both species phylogeny and protein function can be extracted from our topological NA. Furtermore, we demonstrate that the NA quality improves with integration of additional data sources (including sequence) into the alignment algorithm: surprisingly, 77.7% of proteins in the baker’s yeast PIN participate in a connected subnetwork that is fully contained in the human PIN suggesting broad similarities in internal cellular wiring across all life on Earth. Also, we demonstrate that topology around cancer and non-cancer genes is different and when integrated with functional genomics data, it successfully predicts new cancer genes in melanogenesis-related pathways.
Keywords of the presentation: genotype-phenotype association, biological networks, netwoks dys-regulation in diseases
New experimental techniques facilitating genome-wide measurements of various molecular quantities provide us with an unprecedented opportunity to gain new insights into functioning of cellular systems and help explaining the relation between genotype and phenotype. In particular, we would like to understand how the genotypic changes are propagated along molecular pathways.
On one hand, in complex diseases, different genotypic perturbations often lead to the same disease phenotype presumably dys-regulating the same pathways of the cellular system. On the other hand, there are numerous examples where the impact of a large genotypic change such as gene copy number variations appears to be “buffered” and has no apparent phenotypic effect.
I will discuss systems level approaches, combining various types of experimental data, statistical data analysis, and new algorithmic techniques that we developed in the quest to address these questions.
We consider the problem of monitoring, tracking and predicting network-wide path performance metrics such as end-to-end delay and loss rates in IP networks. Classical prediction tools do not take full advantage of the inherent spatio-temporal correlations that performance metrics manifest in practice. Tapping into the spatio-temporal kriging theory, we put forth a dynamic network kriging approach with real-time network-wide prediction capabilities based on measurements acquired for a small subset of network paths. Going well beyond state-of-the-art methods, the proposed model captures not only spatio-temporal correlations but also unmodeled dynamics due to, e.g., congested routers. The framework also enables choosing optimal measurement locations in an online fashion.
Keywords of the presentation: genetic interaction, drug interaction, synergy, antagonism
Sometimes two mutations together cause a phenotype that is surprising given the effects of individual mutations. This phenomenon defines genetic interaction. Similarly, one drug can synergize with or antagonize the effects of another. I will describe early efforts at systematically mapping the network of interactions between antifungal chemical agents.
Keywords of the presentation: network dynamics, modeling, prediction, terrorism
Despite domain differences, cancer researchers and insurgency analysts share real-world constraints in common: (1) the systems they seek to understand elude accurate and complete measurement and (2) the mechanisms that influence cancer and insurgent behavior are not well-understood. In spite of these challenges, both communities must devise strategies for containing, managing, and disrupting their respective systems.
Models of biochemical networks are constructed from and trained on experimental data. Because the quality of the experimental data used can significantly influence the quality the model obtained, practical issues such as equipment measurement limitations, incompletely characterized off-target effects of reagents, as well as the inherent variability of living systems can hurt efforts to build accurate models. In response to this fact, we have developed methods for building mathematical and computational models of cellular network dynamics by using the trends, rather than the exact measurements contained within experimental results as training data. Our methods can use sparse and noisy data to build models with 85% accuracy, beating all other known methods.
Models of the effect of counter-insurgency operations on violent non-state actors are often modeled as causal networks in which the variables take on qualitative values (e.g., "high", "strong", and "not present"). We have had success in applying our methods developed for biochemical networks to building predictive models of the effect of counter-insurgency strategies.
In this presentation, we will discuss the qualitative modeling methods we developed in both the biological and social contexts. We will show how they have been successfully applied to predicting the effect of perturbations to cancer cell signaling networks as well as to assessing how different counter-insurgency strategies may affect the long-term stability of Afghanistan.
Keywords of the presentation: systems biology, protein interactions, subnetwork inference
In recent years, there is a tremendous growth in large scale data on human proteins, their interactions, and their relations to diseases. These allow for the first time a systems-level analysis of the molecular basis of disease. In my talk I will describe several recent works in this direction, aiming to uncover novel disease proteins and their underlying pathways with implications to diagnosis and therapy.
Proteins accomplish virtually all of their cellular functions via interactions with other molecules. High-throughput experimental technologies, along with computational
predictions, have resulted in large-scale protein interaction networks for numerous organisms. What can we learn about biological systems from these networks? In my talk, I will discuss a number of approaches we and others have developed for analyzing these networks in order to uncover cellular organization principles, and to reveal protein functions and pathways. I will focus on the increasing need to perform "attribute-driven" analysis, where systems-level measurements and knowledge about proteins and interactions are incorporated into computational methodologies.
Group size is a fundamental determinant of the organization and functioning of many collective processes in humans, animals and in artificial systems. Human companies, animal societies and computer networks thus face similar challenges when they fluctuate in size, to maintain effective functioning and ensure ongoing performance of the system, i.e. to sustain system stability ("homeostasis"). In certain ant species, colonies undergo an annual cycle of fragmentation into several nests in early spring ("fission"), followed by coalescence into a single nest at the end of summer ("fusion"), a phenomenon known as “seasonal polydomy”. Using a cutting-edge system that allows the automated tracking of several hundreds of individuals simultaneously for extensive periods of time, we propose to investigate how abrupt variation in nest size caused by fission and fusion events affects social organization in the seasonally polydomous ant Camponotus kiusiuensis. We will focus especially on division of labor among workers and on the structure, functioning and resilience of interaction networks in the colony. We also aim to identify the behavioral strategies allowing the maintenance of homeostasis at both the individual and the collective levels. We present an overview of the project and the main expected challenges.
Social networks generate a large amount of text content over
time because of continuous interaction between participants. The
mining of such social streams is more challenging than traditional
text streams, because of the presence of both text content and
implicit network structure within the stream. The problem of event
detection is also closely related to clustering, because the events
can only be inferred from aggregate trend changes in the stream. In
this paper, we will study the two related problems of clustering and
event detection in social streams. We will study both the supervised
and unsupervised case for the event detection problem. We present
experimental results illustrating the effectiveness of incorporating
network structure in event discovery over purely content-based
Keywords of the presentation: Information visualization, Web, Blog, Twitter
The growth of the Web and social media allows us to observe real world events almost in real time. We are building a system for analyzing the societal behavior based on Web archives. Our Web archive consists of 20 billion URLs crawled for 11 years together with blog and microblog datasets. In this talk, I will show some case studies based on visual analytics of dynamic networks, including the information diffusion in microblogs under the 3.11 earthquake, and the structure of search engine spams.
Duplicate genes have been shown to exhibit
fewer genetic interactions (digenic) than
expected. This may be due to pair level
buffering effects. Essentially, one copy of the
pair can substitute for the other as a backup
when the other is deleted. Any function that is
buffered in this way will not show any genetic
interactions unless we knock out the backup
as well. To this end we have constructed SGA
query strains which are missing not one but
two genes. Gene pairs selected are paralogs
(duplicate copies) primarily from the Whole
Genome Duplication even thought to have
taken place in yeast ~ 100 MYA.
Keywords of the presentation: cancer genomics; pathways; diffusion model; false discovery rate;
Cancer is a disease that is driven by somatic mutations that accumulate in the genome during an individual’s lifetime. Recent advances in DNA sequencing technology are enabling genome-wide measurements of these mutations in numerous cancers. A major challenge in analyzing this data is to distinguish functional mutations that drive cancer progression from “passenger” mutations not related to the disease. Recent cancer sequencing studies have shown that somatic mutations are distributed over a large number of genes. This mutational heterogeneity is due in part to the fact that somatic mutations target cellular signaling and regulatory pathways, and that a mutation in dozens of possible genes might be sufficient to perturb a pathway. While some of these pathways are well characterized, many others are only approximately known. Standard pathways analysis methods test the enrichment in somatic mutations of know pathways, reducing the ability to identify novel sets of mutated genes or crosstalk between mutated pathways.
I will describe HotNet, an algorithm to identify subnetworks of an interaction network that are mutated in a significant number of cancer genomes. HotNet models mutations as heat sources and employs a heat diffusion process on the interaction network to find “hot subnetworks.” The heat diffusion process is efficiently computed using the Laplacian matrix of the network, and considers both the frequency of mutation of the genes and the topology of the network. We also derive a statistical test to rigorously assess whether the number of hot subnetworks is significant under a suitable null hypothesis, and to bound the false discovery rate (FDR) of the hot subnetworks identified by HotNet. I will illustrate applications of these algorithms to data from The Cancer Genome Atlas, a project that is characterizing the genomes of thousands of samples from dozens of cancer types.
Models based on mass action kinetics are widely used but, in a strict sense, limited to biochemical reactions in dilute solution, where reactants freely diffuse and react in an unobstructed space. Modeling diffusion-reaction kinetics in a crowded environment, such as the cytoplasm, requires fractal-like ordinary differential equation (ODE) models. In particular, generalized mass action systems have been proposed and successfully validated for this purpose. In this work, we establish two novel, particle-based methods to simulate biochemical diffusion-reaction systems within crowded environments. We distinguish two conceptually different situations.
Keywords of the presentation: big data, data quality, measurements, network models, Internet connectivity
Network science prides itself on the fact that many of the new ideas that
have contributed to its enormous popularity are based on (big) data and meticulous observations. Unfortunately, an inconvenient truth about most
large-scale, real-world, and highly-engineered or highly-evolved systems is
that the many things we can and do measure about these systems are generally
not the quantities we want to measure. Largely unwilling to deal with this
inconvenient truth, network science has more or less succeeded in making this discrepancy a non-issue by labeling scientific tasks such as assessing data quality or ensuring data hygiene as unnecessary, small-minded, and even
counter-productive for scientific discovery.
The Internet is a prime example of a large-scale and highly-engineered
real-world system where the available observations are everything but meticulous
and where ignoring this inconvenient truth has resulted in new models, theories,
and predictions that - despite their general appeal and headline-grabbing
nature - have nothing to do with reality and quickly collapse when scrutinized with carefully vetted measurements and readily available domain knowledge. Using a number of widely-used and easily-available datasets of different
types of Internet connectivity measurements, I will illustrate in this talk
what it means to get to know your data (i.e., assessing its quality) and to develop a network science that is serious about big data and adamant about
its proper use for scientific discovery.
Internet traffic matrices allow for a compact description of the traffic
that traverses the Internet at different levels of abstractions. For example, the
intra-domain traffic matrix of a large ISP like AT&T describes the hourly or daily
traffic volumes that are exchanged between its ingress- and egress PoPs and
traverse a physical infrastructure that is relatively stable. The problem of inferring
this traffic matrix has morphed from an active research topic with important
practical applications into a non-problem -- many of today's networks are now
instrumented to measure the traffic matrix directly, without any need for inference
or estimation. However, a new and more challenging type of traffic matrix has
recently emerged - the Internet's inter-domain traffic matrix that describes the
hourly or daily traffic volumes exchanged between the various networks that
define the Internet as a "network of networks." In this talk, I will discuss some
of the challenges that arise when working with this matrix including the problem
formulation itself, potential solution methods, and the all-important validation
Models based on mass action kinetics are widely used but, in a strict sense, limited to biochemical reactions in dilute solution, where reactants freely diffuse and react in an unobstructed space. Modeling diffusion-reaction kinetics in a crowded environment, such as the cytoplasm, requires fractal-like ordinary differential equation (ODE) models. In particular, generalized mass action systems have been proposed and successfully validated for this purpose. In this paper, we establish two novel, particle-based methods to simulate biochemical diffusion-reaction systems within crowded environments. We distinguish two conceptually different situations. In the first, the ODEs capture a microscopic “reaction-only” mechanism, while diffusion is modeled separately. In the second case, the ODEs model the combined effects of both reaction and diffusion. This distinction consequently leads to two simulation methods that both effectively simulate and quantify crowding effects, including reduced reaction volumes, reduced diffusion rates, and reduced accessibility between potentially reacting particles. The proposed methods account for fractal-like kinetics, where the reaction rate depends on the local concentrations of the molecules undergoing the reaction. Rooted in an agent based modeling framework, this aspect of the methods offers the capacity to address sophisticated intracellular spatial effects, such as macromolecular crowding.