May 7 - 11, 2012
Keywords of the presentation: Peer Effects, Causal Inference, Identification, Econometrics, Experimental Methods
Many of us are interested in whether "networks matter." Whether in the
spread of disease, the diffusion of information, the propagation of
social contagions, the effectiveness of viral marketing, or the
magnitude of peer effects in a variety of settings, a key problem is
understanding whether and when the statistical relationships we
observe can be interpreted causally. Sinan will review what we know
and where research might go with respect to identifying causal peer
influence in social networks and the importance of causal inference
for policy. He will provide three examples from large-scale
observational and experimental studies in online social media.
The literature in biology is vast and rich with valuable empirical, heuristic and theoretical information. Systematic organization and knowledge mining of research articles play a major role in helping with large-scale project formulation, making groundbreaking discoveries and forming novel hypotheses.
The following is a preliminary report on the development of “Big Data” analysis tools tailored for systems biology, such as dynamic models of -omic networks and systems-level perturbation of pathways. For example, we demonstrate the utility of Natural Language Processing (NLP) in discovering hidden and implicit correlations among pairs of genes in massive gene expression data for diurnal and circadian rhythms in wild-type Arabidopsis thaliana (provided by the Chory Lab). A theory of Collective Intelligence in Plant Biology (PhytoCognito) is under development; it synthesizes cognitive, computational, and informatics approaches to sustain collaborative, user-centered efforts that build on past successes toward future scientific breakthroughs. One of the most important goals of genomic research is to extract functional information from gene expression time series data. Thanks to DNA microarray development, mRNA sampling of thousands of genes is now possible using a single chip. This technology has made it possible to measure gene expression of the whole genome repeatedly, to explore an organism's response to a change in condition, e.g., application of a drug or other treatment.
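As a small illustration of the kind of gene co-expression mining mentioned above, the sketch below scores pairwise Pearson correlation between gene expression time series. The gene names and expression values are made up for illustration and are not drawn from the Arabidopsis dataset:

```python
# Sketch: flag gene pairs whose expression time series are strongly
# correlated, in the spirit of mining co-expressed pairs from
# diurnal/circadian microarray data. All names and values are toy data.
import numpy as np

def correlated_pairs(expression, genes, threshold=0.9):
    """Return gene pairs whose time-series Pearson correlation
    exceeds `threshold` in absolute value."""
    corr = np.corrcoef(expression)  # genes x genes correlation matrix
    pairs = []
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((genes[i], genes[j], round(float(corr[i, j]), 3)))
    return pairs

# Toy expression matrix: rows = genes, columns = time points.
genes = ["gA", "gB", "gC"]
expression = np.array([
    [1.0, 2.0, 3.0, 4.0],   # gA rises steadily
    [2.1, 4.0, 6.2, 7.9],   # gB tracks gA closely
    [5.0, 1.0, 4.0, 2.0],   # gC varies independently
])
print(correlated_pairs(expression, genes))
```

With real microarray data the same loop would run over thousands of genes, at which point vectorizing the pair scan becomes worthwhile.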
Keywords of the presentation: Computational biology, social network analysis, image processing, machine learning
Computation has fundamentally changed the way we study nature.
Recent advances in data collection technology, such as GPS and other
mobile sensors, high definition cameras, satellite images, and
genotyping, are giving biologists access to data about the natural
world which are orders of magnitude richer than any previously
collected. Such data offer the promise of answering some of the big
questions about why animals do what they do, among other things.
Unfortunately, in this domain, our ability to analyze data lags
substantially behind our ability to collect it. In this talk I will show
how computational approaches can be part of every stage of the
scientific process of understanding animal sociality, from data
collection (identifying individual zebras from photographs by stripes) to
hypothesis formulation (by designing a novel computational framework for
analysis of dynamic social networks).
The study of the spread of influence through a social network has a long history in the social sciences. The first studies focused on the adoption of medical and agricultural innovations; later, marketing researchers investigated the "word-of-mouth" diffusion process as an important mechanism by which information can reach large populations, possibly influencing public opinion, driving new-product market share, and building brand awareness. Recently, thanks to the success of online social networks and microblogging platforms such as Facebook and Twitter, the phenomenon of influence exerted by users of an online social network on other users, and how it propagates in the network, has attracted the interest of computer scientists and IT specialists.
One of the key problems in this area is the identification of influential users, by targeting whom certain desirable outcomes can be achieved. Here, targeting could mean giving free (or price-discounted) samples of a product, and the desired outcome may be to get as many customers as possible to buy the product.
In this talk we take a data mining perspective and discuss what can be learned (and how) from available traces of past propagations. Along the way we provide a brief survey of recent progress in this area and discuss open problems.
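One canonical formulation of the influential-user problem is influence maximization under the independent cascade model: pick k seed users to maximize expected spread. A minimal greedy sketch follows; the edge probability p, the toy graph, and the Monte Carlo settings are illustrative assumptions, not details from the talk:

```python
# Sketch: greedy seed selection for influence maximization under the
# independent cascade (IC) model, with Monte Carlo spread estimates.
import random

def simulate_ic(graph, seeds, p=0.2, rng=random):
    """One independent-cascade run; returns the number of activated nodes."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def greedy_seeds(graph, k, p=0.2, runs=200, seed=0):
    """Greedily add the seed with the largest estimated marginal spread."""
    rng = random.Random(seed)
    chosen = []
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    for _ in range(k):
        best, best_spread = None, -1.0
        for cand in nodes - set(chosen):
            spread = sum(simulate_ic(graph, chosen + [cand], p, rng)
                         for _ in range(runs)) / runs
            if spread > best_spread:
                best, best_spread = cand, spread
        chosen.append(best)
    return chosen
```

Learning the edge probabilities themselves from propagation traces is exactly the kind of problem the data mining perspective above addresses.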
Keywords of the presentation: Human mobility patterns, Call Detail Records
An improved understanding of human mobility patterns can help answer key questions in fields as varied as mobile computing, urban planning, ecology, and epidemiology. Cellular telephone networks can shed light on human movements cheaply, frequently, and on a large scale. We have developed techniques for analyzing anonymous cellphone locations to explore how large populations move in metropolitan areas such as Los Angeles and New York. Our results include measures of how far people travel each day, estimates of carbon footprints due to home-to-work commutes, density maps of the residential areas that contribute workers to a city, and relative volumes of traffic on commuting routes. We have validated our approach through comparisons against ground truth from volunteers and against independent sources such as the US Census Bureau. Throughout our work, we have taken measures to preserve individual privacy. This talk presents an overview of our methodologies and findings.
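A basic building block for mobility measures such as daily travel distance is summing great-circle hops over a day's location fixes. The sketch below assumes fixes arrive as time-ordered (latitude, longitude) pairs; it is a generic computation, not the authors' full pipeline:

```python
# Sketch: daily travel distance from time-ordered location fixes,
# using the haversine great-circle distance.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def daily_travel_km(fixes):
    """Sum of hop distances over a day's time-ordered (lat, lon) fixes."""
    return sum(haversine_km(*a, *b) for a, b in zip(fixes, fixes[1:]))
```

Real call detail records are noisy (tower-level resolution, oscillation between towers), so production estimates would filter fixes before summing.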
Google's PageRank eigenvector is sparse in the sense that most elements are extremely small. Basis Pursuit De-Noising is a reasonable tool for finding the tiny proportion of significant nonzeros.
This is joint work with Michael Saunders.
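For context, the dense PageRank vector that a sparse solver would post-process can be sketched with plain power iteration. This is standard background, not the authors' BPDN method; the damping factor and adjacency lists are illustrative:

```python
# Sketch: PageRank by power iteration on a column-stochastic link matrix.
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10):
    """PageRank vector for a graph given as adjacency lists
    (adj[j] = pages that page j links to)."""
    n = len(adj)
    M = np.zeros((n, n))
    for j, outlinks in enumerate(adj):
        if outlinks:
            for i in outlinks:
                M[i, j] = 1.0 / len(outlinks)
        else:
            M[:, j] = 1.0 / n  # dangling node: teleport uniformly
    x = np.full(n, 1.0 / n)
    while True:
        x_new = damping * (M @ x) + (1 - damping) / n
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
```

On web-scale graphs most entries of the resulting vector are near the (1 - damping)/n floor, which is the sparsity the talk's BPDN approach exploits.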
In many online advertising campaigns, multiple vendors, publishers or search engines (herein called channels) are contracted to serve advertisements to internet users on behalf of a client seeking specific types of conversion. In such campaigns, individual users are often served advertisements by more than one channel. The process of assigning conversion credit to the various channels is called "attribution," and is a subject of intense interest in the industry. This work presents a causally motivated methodology for conversion attribution in online advertising campaigns. We first propose a need for the standardization of attribution measurement and offer four principles upon which standardization may be based. Stemming from these standards, we offer an attribution solution that generalizes prior attribution work in cooperative game theory and recasts the prior work through the lens of a causal framework. We argue that in cases where causal assumptions are violated, our solution can be interpreted as a variable (or channel) importance measure. Finally, we present a practical solution towards managing the potential complexity of the generalized attribution methodology, and show examples of attribution measurement
on several online advertising campaign data sets.
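The cooperative game theory work that this methodology generalizes assigns each channel its Shapley value: its marginal contribution to conversion value, averaged over all orderings of the channels. A brute-force sketch, where the coalition value function is a made-up example rather than anything estimated from campaign data:

```python
# Sketch: exact Shapley-value attribution over a handful of channels.
import math
from itertools import permutations

def shapley_attribution(channels, value):
    """Average each channel's marginal contribution to the coalition
    value over all channel orderings. Exponential cost, so this is
    only feasible for a small number of channels."""
    phi = {c: 0.0 for c in channels}
    for order in permutations(channels):
        seen = set()
        for c in order:
            phi[c] += value(seen | {c}) - value(seen)
            seen.add(c)
    n_fact = math.factorial(len(channels))
    return {c: v / n_fact for c, v in phi.items()}

# Hypothetical value function: the conversion (value 1.0) happens only
# if the "search" channel touched the user.
credit = shapley_attribution(
    ["search", "display", "social"],
    lambda coalition: 1.0 if "search" in coalition else 0.0,
)
```

Managing the combinatorial cost of this computation at campaign scale is one motivation for the practical solution mentioned in the abstract.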
Keywords of the presentation: peer effects, social influence, causal inference, encouragement designs, experimentation
Peer effects can produce clustering of behavior in social networks, but so can homophily and common external causes. For observational studies, adjustment and matching estimators for peer effects require often-implausible assumptions, yet it is only rarely possible to conduct appropriate direct experiments to study peer influence in situ.
We illustrate the limitations of observational analysis with a constructed observational study that allows us to compare experimental and observational estimates of peer influence in link sharing via Facebook News Feed.
We describe research designs in which individuals are randomly encouraged to perform a focal behavior, which can subsequently influence their peers. Ubiquitous product optimization experiments on Internet services can be used for these analyses. This approach is illustrated with an analysis of peer effects in expressions of gratitude via Facebook on Thanksgiving Day 2010, with implications for the micro-foundations of culture.
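Under an encouragement design, the random encouragement serves as an instrument for the focal behavior, and the simplest resulting peer-effect estimate is the Wald instrumental-variables ratio. The sketch below uses hypothetical binary data and the textbook estimator, not necessarily the authors' exact analysis:

```python
# Sketch: Wald IV estimate for an encouragement design.
# z: 1 if the ego was randomly encouraged, d: 1 if the ego performed the
# focal behavior, y: the peer's outcome.
def wald_iv(y, d, z):
    """(E[y|z=1] - E[y|z=0]) / (E[d|z=1] - E[d|z=0])."""
    def mean(xs):
        return sum(xs) / len(xs)
    y1 = mean([yi for yi, zi in zip(y, z) if zi])
    y0 = mean([yi for yi, zi in zip(y, z) if not zi])
    d1 = mean([di for di, zi in zip(d, z) if zi])
    d0 = mean([di for di, zi in zip(d, z) if not zi])
    return (y1 - y0) / (d1 - d0)
```

The point of the design is that z is randomized by the product experiment, so the ratio isolates the effect of the ego's behavior on the peer even when homophily would bias a naive comparison.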
Keywords of the presentation: mobile health, participatory sensing
The most significant health and wellness challenges increasingly involve chronic conditions, from diabetes, hypertension, and asthma to depression, chronic pain, and sleep and neurological disorders. Moreover, three lifestyle behaviors contribute to many of these conditions. Participatory mobile health (mHealth) leverages the power and ubiquity of mobile and cloud technologies to assist individuals, clinicians, and researchers in monitoring, managing, and understanding symptoms, side effects, and treatment outside the clinical setting, and to address the lifestyle factors that can bring on or exacerbate these conditions. By empowering individuals to track and manage their key health-related behaviors and outcomes, this approach has the potential to greatly improve people’s health and quality of life, while simultaneously reducing societies’ overall healthcare costs.
Participatory mHealth incorporates a variety of techniques, including automated activity traces, reminders and prompted inputs. This talk will present our experience to date with mHealth pilots and prototypes and will discuss areas in need of exploration: open modular tools for data collection, analysis and visualization across diverse data types; engagement such as adaptive goal setting and game mechanics; and privacy mechanisms.
Keywords of the presentation: Discussion boards, social networks, drug surveillance, data mining
Medical message boards are online resources where users with a particular condition exchange information, some of which they might not otherwise share with medical providers. Many of these boards contain a large number of posts and patient opinions and experiences that would be potentially useful to clinicians and researchers. We present an approach that is able to collect a corpus of medical message board posts, de-identify the corpus, and extract information on potential adverse drug effects discussed by users. Using a corpus of posts to breast cancer message boards, we identified drug event pairs using co-occurrence statistics. We then compared the identified drug event pairs with adverse effects listed on the package labels of tamoxifen, anastrozole, exemestane, and letrozole. Of the pairs identified by our system, 75–80% were documented on the drug labels. Some of the undocumented pairs may represent previously unidentified adverse drug effects.
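The drug-event pairing step can be illustrated with a simple post-level co-occurrence score such as pointwise mutual information. This is a simplified stand-in for the study's co-occurrence statistics, and the posts and terms below are invented:

```python
# Sketch: score drug/adverse-event pairs by pointwise mutual information
# over post-level co-mentions. Each post is a set of extracted terms.
import math

def cooccurrence_scores(posts, drugs, events):
    """PMI of drug/event pairs: log(P(d, e) / (P(d) * P(e))),
    estimated from counts of posts mentioning each term."""
    n = len(posts)
    scores = {}
    for d in drugs:
        n_d = sum(d in p for p in posts)
        for e in events:
            n_e = sum(e in p for p in posts)
            n_de = sum(d in p and e in p for p in posts)
            if n_d and n_e and n_de:
                scores[(d, e)] = math.log(n_de * n / (n_d * n_e))
    return scores
```

High-scoring pairs would then be checked against package labels, as the study did for the four breast cancer drugs.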
Keywords of the presentation: demographics, networks, temporal models, social, web
This talk provides an overview of several recent projects in modeling
social data, including user demographics, social network structure,
and temporal behavior. First, we present a study which pairs browsing
histories for 250,000 anonymized individuals with user-level
demographic data to study variation in Web activity among different
demographic groups. Next, we discuss work with the Yahoo! Mail team
which aims to infer associations and groups among an individual's
contacts. We conclude with a temporal model of communication
patterns which, phrased as a hidden Markov model, provides an
effective and interpretable characterization of both human and
non-human activity.
Keywords of the presentation: social networks, linking, privacy, matching, labeling, visualization
Many machine learning problems on data can naturally be formulated as problems on graphs. For example, dimensionality reduction and visualization are related to graph embedding. Given a sparse graph between N high-dimensional data nodes, how do we faithfully embed it in low dimension? We present an algorithm that improves dimensionality reduction by extending the Maximum Variance Unfolding method. But given only a dataset of N samples, how do we construct a graph in the first place? The space to explore is daunting, with 2^(N(N-1)/2) graphs to choose from, yet two interesting subfamilies are tractable: matchings and b-matchings. By placing distributions over matchings and using loopy belief propagation, we can efficiently infer maximum weight subgraphs. These fast generalized matching algorithms leverage integral LP relaxations and perfect graph theory. Applications include graph reconstruction, graph embedding, graph transduction, and metric learning with emphasis on data from text, network, mobile and image domains.
In sequential event prediction, we are given a "sequence database" of
past sequences to learn from, and we aim to predict the next event
within a current event sequence. We focus on applications where the
set of past events has predictive power and not the specific order of
those past events. Such applications arise in recommender systems,
equipment maintenance, medical informatics, and in other domains. Our
formalization of sequential event prediction draws on ideas from
supervised ranking. We show how specific choices within this approach
lead to different sequential event prediction problems and algorithms.
We apply our approach to an online grocery store recommender system as
well as a novel application in the health event prediction domain.
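One simple instantiation of set-based (order-free) sequential prediction is to score each candidate next event by conditional frequencies mined from the sequence database. The sketch below is a simplified stand-in for illustration, not the supervised-ranking formulation from the talk:

```python
# Sketch: predict the next event from the *set* of events observed so
# far, ignoring their order, using counts from past sequences.
from collections import defaultdict

def fit_counts(sequences):
    """Count how often each item follows a partial sequence containing
    a given item (order within the past set is ignored)."""
    pair = defaultdict(int)    # (past item, next item) -> count
    single = defaultdict(int)  # past item -> count of conditioning uses
    for seq in sequences:
        for t, nxt in enumerate(seq):
            for prev in set(seq[:t]):
                pair[(prev, nxt)] += 1
                single[prev] += 1
    return pair, single

def predict_next(current_set, candidates, pair, single):
    """Score candidates by summed conditional frequency given the
    current unordered set of observed events."""
    def score(c):
        return sum(pair[(p, c)] / single[p] for p in current_set if single[p])
    return max(candidates, key=score)
```

In the grocery-store setting, `current_set` would be the partially filled basket and `candidates` the catalog items still available to recommend.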
Keywords of the presentation: observational study, predictive modeling, healthcare
In our data-rich world, key medical decisions, ranging from a regulator’s decision to curtail a drug to patient-specific treatment choices, require optimal consideration of myriad inputs. Statistical/epidemiological methods that can harness real-world medical data in useful ways do exist, but much work remains to achieve the full potential of a truly data-driven user-centric medical environment. I will lay out some of the key challenges before us and describe recent progress in the specific area of drug safety.
We propose an online binary classification procedure for cases when
there is uncertainty about the model to use and
parameters within a model change over time.
We account for model uncertainty
through Dynamic Model Averaging (DMA), a dynamic extension of
Bayesian Model Averaging (BMA) in which
posterior model probabilities may also change with time. We
apply a state-space model to the parameters of
each model and we allow the data-generating model to change over time according to a Markov chain. Calibrating a "forgetting" factor accommodates
different levels of change in the data-generating mechanism. We propose an algorithm which
adjusts the level of forgetting in an online fashion using
the posterior predictive distribution, and so accommodates various levels of change at different times.
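The forgetting update on the model probabilities can be sketched in a few lines; the model set and likelihood values here are placeholders, and the state-space update of each model's parameters is omitted:

```python
# Sketch: one Dynamic Model Averaging (DMA) step on the posterior
# model probabilities, with a forgetting factor alpha.
def dma_update(probs, likelihoods, alpha=0.99):
    """probs: current posterior model probabilities.
    likelihoods: each model's predictive likelihood for the new point.
    alpha: forgetting factor; alpha = 1 recovers static BMA, smaller
    values let the model probabilities adapt faster."""
    # Forgetting: flatten the probabilities toward uniform.
    flat = [p ** alpha for p in probs]
    z = sum(flat)
    pred = [f / z for f in flat]
    # Bayes update against the observed data point.
    post = [pr * lik for pr, lik in zip(pred, likelihoods)]
    z = sum(post)
    return [p / z for p in post]
```

Adjusting alpha online via the posterior predictive distribution, as the abstract describes, amounts to choosing among candidate alpha values by how well each predicts the incoming data.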
We apply our method to data from children with appendicitis who receive either a traditional (open) appendectomy or a laparoscopic procedure. Factors associated with which children receive a particular type of procedure changed substantially over the seven years of data collection, a feature that is not captured using standard regression modeling. Because our procedure can be implemented completely online, future data collection for similar studies would require storing sensitive patient information only temporarily, reducing the risk of a breach of confidentiality.
This is a joint work of Tyler H. McCormick (University of Washington), Adrian E. Raftery (University of Washington), David Madigan (Columbia University), Randall Burd (Children's National Medical Center)
Keywords of the presentation: recommender systems, context-awareness, collaborative tagging
The role of recommender systems as a fundamental utility for electronic commerce and information access is well established, with many commercially available recommender systems providing benefits to both users and businesses. But recommender systems tend to use simplistic user models that are additive in nature: new user preferences are simply added to the existing profiles. This additive approach ignores the notion of "situated action," that is, the fact that users interact with systems within a particular context, and items relevant within one context may be irrelevant in another. Little agreement exists among researchers as to what constitutes context, but its importance seems undisputed. In psychology, a change in context during learning has been shown to have an impact on recall. Research in linguistics has shown that context plays the important role of a disambiguation function. More recently, the role of context has been explored in intelligent information systems. In particular, a variety of approaches and architectures have emerged for incorporating context or situational awareness in the recommendation process. In this talk, we provide a broad overview of the problem of contextual recommendation and some of the recent solutions to the problem of modeling context. We will specifically focus on several approaches for integrating context in user modeling for personalized recommendation: one, inspired by a model of human memory, that emphasizes the modeling of context based on observations of user behavior; another that emphasizes the role of domain knowledge and semantics as an integral part of user context; and finally, an approach that exploits social annotations, such as collaborative tagging, as the basis for inferring context.
Cathy will talk about doing math in business, specifically drawing on her experiences as an assistant professor in math, as a quant at a hedge fund, and currently as a data scientist at an internet advertising startup. She will discuss the mathematical as well as the cultural differences of the three jobs, and will suggest how to decide where one may best fit in and why. She will also talk about how questions of ethics fit in to the daily life of a mathematician in business.
I am working on the design of predictive models that are both accurate and interpretable by a human. These models are built from association rules such as "dyspepsia & epigastric pain -> heartburn." I will present three algorithms for "decision lists," where classification is based on a list of rules:
1) A very simple rule-based algorithm, which is to order rules based on the "adjusted confidence." In this case, users can understand the whole algorithm as well as the reason for the prediction.
2) A Bayesian hierarchical model for sequentially predicting conditions of medical patients, using association rules.
3) A mixed-integer optimization (MIO) approach for learning decision lists. This algorithm has high accuracy and interpretability, both owing to the use of MIO.
This is joint work with David Madigan, Tyler McCormick, Ben Letham, Allison Chang, and Dimitris Bertsimas.
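For the first algorithm, one common definition of adjusted confidence for a rule a -> b is support(a and b) / (support(a) + K), where K trades confidence against support; the sketch below assumes that definition, with made-up rule counts:

```python
# Sketch: ranking association rules by adjusted confidence.
def adjusted_confidence(n_ab, n_a, K=1.0):
    """Adjusted confidence of rule a -> b: co-occurrence count over
    antecedent count plus a regularizer K that penalizes rare rules."""
    return n_ab / (n_a + K)

def rank_rules(rules, K=1.0):
    """Rank (n_ab, n_a, name) triples by decreasing adjusted confidence."""
    return sorted(rules, key=lambda r: adjusted_confidence(r[0], r[1], K),
                  reverse=True)

# Hypothetical rules: one perfect but rare, one slightly weaker but
# well supported.
rules = [(1, 1, "rare rule"), (90, 100, "well-supported rule")]
```

With K = 0 this reduces to ordinary confidence, so the single-observation rule wins; any K > 0 shifts the ordering toward rules with real support, which is what makes the resulting decision list both simple and trustworthy.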
Keywords of the presentation: personalized predictions, crowdsourcing, wisdom of crowds, lists, event sequences
I will describe work on three areas related to crowd-based user-centered modeling:
1) Growing Lists:
We want to combine the knowledge of many people (experts) in order to create "sets" of things that go together, starting from a small seed. The experts have varying levels of expertise. This is the same problem that Google Sets was designed to solve. (With Ben Letham and Katherine Heller)
2) Sequential Event Prediction for Personalized Recommendations:
We are given a "sequence database" of past event sequences to learn from (like sequences of products purchased by customers), and we aim to predict the next event within a current event sequence (the next product purchased). We focus on applications where the set of the past events has predictive power and not the specific order of those past events. This is useful for all different kinds of recommender systems and search engines. (With Ben Letham and David Madigan)
3) Approximating the Crowd on a Budget:
The problem of "approximating the crowd" is that of estimating the crowd's majority opinion by querying only a subset of it. Algorithms that approximate the crowd can intelligently stretch a limited budget for a crowdsourcing task, and must balance between exploring the quality of the labelers and exploiting the best ones. (With Seyda Ertekin and Haym Hirsh)
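A minimal version of item 3 queries labelers one at a time and stops as soon as the remaining budget cannot overturn the current majority. This ignores the explore/exploit modeling of individual labeler quality that the full problem requires:

```python
# Sketch: approximate the crowd's majority vote on a budget by
# stopping once the outcome is mathematically decided.
def approximate_majority(labels, max_votes):
    """labels: an iterable of 0/1 votes, consumed one at a time.
    Returns (estimated majority label, number of votes actually used)."""
    votes = iter(labels)
    pos = neg = 0
    for asked in range(max_votes):
        if next(votes) == 1:
            pos += 1
        else:
            neg += 1
        if abs(pos - neg) > max_votes - asked - 1:
            break  # the minority can no longer catch up
    return (1 if pos >= neg else 0), pos + neg
```

The saved votes are the budget that smarter algorithms reallocate toward exploring labeler quality or labeling additional items.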
User interest modeling attempts to represent user interests in a form that can be used to improve system support when users are searching for, selecting from, and browsing documents or other resources. Work on recognizing user interests based on their prior activities, such as their browsing behavior, is a common approach to implicit user interest modeling. The work presented expands on this approach by aggregating activity across multiple end-user applications. This talk presents the evolution of the Interest Profile Manager, a local application that collects activity data and acts as a service to those applications seeking to better support information access.
Keywords of the presentation: privacy, user data, social networks
"We do not collect personally identifiable information"... "This dataset
has been de-identified prior to release"... From advertisers tracking Web
clicks to biomedical researchers sharing clinical records, anonymization
is the main privacy protection mechanism used for sensitive user data.
I will argue that the distinction between "personally identifiable" and
"non-personally identifiable" information is fallacious by showing how to
infer private information from fully anonymized data in three settings:
(1) records of individual transactions and preferences, illustrated by the
Netflix Prize dataset, (2) social networks, and (3) recommender systems,
where temporal changes in aggregate statistics allow accurate inference
of hidden individual transactions.
I will then outline a program for data privacy research. It includes
several challenging problems in the design and implementation of
privacy-preserving systems, domain-specific algorithmic research,
as well as policy and economic issues.
The effectiveness of online display ads beyond simple click-through evaluation is not well established in the literature. Are the high conversion rates seen for subsets of browsers the result of choosing to display ads to a group that has a naturally higher tendency to convert or does the advertisement itself cause an additional lift? How does showing an ad to different segments of the population affect their tendencies to take a specific action, or convert? We present an approach for assessing the effect of display advertising on customer conversion that does not require the cumbersome and expensive setup of a controlled experiment, but rather uses the observed events in a regular campaign setting. The general approach can be applied to many additional types of causal questions in display advertising and beyond. The approach relies on four steps:
- Defining the question of interest.
- Using domain knowledge and temporal cues to establish causal assumptions.
- Choosing a parameter of interest that directly answers the question of interest under the causal assumption.
- Estimating the parameter of interest as well as possible using Targeted Maximum Likelihood Estimation (TMLE), a double robust estimating procedure.
We apply the above approach to several display advertising campaigns for m6d, a display advertising company that observes over 5 billion actions a day and uses that data along with machine learning algorithms to determine the best prospects for a brand.
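TMLE itself involves an initial outcome fit plus a targeting step; as a smaller self-contained illustration of a double robust estimator in the same family, here is the augmented IPW estimate of the ad's effect. All inputs are hypothetical fitted values, and this is not the estimator used in the paper:

```python
# Sketch: augmented IPW (doubly robust) estimate of E[Y(1)] - E[Y(0)],
# i.e., the lift in conversion from showing the ad.
import numpy as np

def aipw_effect(y, a, propensity, mu1, mu0):
    """y: observed outcomes; a: 1 if the ad was shown;
    propensity: fitted P(a = 1 | covariates);
    mu1, mu0: fitted outcome predictions under ad / no ad."""
    term1 = a * (y - mu1) / propensity + mu1
    term0 = (1 - a) * (y - mu0) / (1 - propensity) + mu0
    return float(np.mean(term1 - term0))
```

Like TMLE, this estimator is consistent if either the propensity model or the outcome model is correct, which is what "double robust" buys in the display advertising setting.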
Relevance judgments provided by a neonatal human microbiome researcher were used to predict the relevance of additional publications to the researcher's information need. Six PubMed queries were run to retrieve documents which the researcher judged for relevance. These relevance judgments were used to produce training and test sets for the evaluation of two machine learning algorithms: C4.5 and support vector machines. These algorithms were evaluated in two ways: 1) tenfold cross-validation and 2) training on publications from 2008-2010 and testing on documents from 2011. It was found that the researcher's relevance judgments could be used to accurately predict relevance.
Keywords of the presentation: Crowdsourcing, Bias
Biased labelers are a systemic problem in crowdsourcing, and a
comprehensive toolbox for handling their responses is still being
developed. A typical crowdsourcing application can be divided into
three steps: data collection, data curation, and learning. At present
these steps are often treated separately. We present Bayesian Bias
Mitigation for Crowdsourcing (BBMC), a Bayesian model to unify all
three. Most data curation methods account for the effects of
labeler bias by modeling all labels as coming from a single latent
truth. Our model captures the sources of bias by describing
labelers as influenced by shared random effects. This approach can
account for more complex bias patterns that arise in ambiguous or hard
labeling tasks and allows us to merge data curation and learning into
a single computation. Active learning integrates data collection with
learning, but is commonly considered infeasible with Gibbs sampling
inference. We propose a general approximation strategy for Markov
chains to efficiently quantify the effect of a perturbation on the
stationary distribution and specialize this approach to allow active
learning with Gibbs sampling in our model. Experiments show BBMC to
outperform many common heuristics when a useful consensus labeling
cannot be estimated.