Predicting with Proxies
Wednesday, October 3, 2018 - 2:00pm - 2:45pm
Predictive analytics is increasingly used to guide decision-making in many applications. However, in practice, we often have limited data on the true outcome that we wish to predict, but copious data on an intermediate or proxy outcome. Practitioners often train predictive models on the proxy because doing so tends to yield more accurate predictions. For example, Amazon uses its abundant customer click (proxy) data to make product recommendations rather than its relatively sparse customer purchase (true) data; analogously, hospitals use frequently observed patient readmission rates (proxy) rather than mortality rates (true) to assign interventions. However, not accounting for the bias in the proxy can lead to sub-optimal decisions. We propose a novel estimator that uses techniques from high-dimensional statistics to efficiently combine a large amount of proxy data and a small amount of true data. We prove upper bounds on the error of our proposed estimator and lower bounds on the error of several baselines; in particular, our proposed estimator achieves provably better performance than heuristics commonly used by data scientists. Finally, we demonstrate the effectiveness of this approach on an e-commerce dataset and a healthcare dataset; in both cases, we achieve significantly better predictive accuracy and gain managerial insights into the nature of the bias in the proxy data.
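The abstract does not spell out the estimator, so the following is only a rough illustration of how a large proxy sample and a small true sample might be combined. It assumes (and this is an assumption, not a statement from the talk) that both outcomes are linear in the same features and that the true and proxy coefficient vectors differ in only a few coordinates. Under that assumption, one plausible two-step sketch fits ordinary least squares on the proxy data, then uses an L1-penalized regression on the true data to estimate a sparse correction. The names `proxy_then_correct`, `lasso_ista`, and the penalty `lam` are hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/(2n)) * ||y - X d||^2 + lam * ||d||_1 by proximal gradient."""
    n, p = X.shape
    # Step size = 1 / Lipschitz constant of the smooth part's gradient.
    step = n / (np.linalg.norm(X, 2) ** 2)
    d = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ d - y) / n
        d = soft_threshold(d - step * grad, step * lam)
    return d


def proxy_then_correct(X_proxy, y_proxy, X_true, y_true, lam):
    """Hypothetical two-step estimator (an illustrative assumption):
    1. least squares on the abundant proxy data;
    2. sparse (L1-penalized) correction fitted on the scarce true data.
    Returns (combined coefficients, estimated correction)."""
    beta_proxy, *_ = np.linalg.lstsq(X_proxy, y_proxy, rcond=None)
    residual = y_true - X_true @ beta_proxy
    delta = lasso_ista(X_true, residual, lam)
    return beta_proxy + delta, delta


if __name__ == "__main__":
    # Synthetic data, made up purely for illustration.
    rng = np.random.default_rng(0)
    p, n_proxy, n_true = 20, 5000, 50
    beta_proxy_true = rng.normal(size=p)
    delta_true = np.zeros(p)
    delta_true[:2] = [1.0, -1.0]  # proxy bias confined to two features
    beta_true = beta_proxy_true + delta_true

    X_proxy = rng.normal(size=(n_proxy, p))
    y_proxy = X_proxy @ beta_proxy_true + 0.5 * rng.normal(size=n_proxy)
    X_true = rng.normal(size=(n_true, p))
    y_true = X_true @ beta_true + 0.5 * rng.normal(size=n_true)

    beta_hat, delta_hat = proxy_then_correct(
        X_proxy, y_proxy, X_true, y_true, lam=0.15
    )
    beta_proxy_only, *_ = np.linalg.lstsq(X_proxy, y_proxy, rcond=None)
    print("proxy-only error:", np.linalg.norm(beta_proxy_only - beta_true))
    print("combined error:  ", np.linalg.norm(beta_hat - beta_true))
    print("nonzero corrections at:", np.flatnonzero(delta_hat))
```

In a sketch like this, the nonzero entries of the estimated correction indicate which features the proxy misweights, which is one way the kind of insight into proxy bias mentioned above could arise. The penalty `lam` would normally be chosen by cross-validation on the true data.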