Measuring and visualizing cultural diversity in online encyclopedias

Faculty Advisor: Shilad Sen, Department of Mathematics, Statistics and Computer Science, Macalester College

Problem Poser: Brent Hecht, Department of Computer Science and Engineering, University of Minnesota

People and technologies rely more and more on user-created datasets like Wikipedia. However, research has begun to reveal important cultural biases in these datasets. In this project, we will use state-of-the-art techniques from the domains of artificial intelligence (AI) and natural language processing (NLP) to surface, quantify and visualize the similarities and differences in how various online encyclopedias describe the world.

We will analyze multiple language editions of Wikipedia (e.g. the English Wikipedia, the Hebrew Wikipedia, and the Arabic Wikipedia) as well as other online encyclopedias such as Conservapedia, which describes itself as “written from a Christian fundamentalist viewpoint”, and EcuRed, the Cuban government’s encyclopedia. We will investigate, for instance, how these encyclopedias differ in their descriptions of the topic ‘contraception’, visualize those differences using information visualization approaches, and do so with 'big data' methods that scale to the millions of articles in our dataset.
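As a rough illustration of what quantifying such differences could look like, the sketch below scores how similar two descriptions of the same topic are, using cosine similarity over simple term counts. It is only an assumption about one possible starting point, not the project's actual method: the function names and the two short snippets are hypothetical placeholders, and the comparison only makes sense for texts in the same language. Comparing across language editions would need an additional alignment or translation step that is not shown here.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split it into word tokens, and count each term."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts.

    Returns a value between 0.0 (no shared terms) and 1.0 (same term
    proportions). Empty texts are treated as having similarity 0.0.
    """
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # Counter returns 0 for terms missing from tf_b, so only shared terms count.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical placeholder snippets, NOT real article text.
articles = {
    "edition_a": "contraception is a method of family planning",
    "edition_b": "contraception is a subject of moral debate",
}

score = cosine_similarity(articles["edition_a"], articles["edition_b"])
print(f"similarity: {score:.2f}")
```

A lower score suggests that two encyclopedias frame the topic with different vocabulary. Raw term counts are a deliberately crude choice here; they are easy to inspect but ignore word order, synonyms, and common filler words.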

Required background: strong programming skills in Java, discrete math, data structures, and either statistics or algorithms.

Useful background: social science experience (especially computational social science), experience with R, fluency in languages other than English, information retrieval.