Large Scale Information Extraction from the Web

Friday, February 11, 2011 - 1:25pm - 2:25pm
Vincent 570
Sathiya Keerthi (Yahoo! Inc.)
The web holds a vast wealth of information about various types of entities, such as businesses (e.g., address, phone, category, hours of operation), products, books, and doctors, distributed over a very large number of websites. Extracting this information can help us create extensive databases of these entities. Search engines can then use the databases for better ranking and rendering of search results; for example, a user can search for products with certain features. Websites usually present the information in semi-structured formats that are varied and noisy. Extraction on a large scale is challenging because it is not feasible to provide supervision (say, via labeled examples) on a per-site basis. In this talk, I will give an overview of all the steps in a complete extraction pipeline and describe a few scalable machine learning approaches for large-scale information extraction.

Dr. Keerthi is a Principal Research Scientist at Yahoo! Research. Over the last twenty years, his research has focused on the development of practical algorithms in a variety of areas, such as machine learning, robotics, computer graphics, and optimal control. His work on support vector machines (fast algorithms), polytope distance computation (GJK algorithm), and model predictive control (stability theory) is highly cited. His current research focuses on machine learning algorithms for structured outputs as applied to information extraction. Prior to joining Yahoo!, he worked for 10 years at the Indian Institute of Science, Bangalore, and for 5 years at the National University of Singapore. Dr. Keerthi is a member of the editorial board of the Journal of Machine Learning Research.