About the Talk: The web contains a vast wealth of information about many types of entities, such as businesses (e.g., address, phone, category, hours of operation), products, books, and doctors, distributed across a very large number of websites. Extracting this information can help us build extensive databases of these entities, which search engines can then use to rank and render results more effectively; for example, a user could search for products with specific features. Websites usually present this information in semi-structured formats that are varied and noisy, and extraction at scale is challenging because it is not feasible to provide supervision (say, via labeled examples) on a per-site basis. In this talk, I will give an overview of the steps in a complete extraction pipeline and describe several scalable machine learning approaches for large-scale information extraction.