Data Clustering (Final Report)

IMA REU Program: Schedule

Generally, clustering methods attempt to separate input data into several coherent types, or clusters, based on a meaningful metric. As an example of a clustering problem, consider the data points pictured below:

Assuming we want to cluster this data based on Euclidean distance, most reasonable individuals would agree there are three clusters. However, in more complicated settings it can become very difficult both to determine how many clusters there are and to assign each data point to the proper cluster. This summer we will investigate both of these issues, focusing primarily on the first question above: "How many clusters are there?"
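To make the assignment step concrete, here is a minimal sketch of Lloyd's algorithm (the classic k-means iteration) run on a toy 2D dataset with three well-separated blobs. The function name, parameters, and toy data are our own choices for illustration, not a reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate nearest-center assignment
    with centroid updates until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Three well-separated 2D blobs, as in the picture described above.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, (50, 2)) for m in ([0, 0], [5, 0], [0, 5])])
labels, centers = kmeans(X, 3)
```

Note that Lloyd's iteration only finds a local optimum for a *given* k; it says nothing about how to choose k, which is exactly the question this group will study.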

Introductory Talk: PowerPoint slides.

Final Report Template: LaTeX Example.

Slide Presentation Template: LaTeX Example.

Poster Presentation Templates: LaTeX Example 1, LaTeX Example 2, and PowerPoint Example.

Group Goals

We will focus on methods for determining the number of clusters in a dataset. In the process of studying these methods we will try to accomplish as many of the following goals as we can:
• Conduct a short literature survey based on the papers below. What is currently known about when various methods will correctly determine the number of clusters in a dataset?
• Compare existing methods on artificial and real data. Do your experiments agree with what you found in your literature survey?
• Use your experiments to conclude which methods perform best on which types of data sets. Can you prove some method of your choice will perform well/poorly on particular types of data?
• Using what you have learned, propose and test a new clustering method which behaves better than existing methods on some type(s) of data. What can you prove?
Of course, you should not expect to accomplish all of these goals! The course of study and pace of work is ultimately up to you.

Papers on K-Means and variants
• The Uniqueness of a Good Optimum for K-Means, Marina Meila, Proceedings of the 23rd International Conference on Machine Learning, 2006.
• The Effectiveness of Lloyd-Type Methods for the k-Means Problem, Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy, SODA, 2007.
• Improved Smoothed Analysis of the k-Means Method, Bodo Manthey and Heiko Röglin, preprint, 2008.
• k-means++: The Advantages of Careful Seeding, David Arthur and Sergei Vassilvitskii. Slide format.

Papers on general clustering problems
• An Impossibility Theorem for Clustering, Jon Kleinberg, preprint, 2002.
• When is Clustering Hard?, Nathan Srebro, Gregory Shakhnarovich, and Sam Roweis, PASCAL Workshop on Statistics and Optimization of Clustering, 2005.
• An Investigation of Computational and Informational Limits in Gaussian Mixture Clustering, Nathan Srebro, Gregory Shakhnarovich, and Sam Roweis, Proceedings of the 23rd International Conference on Machine Learning, 2006.
• Which Data Sets are `Clusterable'? -- A Theoretical Study of Clusterability, Margareta Ackerman and Shai Ben-David, preprint, 2008.
• Measures of Clustering Quality: A Working Set of Axioms for Clustering, Margareta Ackerman and Shai Ben-David, preprint, 2009.
• Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary?, Nguyen Xuan Vinh, Julien Epps, and James Bailey, Proceedings of the 26th International Conference on Machine Learning, 2009.
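As a concrete companion to the k-means++ paper listed above, here is a sketch of its seeding rule: the first center is a uniformly random data point, and each subsequent center is a data point sampled with probability proportional to its squared distance D(x)² to the nearest center already chosen. The helper name `kmeanspp_seeds` and the toy data are ours, not the authors'.

```python
import numpy as np

def kmeanspp_seeds(X, k, rng):
    """k-means++ seeding: spread the initial centers out by sampling
    each new center proportionally to D(x)^2."""
    # First center: uniform over the data points.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance to the nearest center chosen so far.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(1)
        # Next center: a data point sampled with probability D(x)^2 / sum D^2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (40, 2)) for m in ([0, 0], [6, 0], [0, 6])])
seeds = kmeanspp_seeds(X, 3, rng)
```

The paper's point is that running Lloyd's algorithm from seeds chosen this way gives an O(log k)-competitive expected cost guarantee that uniform random seeding lacks.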

Code for determining the number of clusters:

R (and Perl)
• R Tutorial.
• Gap Statistic: Perl code, or an R code implementation.
• Model-Based Clustering: R code.

C/C++
• X-means: C code.
• G-means: C++ code.

Matlab
• Data Spectroscopic Clustering: Matlab code.
• Self-tuning Spectral Clustering: Matlab code.
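For readers without R or Perl handy, here is a rough Python sketch of the idea behind the Gap Statistic (Tibshirani, Walther, and Hastie) implemented by the code listed above: compare log W_k, the log within-cluster dispersion on the data, against its average over reference datasets drawn uniformly from the data's bounding box. This simplified version omits the paper's standard-error correction and uses a crude restarted k-means; all function names and defaults are our own.

```python
import numpy as np

def _kmeans_W(X, k, rng, n_iter=30, n_restart=5):
    """Within-cluster sum of squares from the best of a few Lloyd's runs."""
    best = np.inf
    for _ in range(n_restart):
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(1)
            centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        best = min(best, ((X - centers[labels]) ** 2).sum())
    return best

def gap_statistic(X, k_max=6, n_ref=10, seed=0):
    """Gap(k) = mean_b log W_k(ref_b) - log W_k(X), with reference sets
    drawn uniformly from the bounding box of X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(0), X.max(0)
    gaps = []
    for k in range(1, k_max + 1):
        log_w = np.log(_kmeans_W(X, k, rng))
        log_w_ref = [np.log(_kmeans_W(rng.uniform(lo, hi, X.shape), k, rng))
                     for _ in range(n_ref)]
        gaps.append(np.mean(log_w_ref) - log_w)
    return np.array(gaps)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.3, (40, 2)) for m in ([0, 0], [6, 0], [0, 6])])
gaps = gap_statistic(X)
```

On well-clustered data the gap jumps sharply at the true number of clusters; the full method then picks the smallest k whose gap is within one (corrected) standard error of the next one.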

Data for comparing clustering methods:

Matlab code for generating random datasets
• An example `.m' file that creates a 2D dataset with 3 clusters. It can also be modified to generate other artificial data (with different numbers of clusters, dimensions, and underlying distributions).
• The following Matlab package contains a file called "generate_samples.m" for generating hybrid linear models; it is part of the larger GPCA package. To avoid intersection of subspaces (so that standard clustering can be applied), set the parameter avoidIntersection = TRUE and use affine rather than linear subspaces.

Other data and data repositories
• Clustering datasets at the UCI Repository.
• Complete UCI Machine Learning Repository.
• Yale Face Database B. Some processed face datasets saved as Matlab data can be found here. Two matrices, X and Y, are included; if you plot Y(1:3,:) you will see three clearly separated clusters. The first 64 points are in one cluster, the next 64 in another, and so on. The original files are on the Yale Face Database B webpage (above); the folder names are yaleB5_P00, yaleB8_P00, and yaleB10_P00. They were processed following the steps described in Section 4.2.2 of the following paper, and the Matlab code used for processing them is here.
• An example of spectral clustering data containing points from 2 noisy circles: after loading the `.mat' file, type "plot(X(:,1),X(:,2),'LineStyle','.');" to see them. You can embed them into 2D space for clustering with EmbedCircles.m. Note that changing sigma in this file will lead to different problems.
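If you prefer to generate test data outside of Matlab, this Python sketch produces a 2D dataset with 3 Gaussian clusters of 64 points each, mirroring the 64-points-per-cluster layout of the processed face data described above. The function name `make_blobs` and all defaults (means, sigma) are illustrative choices, not taken from the `.m' file.

```python
import numpy as np

def make_blobs(n_per=64, means=((0, 0), (5, 0), (0, 5)), sigma=0.5, seed=0):
    """Draw n_per points from an isotropic Gaussian around each mean and
    return the stacked data X with ground-truth labels y (0, 1, 2, ...)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    X = np.vstack([rng.normal(m, sigma, (n_per, len(m))) for m in means])
    y = np.repeat(np.arange(len(means)), n_per)
    return X, y

X, y = make_blobs()
```

Changing the means, sigma, number of means, or their dimension gives other artificial datasets for the comparisons above, and the ground-truth labels y make it easy to score a clustering method's output.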

For after work, or on the weekends -- A list of Fun Stuff to do.