Data Clustering (Final Report)
Mentors: Gilad Lerman, Mark Iwen
IMA REU Program: Schedule
Generally, clustering methods attempt to separate input data into several coherent types, or clusters, based on a meaningful metric. As an example of a
clustering problem, consider the data points pictured below:
Assuming we want to cluster this data based on Euclidean distance, most reasonable individuals would agree there are three clusters. However, in more
complicated settings it can become very difficult both to determine how many clusters there are and to assign each data point to the proper
cluster. This summer we will investigate both of these issues, focusing primarily on the first question: "How many clusters are there?"
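To make the question concrete, here is a minimal sketch (in Python rather than the course's Matlab; all function names are ours, not from the course materials) of Lloyd's k-means algorithm together with the within-cluster sum of squares (WCSS), the quantity behind the common "elbow" heuristic for choosing the number of clusters:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random distinct points
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign every point to its nearest center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # move each center to the mean of its points (keep it if it went empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    wcss = ((X - centers[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, centers, wcss
```

Running kmeans for k = 1, 2, ..., 6 on well-separated data and plotting the WCSS typically shows a sharp drop up to the true number of clusters and a flat tail afterwards; the bend (the "elbow") is one simple, if heuristic, answer to "how many clusters are there?".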
Introductory Talk: PowerPoint slides.
Final Report Template: LaTeX Example.
Slide Presentation Template: LaTeX Example.
Poster Presentation Templates: LaTeX Example 1, LaTeX Example 2, and PowerPoint Example.
We will focus on methods for determining the number of clusters in a dataset. In the process of studying these methods we will try to accomplish as many
of the following goals as we can:
Of course, you should not expect to accomplish all of these goals! The course of study and pace of work are ultimately up to you.
- Conduct a short literature survey based on the papers below. What is currently known about when various methods
will correctly determine the number of clusters in a dataset?
- Compare existing methods on artificial and real data. Do your experiments agree
with what you found in your literature survey?
- Use your experiments to conclude which methods perform best on which types of data sets. Can you prove some method of your choice will
perform well/poorly on particular types of data?
- Using what you have learned, propose and test a new clustering method which behaves better than existing methods on some type(s) of data.
What can you prove?
Papers on K-Means and variants
- The Uniqueness of a Good Optimum for K-Means,
Marina Meila, Proceedings of the 23rd International Conference on Machine Learning, 2006
- The Effectiveness of Lloyd-Type Methods for the k-Means Problem,
Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy, SODA, 2007
- Improved Smoothed Analysis of the k-Means Method,
Bodo Manthey and Heiko Röglin, preprint, 2008
- k-means++: The Advantages of Careful Seeding,
David Arthur and Sergei Vassilvitskii, SODA, 2007
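The seeding rule from the Arthur-Vassilvitskii paper above is simple enough to sketch directly (Python used for illustration; the function name is ours): the first center is drawn uniformly at random, and each subsequent center is drawn with probability proportional to its squared distance from the nearest center chosen so far.

```python
import numpy as np

def kmeanspp_seed(X, k, seed=0):
    """k-means++ seeding (D^2 weighting): each new center is sampled with
    probability proportional to squared distance from the nearest chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    for _ in range(k - 1):
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)  # squared dist to nearest center
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

Because already-chosen points have squared distance exactly zero, they are never re-selected; on well-separated data the D^2 weighting tends to place one seed per cluster, which is the source of the approximation guarantee in the paper.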
Papers on general clustering problems
- An Impossibility Theorem for Clustering,
Jon Kleinberg, preprint, 2002
- When is Clustering Hard?,
Nathan Srebro, Gregory Shakhnarovich, and Sam Roweis, PASCAL Workshop on Statistics and Optimization of Clustering, 2005
- An Investigation of Computational and Informational Limits in Gaussian Mixture Clustering,
Nathan Srebro, Gregory Shakhnarovich, and Sam Roweis, Proceedings of the 23rd International Conference on Machine Learning, 2006
- Which Data Sets are 'Clusterable'? -- A Theoretical Study of Clusterability,
Margareta Ackerman and Shai Ben-David, preprint, 2008
- Measures of Clustering Quality: A Working Set of Axioms for Clustering,
Margareta Ackerman and Shai Ben-David, preprint, 2009
- Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary?,
Nguyen Xuan Vinh, Julien Epps, and James Bailey, Proceedings of the 26th International Conference on Machine Learning, 2009
Data for comparing clustering methods:
Matlab Code for Generating Random Datasets
- An example ".m" file that creates a 2D dataset with 3 clusters. It can also be modified to generate other artificial
data (with different numbers of clusters, dimensions, and underlying distributions).
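For readers without the ".m" file handy, here is a hypothetical Python equivalent of the generator described above (the means and spread below are ours, chosen only to produce three visibly separated 2D clusters):

```python
import numpy as np

def make_three_clusters(n_per_cluster=64, seed=0):
    """Sample a 2D dataset with 3 Gaussian clusters.  The means and spread
    are hypothetical, chosen only to mimic the kind of toy data the
    course's .m file produces."""
    rng = np.random.default_rng(seed)
    means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
    X = np.vstack([m + 0.5 * rng.standard_normal((n_per_cluster, 2)) for m in means])
    y = np.repeat(np.arange(3), n_per_cluster)  # ground-truth cluster labels
    return X, y
```

Changing the means, the noise scale, the number of components, or the ambient dimension yields the other artificial test cases mentioned above.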
- The following Matlab package contains a file called
"generate_samples.m" for generating hybrid linear models. It is part of the larger GPCA package.
To avoid intersecting subspaces (so that standard clustering can be applied), set the parameter avoidIntersection = TRUE
and use affine rather than linear subspaces.
Other Data and Data repositories
- Clustering datasets at UCI Repository
- Complete UCI Machine Learning Repository
- Yale Face Database B
- Some processed face datasets saved as Matlab data can be found here. Two matrices, X and Y, are included. If you plot Y(1:3,:) you will see three clearly separated clusters.
The first 64 points are in one cluster, the next 64 points in the second, and so on. The original files are on the Yale Face Database B webpage (above). The folder names
are yaleB5_P00, yaleB8_P00, yaleB10_P00. They have been processed following the steps described in Section 4.2.2 of the
following paper. The Matlab code used for processing them is here.
- Here is an example of spectral clustering data. It contains points from 2 noisy circles: after loading the ".mat" file, type
"plot(X(:,1),X(:,2),'.');" to see them. You can embed them into 2D space for clustering with EmbedCircles.m. Note that changing sigma
in this file will produce different embeddings, and thus clustering problems of varying difficulty.
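As a rough illustration of what an embedding script like EmbedCircles.m typically does (this is a generic Ng-Jordan-Weiss-style sketch in Python, not a translation of the actual file): build a Gaussian affinity with scale sigma, normalize it symmetrically, and embed each point using the top eigenvectors.

```python
import numpy as np

def spectral_embedding(X, sigma=0.2, n_components=2):
    """Embed points using the top eigenvectors of the symmetrically
    normalized Gaussian affinity matrix (Ng-Jordan-Weiss style)."""
    # affinity: W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), zero diagonal
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    M = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]  # D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(M)                        # eigenvalues ascending
    V = vecs[:, -n_components:]                        # top eigenvectors
    return V / np.linalg.norm(V, axis=1, keepdims=True)  # row-normalize each point
```

On two well-separated noisy circles, points from the same circle collapse to (nearly) a single direction in the embedding, so ordinary k-means on the embedded points separates the circles even though they are not linearly separable in the original coordinates. Sigma sets the affinity scale, which is why changing it in EmbedCircles.m changes the difficulty of the resulting problem.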
For after work or on the weekends: a list of Fun Stuff to do.