Data Clustering (Final Report)

Mentors: Gilad Lerman, Mark Iwen

IMA REU Program: Schedule

Generally, clustering methods attempt to separate input data into several coherent types, or clusters, based on a meaningful metric. As an example of a clustering problem, consider the data points pictured below:

Assuming we want to cluster this data based on Euclidean distance, most reasonable individuals would agree there are three clusters. However, in more complicated settings it can become very difficult to both determine how many clusters there are, and then to assign each data point to the proper cluster. This summer we will investigate both of these issues, focusing primarily on the first question above: "How many clusters are there?"

Introductory Talk: Power point slides.

Final Report Template: Latex Example.

Slide Presentation Template: Latex Example.

Poster Presentation Templates: Latex Example 1, Latex Example 2, and Power Point Example.

Group Goals

We will focus on methods for determining the number of clusters in a dataset. In the process of studying these methods we will try to accomplish as many of the following goals as we can: Of course, you should not expect to accomplish all of these goals! The course of study and pace of work is ultimately up to you.

Papers on K-Means and variants

Papers on general clustering problems

Code for determining number of clusters:

Data for comparing clustering methods:

    Matlab Code for Generating Random Datasets

  • An example `.m' file that creates a 2D dataset with 3 clusters. It can also be modified to generate other artificial data (with different numbers of clusters, dimensions, and underlying distributions).
  • The following matlab package contains a file called "generate_samples.m" for generating hybrid linear models. It is part of the larger GPCA package. In order to avoid intersection of subspaces (so that standard clustering could be applied) one needs to set the parameter avoidIntersection = TRUE (and also have affine subspaces instead of linear).

    Other Data and Data repositories

  • Clustering datasets at UCI Repository
  • Complete UCI Machine Learning Repository
  • Yale Face Database B
  • Some processed face datasets saved as Matlab data can be found here. Two matrices, X and Y, are included. If you plot Y(1:3,:) you will see three clearly separated clusters. The first 64 points are in one cluster, the next 64 points in another cluster, etc.. The original files are on the Yale Face Database B webpage (above). The folder names are yaleB5_P00, yaleB8_P00, yaleB10_P00. They have been processed following the steps described in Section 4.2.2 of the following paper. The matlab code used for processing them is here.
  • Here is an example of spectral clustering data. It contains points from 2 noisy circles: after loading the `.mat' file type "plot(X(:,1),X(:,2),'LineStyle','.');" to see them. You can embed them into 2D space for clustering with EmbedCircles.m. Note that changing sigma in this file will lead to different problems.

For after work, or on the weekends -- A list of Fun Stuff to do.