Monday, January 21, 2008


Stars in the night sky are grouped into galaxies based on their proximity to other stars and other groups of stars. Although astronomers don't seem to have too much trouble identifying galaxies, the problem of clustering data points is a challenging algorithmic equivalent, and finding better algorithms is an import pursuit, because clustering algorithms have broad applications in areas ranging from computer vision to process optimization to movie recommendations.

One simple and effective clustering algorithm is known as k-means. The 'k' implies the number of groups needs to be specified beforehand. The algorithm doesn't try to figure out how many groups exist; rather, given a set of data points and the number of groups (k), the k-means algorithm does its best to sensibly divide the data points into k distinct groups, or clusters.

I'll stick with the star analogy in explaining how it works. Suppose there are a finite number of stars in some volume of space (data points). And also suppose k spaceships suddenly appear at random positions within this volume of space.

Step 1. For each star, find the nearest spaceship and declare it a member of the nearest spaceship's galaxy (cluster). Step 2. After assigning stars to spaceship, move each spaceship to the center of its galaxy (based on averaging all of the locations of the the stars in the galaxy). 3. Repeat Step 1 and Step 2 over and over until the spaceships stop moving.

That's it.

Here's a link to a little demo applet I wrote:



Post a Comment

Subscribe to Post Comments [Atom]

<< Home