Previously, a post on a favorite clustering algorithm, k-means.
I'm not sure if clustering is technically considered a form of dimensionality reduction, but it certainly is similar in effect, especially when it comes to making inferences. For example, in spite of all the variation we see in human beings, simply specifying the sex of an individual can transmit a great deal of information from which good inferences can be made. Similarly, one can infer a great deal by knowing an individual has lived all of his or her life in a particular country.
In this post, I'm going to briefly discuss another clustering method called "hierarchical clustering" or "agglomerative hierarchical clustering." After titling the previous post "Galaxies," I've titled this one "Constellations." I'm not sure how apropos the title is, but perhaps there's mnemonic utility in it. And, of course, I've created another applet so one can see the algorithm in action, which is what always seems to bring me the most insight into such things (and the results look kinda constellation-like).
So you're looking at stars in the sky again, and this time you want to use hierarchical clustering to group them together. How do you it? It's easy. In the previous case of k-means, we iterated through all the stars and assigned each star to a group on the first pass, but hierarchical clustering doesn't work like that. In the case of hierarchical clustering, every star is initially a cluster unto itself--that is, if you have 100 stars, you start out with 100 clusters.
Given that initial state of a point being a cluster, the algorithm requires only one rule applied repeatedly:
1. Find the two "most similar" clusters and merge them together.
Lather, rinse, repeat.
Playing along with the constellation theme, each star begins as constellation by itself, and from there the "most similar" constellations are repeatedly merged until you're "satisfied" (e.g., you've got the number of clusters you want, etc.).
There's a nagging question: What do you mean by most similar? This can be a number of criteria or some combination thereof. Similarity might entail similar clusters being close together. Alternatively, similar clusters might be clusters that are far away from other clusters. Etc. The metric I use is in my demo applet is distance between clusters, where cluster position is determined by the cluster's center of mass.
Anyhow, here's an applet for you to play with:
If you'd like to learn more, I recommend Andrew Moore's slides: