Thursday, January 24, 2008


Previously, a post on a favorite clustering algorithm, k-means.

I'm not sure if clustering is technically considered a form of dimensionality reduction, but it certainly is similar in effect, especially when it comes to making inferences. For example, in spite of all the variation we see in human beings, simply specifying the sex of an individual can transmit a great deal of information from which good inferences can be made. Similarly, one can infer a great deal by knowing an individual has lived all of his or her life in a particular country.

In this post, I'm going to briefly discuss another clustering method called "hierarchical clustering" or "agglomerative hierarchical clustering." After titling the previous post "Galaxies," I've titled this one "Constellations." I'm not sure how apropos the title is, but perhaps there's mnemonic utility in it. And, of course, I've created another applet so one can see the algorithm in action, which is what always seems to bring me the most insight into such things (and the results look kinda constellation-like).

So you're looking at stars in the sky again, and this time you want to use hierarchical clustering to group them together. How do you it? It's easy. In the previous case of k-means, we iterated through all the stars and assigned each star to a group on the first pass, but hierarchical clustering doesn't work like that. In the case of hierarchical clustering, every star is initially a cluster unto itself--that is, if you have 100 stars, you start out with 100 clusters.

Given that initial state of a point being a cluster, the algorithm requires only one rule applied repeatedly:

1. Find the two "most similar" clusters and merge them together.

Lather, rinse, repeat.

Playing along with the constellation theme, each star begins as constellation by itself, and from there the "most similar" constellations are repeatedly merged until you're "satisfied" (e.g., you've got the number of clusters you want, etc.).

There's a nagging question: What do you mean by most similar? This can be a number of criteria or some combination thereof. Similarity might entail similar clusters being close together. Alternatively, similar clusters might be clusters that are far away from other clusters. Etc. The metric I use is in my demo applet is distance between clusters, where cluster position is determined by the cluster's center of mass.

Anyhow, here's an applet for you to play with:

hierarchical clustering applet

If you'd like to learn more, I recommend Andrew Moore's slides:

K-Means and Hierarchical Clustering.

Monday, January 21, 2008


Stars in the night sky are grouped into galaxies based on their proximity to other stars and other groups of stars. Although astronomers don't seem to have too much trouble identifying galaxies, the problem of clustering data points is a challenging algorithmic equivalent, and finding better algorithms is an import pursuit, because clustering algorithms have broad applications in areas ranging from computer vision to process optimization to movie recommendations.

One simple and effective clustering algorithm is known as k-means. The 'k' implies the number of groups needs to be specified beforehand. The algorithm doesn't try to figure out how many groups exist; rather, given a set of data points and the number of groups (k), the k-means algorithm does its best to sensibly divide the data points into k distinct groups, or clusters.

I'll stick with the star analogy in explaining how it works. Suppose there are a finite number of stars in some volume of space (data points). And also suppose k spaceships suddenly appear at random positions within this volume of space.

Step 1. For each star, find the nearest spaceship and declare it a member of the nearest spaceship's galaxy (cluster). Step 2. After assigning stars to spaceship, move each spaceship to the center of its galaxy (based on averaging all of the locations of the the stars in the galaxy). 3. Repeat Step 1 and Step 2 over and over until the spaceships stop moving.

That's it.

Here's a link to a little demo applet I wrote:


Thursday, January 17, 2008

What Does Goldman Know...

At, Michael Lewis discusses how Goldman Sachs managed to protect itself from the subprime morass:

"But at Goldman there were two intelligences at work: one, the ordinary Wall Street intelligence, which was allowed to get itself in trouble, just as at every other Wall Street firm; the other, more like an extremely smart hedge fund that made its living off the idiocy of big Wall Street firms, including its own people.

A Higher Intelligence

And this second, higher intelligence was allowed to make a mockery of the labors of the first. I can't think of another example of a big Wall Street firm saying so clearly through its trading positions as Goldman Sachs did over the past year that it thinks the rest of its industry, including its own people, is a bunch of idiots. They have obviously designed their firm to take into account their idiocy -- without ever having to put too fine a point on it."


(via The Big Picture)

Wednesday, January 16, 2008

The Suburbs

Once again fascinated by the plethora of content appearing on YouTube. This time, video of the Minneapolis band The Suburbs appearing on a variety show in 1981. Not sure where this clip came from, but the quality is outstanding given the age. The song is Music for Boys from the album Credit in Heaven.

Tuesday, January 15, 2008

Gotta Have My Orange Juice...

One of my heroes... Nobel laureate, quantum physics legend, genius... the late Richard Feynman... Hammering on bongos and belting out a song, the title of which I can only guess is Gotta Have My Orange Juice.

Desperately seeking empty rectangle...

Since the birth of commercially available graphical user interfaces over two decades ago, we've used blank rectangles to indicate where information needs to be typed. In the business, we call the user interface elements "edit controls."

Lately, thanks to the latest user interface trend, I've found myself frustratedly hunting for them, because designers are filling them with instructions like "Enter a word or phrase..."

The example above is far from the most egregious case of this. I've seen much worse at several other sites.

What used to be an easily discovered block of white space has now become more text in a sea of text. The longstanding tradition of informative text inside a rectangle being immutable exacerbates the situation even more.

Monday, January 14, 2008

Mr. Rogers goes to Washington

An old mentor of mine and I used to take turns assigning all the trouble in society to singular causes and, although we knew full well the real reasons were manifold, our futile, reductionist stabs usually led to engaging discussions.

While I bemoaned egalitarianism gone too far, the old professor kvetched about individualism taken to the extreme. As he saw things, Americans were becoming too selfish--they were losing an important sense of civic duty and concern for others and the greater good.

The following video of Mr. Rogers pleading his case at a 1969 Senate hearing leaves me recalling the claims of my wise friend now long gone. Sadly, it seems people like Rogers are getting rarer.

YouTube caption: " In 1969, the US Senate had a hearing on funding the proposed Corporation for Public Broadcasting, but President Nixon wanted it cut in half for the Vietnam War. Sometimes nice guys do win some."

Tuesday, January 08, 2008

Software Engineers of Tomorrow

Lately making the rounds is a DoD journal article on deficiencies in today's software engineering programs. I've often noted the influences of Lisp and Ada in my own education, and while I don't feel I've been sufficiently exposed to recent college grads or paid enough attention to current programs to make any judgments of my own, I do strongly concur with the authors with regard to the elements of software engineering programs identified as important*.

Computer Science Education: Where Are the Software Engineers of Tomorrow?

"It is our view that Computer Science (CS) education is neglecting basic skills, in particular in the areas of programming and formal methods. We consider that the general adoption of Java as a first programming language is in part responsible for this decline. We examine briefly the set of programming skills that should be part of every software professional’s repertoire."


Joel Spolsky chimes in on the subject again: link

*The only point I'd argue with the authors over is the efficacy of the encapsulation facilities in C++.

Monday, January 07, 2008

Running 500 Miles

Today's link. The story of Cliff Young, a 61 year-old potato farmer who won the world's toughest race:

"The whole nation thought he was a crazy old man to undertake an almost impossible feat. Most feared that he would die trying. But this humble old man proved all the critics wrong.Cliff Young, at 61 years of age, participated in 1983’s Sydney to Melbourne race. Considered to be the world’s toughest race, with the distance of 875 (543.7 m) kilometers and took at least 5 days to finish, Cliff Young entered the race against world-class athletes. Read how he achieved the unthinkable and inspires the whole nation."

The whole story is really quite amazing, humorous and heart-warming:


via Reddit

Saturday, January 05, 2008

Happy New Year!

A little late. My candidate for Baby New Year...

Why is this still so incredibly funny ?

Thursday, January 03, 2008

The Last Emperor

Lately, I've found little time to post anything here (or do much of anything else for that matter), so I think I'll throw out a recommendation for a favorite film that recently enjoyed a twentieth birthday. Bernardo Bertolucci's The Last Emperor.

Why do I think it's a great film? A number of reasons. The story of Pu Yi, the last emperor of China, is a fascinating story. Not only did Bertolucci manage to get permission to film in the Forbidden City and offer a first glipse of the site to the rest of the world (good Discover Channel doc: 1,2,3,4,5), it is beautifully shot, and a number of the scenes are unforgettable .

There's the acting of John Lone, Joan Chen and Peter O'Toole. The soundtrack includes some of Ryuichi Sakamoto's finest work. The director's cut runs twenty minutes short of four hours, but I could still watch more. Nine Oscars, here's the trailer.