Generating Data for benchmarking clustering algorithms

March 10, 2012

As I have been working on clustering algorithms recently, I spent some time last weekend refactoring code inside sklearn that generates toy data sets for visualizing the results of clustering algorithms.
That looks something like this:

While the first two are nice to show off that your algorithm can handle non-convex clusters, these data sets obviously look nothing like the data you'll see in practice.
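For readers who want to reproduce toy sets of this kind, here is a minimal sketch using the generators that current scikit-learn exposes in `sklearn.datasets`. That the sets meant above are circles, moons and blobs is an assumption, and the parameter values are only illustrative.

```python
from sklearn.datasets import make_blobs, make_circles, make_moons

n_samples = 300

# Two concentric circles: a non-convex case (assumed to be one of the toy sets above).
X_circles, y_circles = make_circles(
    n_samples=n_samples, factor=0.5, noise=0.05, random_state=0
)

# Two interleaving half-moons: another non-convex case.
X_moons, y_moons = make_moons(n_samples=n_samples, noise=0.05, random_state=0)

# Three isotropic Gaussian blobs: the "easy" convex case.
X_blobs, y_blobs = make_blobs(n_samples=n_samples, centers=3, random_state=0)
```

Each call returns a 2D point array and the ground-truth labels, so the same arrays can be fed to any clustering algorithm and plotted afterwards.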

So I wanted a somewhat more general data set generator.
What I ended up doing is a nonparametric mixture of Gaussians.
While Gaussians are a bit boring, combining them with a nonparametric prior makes them somewhat more general.

As I didn't find an easy-to-use package for that (though David pointed out pymc), I went ahead and wrote the generative model down myself.

It's a mixture of Gaussians with a Chinese restaurant process as the prior over the mixture components and Wishart-Gaussian priors for the mean and variance.
You can find the code here.
With this class, you can generate a dataset by:

dpgmm = DPGMMSampler(alpha=10., deg=10, sigma=3, n_features=2)
X = dpgmm.sample(n_samples=100)

In this call, alpha is the concentration parameter of the Chinese restaurant process, deg is the degrees of freedom of the (assumed diagonal) Wishart prior, and sigma is the (diagonal) standard deviation of the Gaussian prior over the means. Here are some examples of what X might look like given the above parameters.

Here, the means of the Gaussians are marked with red diamonds, while the entries of X are blue dots.
The code was pretty straightforward (although I guess it's not as fast as it could be), except for drawing from the Wishart distribution. That was a bit annoying.
I got the idea of how to do it from this. It would be great if scipy could integrate something similar in the future.
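For illustration, one standard way to draw from a Wishart distribution with only NumPy is the Bartlett decomposition. The sketch below is not necessarily the method from the reference above, and `sample_wishart` is a hypothetical helper name, not part of my class. Newer SciPy releases do ship `scipy.stats.wishart`, which is probably the better choice when it is available.

```python
import numpy as np


def sample_wishart(df, scale, rng):
    """Draw one sample from Wishart(df, scale) via the Bartlett decomposition.

    Hypothetical helper for illustration; `scale` must be symmetric
    positive definite and `df` must exceed its dimension minus one.
    """
    scale = np.asarray(scale, dtype=float)
    p = scale.shape[0]
    if df <= p - 1:
        raise ValueError("df must be greater than n_features - 1")

    # Lower-triangular factor A: square roots of chi-square draws on the
    # diagonal (df, df - 1, ..., df - p + 1), standard normals below it.
    A = np.zeros((p, p))
    A[np.diag_indices(p)] = np.sqrt(rng.chisquare(df - np.arange(p)))
    rows, cols = np.tril_indices(p, k=-1)
    A[rows, cols] = rng.standard_normal(rows.size)

    # With scale = L L^T, the matrix (L A)(L A)^T is Wishart(df, scale).
    L = np.linalg.cholesky(scale)
    LA = L @ A
    return LA @ LA.T


rng = np.random.default_rng(0)
W = sample_wishart(df=10, scale=np.eye(2), rng=rng)
```

The result `W` is a symmetric positive definite matrix whose expectation is `df * scale`. In the diagonal case assumed in the model above, this reduces to independent scaled chi-square draws per feature.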

Btw, I have not really proved the correctness of my code, as the main point was to generate some nice samples. If you need to know that the model is correct, you might want to check the above reference and the code ;)
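To make the generative process concrete, here is a minimal, independent sketch of it in plain NumPy. It is not the code of `DPGMMSampler`: the function names are hypothetical, and the zero prior mean and the `1 / deg` scale of the diagonal Wishart are my assumptions for the illustration.

```python
import numpy as np


def sample_crp_assignments(n_samples, alpha, rng):
    """Assign n_samples points to clusters with a Chinese restaurant process."""
    assignments = np.empty(n_samples, dtype=int)
    counts = []  # number of points already in each cluster
    for i in range(n_samples):
        # Existing cluster k has weight counts[k], a new cluster has weight
        # alpha; the weights sum to i + alpha.
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):
            counts.append(0)  # open a new cluster
        counts[k] += 1
        assignments[i] = k
    return assignments


def sample_crp_gaussian_mixture(n_samples, alpha=10.0, deg=10, sigma=3.0,
                                n_features=2, seed=0):
    """Sample points from a CRP mixture of diagonal Gaussians (illustrative)."""
    rng = np.random.default_rng(seed)
    z = sample_crp_assignments(n_samples, alpha, rng)
    n_clusters = z.max() + 1

    # Cluster means from a zero-centered Gaussian prior with std sigma
    # (the zero center is an assumption).
    means = rng.normal(0.0, sigma, size=(n_clusters, n_features))

    # Diagonal Wishart: each precision entry is a scaled chi-square draw.
    # The 1 / deg scale (expected precision of 1) is an assumption.
    precisions = rng.chisquare(deg, size=(n_clusters, n_features)) / deg
    stds = 1.0 / np.sqrt(precisions)

    # Each point is its cluster mean plus per-feature Gaussian noise.
    X = means[z] + stds[z] * rng.standard_normal((n_samples, n_features))
    return X, z, means


X, z, means = sample_crp_gaussian_mixture(n_samples=100)
```

Larger `alpha` makes new clusters more likely, so the number of rows in `means` grows with both `alpha` and `n_samples` instead of being fixed in advance.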
