Showing posts from March, 2012

Generating Data for benchmarking clustering algorithms

As I have been working on clustering algorithms recently, I spent some time last weekend refactoring code inside sklearn to generate toy data sets for visualizing the results of clustering algorithms.
That looks something like this:

While the first two are nice to show off that your algorithm can handle non-convex clusters, these data sets obviously look nothing like the data you'll see in practice.
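
For reference, toy sets of this kind can be produced with a few calls. This is a minimal sketch that assumes the data sets meant here correspond to what current scikit-learn exposes as `make_circles`, `make_moons` and `make_blobs`; the sample size, noise levels and number of blob centers are illustrative choices, not values from the original code.

```python
from sklearn.datasets import make_blobs, make_circles, make_moons

n_samples = 500  # illustrative size, not taken from the original post

# Two concentric circles: a non-convex two-cluster problem.
circles_X, circles_y = make_circles(
    n_samples=n_samples, factor=0.5, noise=0.05, random_state=0
)

# Two interleaving half moons: another non-convex two-cluster problem.
moons_X, moons_y = make_moons(n_samples=n_samples, noise=0.05, random_state=0)

# Isotropic Gaussian blobs: the "easy" convex case.
blobs_X, blobs_y = make_blobs(n_samples=n_samples, centers=3, random_state=0)

for name, X in [("circles", circles_X), ("moons", moons_X), ("blobs", blobs_X)]:
    print(name, X.shape)
```

Each call returns the points together with their ground-truth cluster labels, which is what makes these sets handy for eyeballing a clustering result.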

So I wanted a somewhat more general data set generator.
What I ended up with is a nonparametric mixture of Gaussians.
While Gaussians are a bit boring, combining them with a nonparametric prior makes them somewhat more general.
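
To make "more general" a bit more concrete: with a nonparametric prior such as the Chinese restaurant process, the number of clusters is not fixed in advance but grows with the amount of data. The sketch below is only an illustration of that effect using the standard library; the function name `sample_crp` and the concentration value `alpha` are hypothetical and not taken from the original code.

```python
import random


def sample_crp(n_points, alpha, seed=0):
    """Draw cluster assignments from a Chinese restaurant process.

    Hypothetical helper for illustration; `alpha` is the concentration
    parameter (larger values open new clusters more readily).
    """
    rng = random.Random(seed)
    assignments = []
    counts = []  # counts[k] = number of points already assigned to cluster k
    for _ in range(n_points):
        # An existing cluster k is chosen with weight counts[k],
        # a brand-new cluster with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts


for n in (10, 100, 1000):
    _, counts = sample_crp(n, alpha=2.0)
    print(n, "points ->", len(counts), "clusters")
```

The first point always opens cluster 0, and large clusters tend to attract more points, so the cluster sizes typically come out quite unbalanced.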

As I didn't find an easy-to-use package for that (though David pointed out pymc), I went ahead and wrote the generative model down myself.

It's a mixture of Gaussians with a Chinese restaurant process as the prior for the mixture components and Wishart-Gaussian priors for the mean and variance.
You can find the code here.
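
Independently of that code, here is a minimal sketch of such a generative model, to give an idea of how little is needed. It reads "Wishart-Gaussian priors" as a Normal-Wishart prior on each cluster's mean and precision, which is an assumption. The function name `sample_crp_gaussian_mixture` and all hyperparameters (`alpha`, `dof`, `kappa`) are hypothetical, and only NumPy is used.

```python
import numpy as np


def sample_crp_gaussian_mixture(n_samples=300, n_features=2, alpha=1.0,
                                dof=None, kappa=0.01, seed=0):
    """Sample points from a Gaussian mixture with a CRP prior on assignments.

    Hypothetical sketch: per cluster, precision ~ Wishart(I, dof) and
    mean | precision ~ N(0, (kappa * precision)^-1), i.e. a Normal-Wishart prior.
    """
    rng = np.random.default_rng(seed)
    dof = n_features + 2 if dof is None else dof  # dof >= n_features keeps it invertible
    X = np.empty((n_samples, n_features))
    labels = np.empty(n_samples, dtype=int)
    counts, means, covs = [], [], []
    for i in range(n_samples):
        # CRP step: existing cluster k has weight counts[k], a new one has weight alpha.
        weights = np.array(counts + [alpha], dtype=float)
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        if k == len(counts):
            # New cluster: a Wishart(I, dof) draw is a sum of dof outer
            # products of standard normal vectors.
            G = rng.standard_normal((dof, n_features))
            cov = np.linalg.inv(G.T @ G)
            cov = (cov + cov.T) / 2  # remove numerical asymmetry
            covs.append(cov)
            # Small kappa spreads the cluster means far apart relative to cov.
            means.append(rng.multivariate_normal(np.zeros(n_features), cov / kappa))
            counts.append(0)
        counts[k] += 1
        labels[i] = k
        X[i] = rng.multivariate_normal(means[k], covs[k])
    return X, labels


X, labels = sample_crp_gaussian_mixture(n_samples=300, alpha=2.0)
print(X.shape, "points in", labels.max() + 1, "clusters")
```

Larger `alpha` gives more clusters, and smaller `kappa` pushes the cluster centers further apart relative to their spread, so those two knobs control how hard the resulting data set is.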
With t…