Showing posts from March, 2012

Generating Data for benchmarking clustering algorithms

As I have been working on clustering algorithms recently, I invested some time last weekend refactoring some code inside sklearn that generates toy data sets for visualizing the results of clustering algorithms. That looks something like this:

While the first two are nice for showing off that your algorithm can handle non-convex clusters, these data sets obviously look nothing like the data you'll see in practice. So I wanted a somewhat more general data set generator. What I ended up doing is a nonparametric mixture of Gaussians. While Gaussians are a bit boring, combining them with a nonparametric prior makes them somewhat more general. As I didn't find a really easy-to-use package for that (though David pointed out pymc), I went ahead and wrote the generative model down myself. It's a mixture of Gaussians with a Chinese restaurant process as the prior over the mixture components and Wishart-Gaussian priors for the mean and variance. You can find the