K-means on CUDA with CUV
Since I felt that our CUDA library, CUV, was a little lifeless without some examples of how it can be used, I started to work on a K-means implementation using it.
The idea is to have something like

from cuv_python import kmeans
[clusters, assignments] = kmeans(data)

run k-means on the GPU.
There are some papers on k-means on the GPU out there, but I thought I'd see how far a little coding would get me.
As a (maybe slightly atypical) example, I chose MNIST as the dataset. My experiments used k=10 or k=20, but this should not be a restriction of the implementation.
First, I tried to do it without adding anything to CUV, just using simple matrix operations that were already there.
The result is something like this:
import numpy as np
import cuv_python as cp

# mnist: numpy array with one sample per column; pick random samples
# as the initial clusters
num_clusters = 10
rand_indices = np.random.randint(0, mnist.shape[1], num_clusters)
clusters = mnist[:, rand_indices]

mnist_dev = cp.push(mnist)
# copy('F') is necessary so we can slice later on
clusters_dev = cp.push(clusters.copy("F"))

# squared norms of the samples (these never change)
norms = cp.dev_matrix_cmf(mnist_dev.w, 1)
cp.reduce_to_row(norms.vec, mnist_dev, cp.reduce_functor.ADD_SQUARED)

norms_clusters = cp.dev_matrix_cmf(num_clusters, 1)
dists = cp.dev_matrix_cmf(mnist_dev.w, num_clusters)
nearest = cp.dev_matrix_cmf(mnist_dev.w, 1)
nearest_dist = cp.dev_matrix_cmf(mnist_dev.w, 1)

for i in xrange(10):
    # squared norms of the current clusters
    cp.reduce_to_row(norms_clusters.vec, clusters_dev, cp.reduce_functor.ADD_SQUARED)
    # dists = -2 * mnist^T * clusters
    cp.prod(dists, mnist_dev, clusters_dev, 't', 'n', -2, 0)
    # complete ||x||^2 - 2<x,c> + ||c||^2
    cp.matrix_plus_row(dists, norms_clusters.vec)
    cp.matrix_plus_col(dists, norms.vec)
    cp.reduce_to_col(nearest.vec, dists, cp.reduce_functor.ARGMIN)
    # compute distances as possible stopping criterion
    cp.reduce_to_col(nearest_dist.vec, dists, cp.reduce_functor.MIN)
    # print("Average dist to cluster: %f" % cp.mean(nearest_dist))
    # naive cluster recomputation: one indicator vector per cluster
    nearest_host = nearest.np
    for j in xrange(num_clusters):
        indices = nearest_host == j
        indices_dev = cp.push(indices)
        tmp = mnist_dev.copy()
        cp.matrix_times_row(tmp, indices_dev.vec)
        mean_dev = cp.dev_matrix_cmf(mnist_dev.h, 1)
        cp.reduce_to_col(mean_dev.vec, tmp)
        mean_dev *= 1. / indices.sum()
        clusters_dev[:, j] = mean_dev
cp.pull(clusters_dev)
So I push everything to the device and start with some random samples as clusters.
I compute pairwise distances using the "binomial formula trick" (see Alex Smola's blog).
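In case the trick is unfamiliar: expanding ||x - c||^2 = ||x||^2 - 2<x, c> + ||c||^2 turns all pairwise distances into one matrix product plus two broadcasted additions of squared norms. A minimal NumPy sketch (hypothetical names; samples stored column-wise, as in the code above):

import numpy as np

def pairwise_sq_dists(X, C):
    # X: (d, n) samples in columns; C: (d, k) clusters in columns
    xn = (X ** 2).sum(axis=0)[:, None]  # (n, 1) squared sample norms
    cn = (C ** 2).sum(axis=0)[None, :]  # (1, k) squared cluster norms
    return xn - 2.0 * X.T.dot(C) + cn   # (n, k) squared distances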
Then I use an argmin over the distance matrix to get the cluster assignments. So far, so good. But recomputing the clusters is somewhat cumbersome. At that point, there was no specific kernel for doing that, so I resorted to a pretty naive method: I pull everything to the host, find the samples belonging to a given cluster, multiply all samples with an indicator vector and sum over that.
Pretty ugly, but it works. To get a feel for how (in)efficient that implementation is, I compared it with scipy.cluster.
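For reference, the CPU side of the comparison looks something like this (a sketch of the benchmark, not the exact script; scipy.cluster.vq.kmeans2 expects one sample per row):

import time
from scipy.cluster.vq import kmeans2

t0 = time.time()
# 10 clusters, 10 iterations, initialized from random samples
centroids, labels = kmeans2(mnist.T, 10, iter=10, minit='points')
print("Scipy time: %f" % (time.time() - t0))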
The result for 10 iterations and 10 clusters:
Scipy time: 378.315680
CUV naive time: 32.112533
More than a 10x speedup (378.3s vs. 32.1s, so about 11.8x). Not too bad for a first try, and without writing a single line of C++/CUDA. I think that illustrates nicely how even a naive GPU implementation can improve a lot over a CPU implementation, and how easy it is to build an algorithm in Python starting from the CUV functionality.
But of course, I was not satisfied with that.
So I wrote a kernel, cuv_python.compute_clusters, to do what the inner for-loop over clusters does in the code above.
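In NumPy terms, all that kernel has to produce is the per-cluster mean; a host-side sketch of its contract (hypothetical function name, empty clusters left unhandled):

import numpy as np

def compute_clusters_np(data, nearest, num_clusters):
    # data: (d, n) samples in columns; nearest: (n,) cluster index per sample
    clusters = np.empty((data.shape[0], num_clusters), dtype=data.dtype)
    for j in xrange(num_clusters):
        clusters[:, j] = data[:, nearest == j].mean(axis=1)
    return clusters

With that, the main loop shrinks to: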
for i in xrange(10):
    cp.reduce_to_row(norms_clusters.vec, clusters_dev, cp.reduce_functor.ADD_SQUARED)
    cp.prod(dists, mnist_dev, clusters_dev, 't', 'n', -2, 0)
    cp.matrix_plus_row(dists, norms_clusters.vec)
    cp.matrix_plus_col(dists, norms.vec)
    cp.reduce_to_col(nearest.vec, dists, cp.reduce_functor.ARGMIN)
    # compute distances as possible stopping criterion
    cp.reduce_to_col(nearest_dist.vec, dists, cp.reduce_functor.MIN)
    # print("Average dist to cluster: %f" % cp.mean(nearest_dist))
    cp.compute_clusters(clusters_dev, mnist_dev, nearest.vec)
cp.pull(clusters_dev)

The result:
CUV naive time: 32.111770
CUV optimized time: 0.159589

Wow! And that even though "optimized" here is a euphemism for "first thing that worked".
I am actually a bit sceptical about the last numbers; they just seem too good to be true. But check it out yourself: the code is in the examples folder on our github page.
There are some restrictions on the number of clusters at the moment. I definitely want it to work for bag-of-visual-words applications, which I guess means 128-dimensional features and at most 1000 clusters. I haven't tried that yet, but I will. If I can beat ikmeans in vlfeat, that would be quite nice :) (it's an SSE2-optimized implementation, which is what I am currently using).
To be a little more specific about the current restrictions on cuv_python.compute_clusters: 16*(16+1)*num_clusters*sizeof(featuretype) bytes have to fit into the shared memory of your card. Of course, one can just change the 16 to something else, but I think my current implementation will scale badly with the number of clusters.
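A quick back-of-the-envelope check of that bound (a sketch; the 48 KB figure is an assumption for a Fermi-class card, adjust for yours):

# bound: 16*(16+1)*num_clusters*sizeof(featuretype) <= shared memory
shared_mem_bytes = 48 * 1024   # assumed 48 KB of shared memory per block
sizeof_float = 4               # assuming float features
max_clusters = shared_mem_bytes // (16 * (16 + 1) * sizeof_float)
print(max_clusters)            # -> 45, consistent with the ~40-cluster limit below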
Oh, and as an afterthought: in the first version you might have noticed the line

clusters_dev[:, j] = mean_dev

and yes, that is Python slicing of device matrices. I love syntactic sugar :)
[edit]
The version I uploaded works for up to 40 clusters. With minor modifications, about 700 clusters are possible; I think that should be enough for most BoW applications. Measurements on MNIST with 700 clusters show a speedup of about two orders of magnitude for 10 iterations compared to ikmeans in vlfeat (using my python wrappers). I used the makefile that is shipped with vlfeat, but I am not sure it turns on all optimizations, and I think it only runs on one CPU, so this is just a rough estimate. I'll compare results in a little more detail if I find the time. [/edit]