Comments on Peekaboo: Kernel Approximations for Efficient SVMs (and other feature extraction methods) [update]

2015-08-24T09:32:54.062+02:00

This comment has been removed by the author.

SMO is technically O(N^3), however it empirically ...

2014-01-28T06:35:58.143+01:00

SMO is technically O(N^3), however it empirically behaves as O(N^2+eps) for many problems. This is noted in Platt's original paper as well as the paper that introduced SVM^light.

Thanks :) I don't think SMO has O(N^2). It sti...

2014-01-21T08:13:55.205+01:00

Thanks :)
I don't think SMO has O(N^2). It still solves a QP. In practice it is much faster than standard solvers, but there are a lot of heuristics involved. Can you give a reference for the O(N^2)? I just skimmed the Platt paper but couldn't find any such claim.

Great post, I learnt a lot. For kernelized SVM you...

2014-01-17T08:14:57.830+01:00

Great post, I learnt a lot.
For kernelized SVM you have written: "in general you can assume that the run time is cubic in the number of samples". Doesn't SMO take O(N^2) time, where N is the number of samples?

Excellent post, I was looking for a combination of...

2013-09-21T01:00:42.621+02:00

Excellent post, I was looking for a combination of SGD and kernelization and BINGO :)

Thanks for the nice review. I am looking to implem...

2013-07-11T01:31:41.525+02:00

Thanks for the nice review.
I am looking to implement Kernel Fisher Discriminant and a supervised Gaussian process latent variable model (based on some papers I found online), neither of which is present in sklearn. I would appreciate any insights you have into this.

Thanks. Yes, it should be. After some experiments ...

2013-06-20T21:29:22.360+02:00

Thanks. Yes, it should be. After some experiments I settled on MathJax: http://www.mathjax.org/ It is the definitive answer.

nice post. There is a typo in the polynomial kerne...

2013-06-20T21:18:48.681+02:00

nice post. There is a typo in the polynomial kernel. It should be k(x, y) = (x^Ty + c)^d. On a side note, what do you use to put equations on a blogspot blog?

Thank you for your comments and thank you for shar...

2013-01-13T11:15:14.656+01:00

Thank you for your comments and thank you for sharing your experience.

I knew there should be a way to find the embedding without SVD, I just didn't really have time to investigate. Could you please explain how the Cholesky decomposition might be used?

The K-Means initialization of the prototype vectors is definitely something I should include in the implementation.

It's possible to implement the Nystrom map usi...

2013-01-13T04:31:59.914+01:00

It's possible to implement the Nystrom map using a diagonal-pivot Cholesky instead of the SVD. (The standard Cholesky might run into problems if the gram matrix of the "prototype" vectors is not PD.) In particular, using a Cholesky Crout, you can compute the gram matrix on the fly. The on-the-fly Cholesky also gives rise to a method for selecting the prototype vectors. The K-means method seems to be much better in practice for that selection, but the Cholesky can still be used to compute the feature map for unseen data once the prototype vectors are selected.
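The idea in this comment can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the commenter's code: it assumes an RBF kernel, and the names `pivoted_cholesky`, `nystroem_transform` and the values of `gamma` and `rank` are hypothetical. The pivoted Cholesky evaluates one kernel column per step (so the full gram matrix is never formed), picks the prototype with the largest residual diagonal, and the resulting triangular factor is then used to map unseen data.

```python
import numpy as np

def rbf(X, y, gamma):
    """RBF kernel between every row of X and a single point y."""
    return np.exp(-gamma * np.sum((X - y) ** 2, axis=1))

def pivoted_cholesky(X, gamma, rank, tol=1e-10):
    """Greedy diagonal-pivoted Cholesky of the RBF gram matrix of X.

    Returns the pivot (prototype) indices and a factor L with K ~= L @ L.T.
    Only one kernel column is evaluated per step.
    """
    n = X.shape[0]
    diag = np.ones(n)               # k(x, x) = 1 for the RBF kernel
    L = np.zeros((n, rank))
    pivots = []
    for j in range(rank):
        p = int(np.argmax(diag))    # largest residual diagonal entry
        if diag[p] <= tol:          # remaining residual is negligible
            L = L[:, :j]
            break
        col = rbf(X, X[p], gamma)   # kernel column computed on the fly
        L[:, j] = (col - L[:, :j] @ L[p, :j]) / np.sqrt(diag[p])
        diag = np.maximum(diag - L[:, j] ** 2, 0.0)
        pivots.append(p)
    return np.array(pivots, dtype=int), L

def nystroem_transform(X_new, X_proto, L_proto, gamma):
    """Map new points using the prototypes and their Cholesky block."""
    K_new = np.stack([rbf(X_new, p, gamma) for p in X_proto], axis=1)
    # Solve L_proto @ z = k(prototypes, x) for every new point x.
    return np.linalg.solve(L_proto, K_new.T).T

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
gamma = 0.5                          # hypothetical value for illustration

pivots, L = pivoted_cholesky(X, gamma, rank=20)
X_proto, L_proto = X[pivots], L[pivots]

# Mapping the training points again reproduces the rows of L.
Z_train = nystroem_transform(X, X_proto, L_proto, gamma)
```

Once the prototypes are fixed (by this greedy rule or by K-means), the same `nystroem_transform` step applies to any new sample, and a linear model can be trained on the mapped features.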

the map that is created such that the scalar produ...

2012-12-28T22:10:30.418+01:00

the map is created such that the scalar product in the embedding space is approximately the kernel. maybe i should have said that more explicitly. so yes, you can use it for svr, kernel-pca, kernel-kmeans, anything. let me know how it goes!
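To make "the scalar product in the embedding space is approximately the kernel" concrete, here is a minimal NumPy sketch using random Fourier features for the RBF kernel. The choice of this particular map, the helper names, and the values of `gamma` and `n_components` are assumptions made for illustration and are not taken from the post's code.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Exact RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def random_fourier_features(X, W, b):
    """Map rows of X so that dot products approximate the RBF kernel."""
    n_components = W.shape[1]
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
gamma = 0.5                      # hypothetical kernel width
n_features, n_components = 5, 5000
X = rng.normal(size=(20, n_features))

# For exp(-gamma * ||x - y||^2) the frequencies follow N(0, 2 * gamma * I).
W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_features, n_components))
b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)

Z = random_fourier_features(X, W, b)
K_exact = rbf_kernel(X, X, gamma)
K_approx = Z @ Z.T               # plain scalar products in the embedding
max_abs_error = np.max(np.abs(K_exact - K_approx))
```

Because any kernel method only touches the data through such scalar products, the mapped features `Z` can be handed to a linear SVR, PCA, or k-means in place of the exact kernel.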

Great analysis! I'll keep the kernel approxima...

2012-12-28T21:51:27.486+01:00

Great analysis! I'll keep the kernel approximation in mind next time I work with SVMs in sklearn.

The kernel approximations just generate values approximating the higher-dimensional space. So, in practice, they could also be used for SVR, right?

I used LinearSVC as I did not want to mess with th...

2012-12-28T09:06:34.816+01:00

I used LinearSVC as I did not want to mess with the additional n_iter hyperparam of SGDClassifier and PassiveAggressiveClassifier. We should definitely implement early stopping for those models.

LogisticRegression seems to yield similar performance to LinearSVC but is a bit slower to converge on this data.

As for 3-nn on PCA data, this is an interesting datapoint but it does not compress the data very much and the prediction speed should be quite slow.

Maybe running k-means with 100 centers per class to summarize the data and then running 3-nn versus the kmeans centers would yield good results, possibly after the soft-thresholded 1000 k-means transform.

A very good work and many thanks,

2012-12-28T08:04:35.977+01:00

A very good work and many thanks,

Sounds interesting :) Btw, if you do a pca to 30 d...

2012-12-28T02:02:04.625+01:00

Sounds interesting :) Btw, if you do a pca to 30 dimensions on all samples a 3-nn gets >98% ;)
I didn't experiment much as the exact SVM takes so long on all examples :-/ I rather wanted to get a general feel. You could also compare an exact kernel on less data with an approximate kernel on more data (given a certain computational budget).

Did you use LinearSVC or SGD? Lower C generally makes it faster but it seemed to me less accurate.

Btw, the gamma and C that I used are more or less optimal for the exact rbf. I don't have the models any more, though, and I'm only on my laptop.

I experimented with an alternative kernel expansio...

2012-12-28T01:47:22.801+01:00

I experimented with an alternative kernel expansion based on soft-thresholded cosine similarities to 1000 k-means centers on the PCA dimensionality-reduced samples:
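For readers who want the gist of that pipeline without opening the link, a rough NumPy sketch follows. It is not the code from the gist below: the helper names, the plain Lloyd iterations, the toy data, and the `threshold` value are all assumptions chosen for illustration.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project centered data onto its leading principal directions."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

def l2_normalize(X):
    """Scale each row to unit Euclidean norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def kmeans_centers(X, n_centers, n_iter, rng):
    """A few plain Lloyd iterations, started from random samples."""
    centers = X[rng.choice(X.shape[0], size=n_centers, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(n_centers):
            members = X[labels == k]
            if len(members) > 0:        # keep the old center if empty
                centers[k] = members.mean(axis=0)
    return centers

def soft_threshold_features(X, centers, threshold):
    """Cosine similarity to each center, shifted and clipped at zero."""
    sims = l2_normalize(X) @ l2_normalize(centers).T
    return np.maximum(sims - threshold, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))          # toy stand-in for image data
X_reduced = pca_reduce(X, n_components=5)
centers = kmeans_centers(X_reduced, n_centers=10, n_iter=5, rng=rng)
features = soft_threshold_features(X_reduced, centers, threshold=0.2)
```

The resulting `features` matrix is sparse and non-negative, and would be fed to a linear classifier in place of the raw pixels.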

https://gist.github.com/4393530

It yields ~96% acc on 20k MNIST samples in ~16s on my laptop. Not bad either.

it would give less training time. not sure about a...

2012-12-28T00:59:20.835+01:00

it would give less training time. not sure about accuracy. this is only a subset of the training data.

How many support vectors do you get when using the...

2012-12-28T00:55:52.247+01:00

How many support vectors do you get when using the RBF SVC model (with grid searched C and gamma for optimal accuracy)?

It seems that if you MinMaxScaler.fit_transform th...

2012-12-28T00:53:29.103+01:00

It seems that if you MinMaxScaler.fit_transform the data and then use C=0.01 (strong regularization) for the baseline LinearSVC model you can get ~0.90 test error and faster training times.

Thanks Olivier. The code is here: https://gist.git...

2012-12-27T12:25:11.092+01:00

Thanks Olivier.
The code is here: https://gist.github.com/4387511
I fixed parameters for all runs to gamma = 0.31 and C=100 (for linear and kernelized SVM) - these are values that I know work for the full dataset and exact kernel.

I just realized the approximate kernel SVMs used C=1. Damn. I guess I should run them again.

The linear SVM that runs directly on the data has an accuracy of .857, so it is slightly below the graph.

Yes, thanks :) Glad you like it. I wasn't sure...

2012-12-27T12:15:10.252+01:00

Yes, thanks :)
Glad you like it. I wasn't sure how easy it actually was to read ;)

Thank you for sharing. It is an easy read, like I ...

2012-12-27T09:05:33.666+01:00

Thank you for sharing. It is an easy read, like I like them. :)

I guess you meant k(x,y) = (x y^T + c)^d (not x xT), right?

Very interesting work. I had started something sim...

2012-12-27T04:19:59.277+01:00

Very interesting work. I had started something similar in the past, trying to estimate the minimum dimension needed for a kernel to provide linear separability for a dataset. It didn't get that much attention. You can find it here http://arxiv.org/pdf/0810.4611
You might find it useful.

Nice post Andreas! On the first plot, what is the ...

2012-12-27T01:03:08.126+01:00

Nice post Andreas! On the first plot, what is the test error of the linear SVM model? What are the values of the hyperparameters? Do they change w.r.t. the number of extracted features?