tag:blogger.com,1999:blog-7345806147365425073.post9105585920150097490..comments2017-06-05T12:57:55.785+02:00Comments on Peekaboo: Kernel Approximations for Efficient SVMs (and other feature extraction methods) [update]Andreas Muellerhttps://plus.google.com/112544080214919445884noreply@blogger.comBlogger25125tag:blogger.com,1999:blog-7345806147365425073.post-75382005864028546472015-08-24T09:34:35.675+02:002015-08-24T09:34:35.675+02:00Nice Website...
Hey JOIN now fblikesbot.com and In...Nice Website...<br />Hey JOIN now fblikesbot.com and Increase Facebook Likes your profile and websites.<br /><a href="http://www.fblikesbot.com" rel="nofollow">Increase Facebook Likes</a> and check your website worth <a href="http://www.worthmywebsites.com" rel="nofollow">worth my websites</a> <br />its may be very beneficial for you also reallyAnonymousnoreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-43812136818023889882015-08-24T09:32:54.062+02:002015-08-24T09:32:54.062+02:00This comment has been removed by the author.Shiya Priyahttp://www.blogger.com/profile/04485413466562852725noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-62186212093490415342014-01-28T06:35:58.143+01:002014-01-28T06:35:58.143+01:00SMO is technically O(N^3), however it empirically ...SMO is technically O(N^3), however it empirically behaves as O(N^2+eps) for many problems. This is noted in Platt's original paper as well as the paper that introduced SVM^light. Edward Raffhttp://www.blogger.com/profile/04050596766362107081noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-71162652421971031512014-01-21T08:13:55.205+01:002014-01-21T08:13:55.205+01:00Thanks :)
I don't think SMO has O(N^2). It sti...Thanks :)<br />I don't think SMO has O(N^2). It still solves a QP. In practice it is much faster than standard solvers, but there is a lot of heuristics involved. Can you give a reference for the O(N^2)? I just skimmed the Platt paper but couldn't find any claim.Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-57724650995766105152014-01-17T08:14:57.830+01:002014-01-17T08:14:57.830+01:00Great post, I learnt a lot.
For kernelized SVM you...Great post, I learnt a lot.<br />For kernelized SVM you have written: "in general you can assume that the run time is cubic in the number of samples". Doesn't SMO takes O(N^2) time,where N is the number of samples? <br />Piyush Bhardwajhttp://www.blogger.com/profile/06546533822985933679noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-73098927551240460392013-09-21T01:00:42.621+02:002013-09-21T01:00:42.621+02:00Excellent post, I was looking for a combination of...Excellent post, I was looking for a combination of SGD and kernelization and BINGO :)Marionoreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-16509408179702901232013-07-11T01:31:41.525+02:002013-07-11T01:31:41.525+02:00Thanks for the nice review.
I am looking to implem...Thanks for the nice review.<br />I am looking to implement Kernel Fisher Discriminant, and supervised Gaussian process latent variable model (based on some papers I found online), neither one present in sklearn. I appreciate if you have any insights into this.Dashesyhttp://www.blogger.com/profile/09867642749547058056noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-61537078840826757902013-06-20T21:29:22.360+02:002013-06-20T21:29:22.360+02:00Thanks. Yes, it should be. After some experiments ...Thanks. Yes, it should be. After some experiments I settled with MathJax: http://www.mathjax.org/ It is the definite answer.Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-77525847578807699882013-06-20T21:18:48.681+02:002013-06-20T21:18:48.681+02:00nice post. There is a typo in the polynomial kerne...nice post. There is a typo in the polynomial kernel. It should be k(x, y) = (x^Ty + c)^d. On a side not, what do you use to put equations on a blogspot blog.Deepak Roy Chittajalluhttp://www.blogger.com/profile/05910412379021494373noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-58342225566099847962013-01-13T11:15:14.656+01:002013-01-13T11:15:14.656+01:00Thank you for your comments and thank you for shar...Thank you for your comments and thank you for sharing your experience.<br /><br />I knew there should be a way to find the embedding without SVD, I just didn't really have time to investigate. Could you please explain how the Cholesky decomposition might be used?<br /><br />The K-Means initialization of the prototype vectors is definitely something I should include in the implementation.Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-12412392489886227532013-01-13T04:31:59.914+01:002013-01-13T04:31:59.914+01:00It's possible to implement the Nystrom map usi...It's possible to implement the Nystrom map using a diagonal-pivot Cholesky instead of the SVD. (The standard Cholesky might run into problems if the gram matrix of the "prototype" vectors is not PD.) In particular, using a Cholesky Crout, you can compute the gram matrix on the fly. The on-the-fly Cholesky also gives rise to a method to select the prototype vectors---though the K-means method seems to be much better in practice, but it can still be used to compute the feature map for unseen data once the prototype vectors are selected.Thomas Vacekhttp://www.blogger.com/profile/13476957969874261071noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-37462311424780899602012-12-28T22:10:30.418+01:002012-12-28T22:10:30.418+01:00the map that is created such that the scalar produ...the map that is created such that the scalar product in the embedding space is approximately the kernel. maybe i should have said that more explicitly. so yes, you can use it for svr, kernel-pca, kernel-kmeans, anything. let me know how it goes!Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-61461703609473323972012-12-28T21:51:27.486+01:002012-12-28T21:51:27.486+01:00Great analysis! I'll keep the kernel approxima...Great analysis! I'll keep the kernel approximation in mind next time I work with SVMs in sklearn. <br /><br />The kernel approximations just generate values approximating the higher-dimension space. So, in practice, they could also be used for SVR, right?Alexhttp://acompa.netnoreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-85463089252192267872012-12-28T09:06:34.816+01:002012-12-28T09:06:34.816+01:00I used LinearSVC as I did not want to mess with th...I used LinearSVC as I did not want to mess with the additional n_iter hyperparam of SGDClassifier and PassiveAggressiveClassifier. We should definitely implement early stopping for those models.<br /><br />LogisticRegression seems to yield similar performance as LinearSVC but is a bit slower to converge on this data.<br /><br />As for 3-nn on PCA data, this is an interesting datapoint but it does not compress the data very much and the prediction speed should be quite slow.<br /><br />Maybe running k-means with 100 centers per class to summarize the data and then running 3-nn versus the kmeans centers would yield good results, possibly after the soft-thresholded 1000 k-means transform.Olivier Griselhttp://www.blogger.com/profile/05751090858946703320noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-51174341624331907402012-12-28T08:04:35.977+01:002012-12-28T08:04:35.977+01:00A very good work and many thanks,
A very good work and many thanks,<br />houchenghttp://www.blogger.com/profile/10605813285775865055noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-43190534282554888872012-12-28T02:02:04.625+01:002012-12-28T02:02:04.625+01:00Sounds interesting :) Btw, if you do a pca to 30 d...Sounds interesting :) Btw, if you do a pca to 30 dimensions on all samples a 3-nn gets >98% ;)<br />I didn't experiment much as the exact SVM takes so long on all examples :-/ I rather wanted to get a general feel. You could also compare an exact kernel on less data with an approximate kernel on more data (given a certain computational budget).<br /><br />Did you use LinearSVC or SGD? Lower C generally makes it faster but I seemed to me less accurate.<br /><br />Btw, the gamma and C that I used are more or less optimal for the exact rbf. I don't have the models any more, though and I'm only on my laptop.Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-36465650960259500372012-12-28T01:47:22.801+01:002012-12-28T01:47:22.801+01:00I experimented with an alternative kernel expansio...I experimented with an alternative kernel expansion base on soft-thresholded cosine similarities to 1000 k-means center on the PCA dim reduced samples:<br /><br />https://gist.github.com/4393530<br /><br />It yields ~96% acc on 20k MNIST samples in ~16s on my laptop. Not bad either.Olivier Griselhttp://www.blogger.com/profile/05751090858946703320noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-18560547195385341472012-12-28T00:59:20.835+01:002012-12-28T00:59:20.835+01:00it would give less training time. not sure about a...it would give less training time. not sure about accuracy. this is only a subset of the training data.Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-21193957891157353502012-12-28T00:55:52.247+01:002012-12-28T00:55:52.247+01:00How many support vectors do you get when using the...How many support vectors do you get when using the RBF SVC model (with grid searched C and gamma for optimal accuracy)?Olivier Griselhttp://www.blogger.com/profile/05751090858946703320noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-45382610403195415652012-12-28T00:53:29.103+01:002012-12-28T00:53:29.103+01:00It seems that if you MinMaxScaler.fit_transform th...It seems that if you MinMaxScaler.fit_transform the data and then use C=0.01 (high regularizer) for the baseline LinearSVC model you can ~0.90 test error and faster training times. Olivier Griselhttp://www.blogger.com/profile/05751090858946703320noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-88542528304083952172012-12-27T12:25:11.092+01:002012-12-27T12:25:11.092+01:00Thanks Olivier.
The code is here: https://gist.git...Thanks Olivier.<br />The code is here: https://gist.github.com/4387511<br />I fixed parameters for all runs to gamma = 0.31 and C=100 (for linear and kernelized SVM) - these are values that I know work for the full dataset and exact kernel.<br /><br />I just realized the approximate kernel SVMs used C=1. Damn. I guess I should run them again.<br /><br />The linear SVM that runs directly on the data has an accuracy of .857, so it is slightly below the graph.Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-59278907129846107712012-12-27T12:15:10.252+01:002012-12-27T12:15:10.252+01:00Yes, thanks :)
Glad you like it. I wasn't sure...Yes, thanks :)<br />Glad you like it. I wasn't sure how easy it actually was to read ;)Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-13902408879914995692012-12-27T09:05:33.666+01:002012-12-27T09:05:33.666+01:00Thank you for sharing. It is an easy read, like I ...Thank you for sharing. It is an easy read, like I like them. :)<br /><br />I guess you meant k(x,y) = (x y^T + c)^d (not x xT), right?EOL (Eric O LEBIGOT)http://www.blogger.com/profile/08843567382834026704noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-30278216531808432682012-12-27T04:19:59.277+01:002012-12-27T04:19:59.277+01:00Very interesting work. I had started something sim...Very interesting work. I had started something similar in the past, trying to estimate the minimum dimension needed for a kernel to provide linear separability for a dataset. It didn't get that much attention. You can find it here http://arxiv.org/pdf/0810.4611<br />You might find it useful.nvasilhttp://www.blogger.com/profile/09970193239160690578noreply@blogger.comtag:blogger.com,1999:blog-7345806147365425073.post-22229521964252603292012-12-27T01:03:08.126+01:002012-12-27T01:03:08.126+01:00Nice post Andreas! On the first plot, what is the ...Nice post Andreas! On the first plot, what is the test error of the linear SVM model? What are the value of the hyperparameters? Do they change w.r.t. the number of extracted features?Olivier Griselhttp://www.blogger.com/profile/05751090858946703320noreply@blogger.com