As you have hopefully heard, we at scikit-learn are doing a user survey (which is still open, by the way). One of the requests there was to provide some sort of flow chart on how to do machine learning. As this is clearly impossible, I went to work straight away. This is the result: [edit2] Clarification: with ensemble classifiers and ensemble regressors I mean random forests, extremely randomized trees, gradient boosted trees, and the soon-to-come weight-boosted trees (AdaBoost). [/edit2] Needless to say, this sheet is completely authoritative.
Last week I was at PyCon DE, the German Python conference. After hacking on scikit-learn a lot last week, I decided to do something different on my way back that I had planned for quite a while: doing a Wordle-like word cloud. I know, word clouds are a bit out of style, but I kind of like them anyway. My motivation to think about word clouds was that I thought they could be combined with topic models to give somewhat more interesting visualizations. So I looked around to find a nice open-source implementation of word clouds ... only to find none. (This has been a while; maybe it has changed since.) While I was bored on the train last week, I came up with this code. A little today-themed taste:
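The frequency-counting step that drives a word cloud (font size proportional to word count) can be sketched with just the standard library; the layout and drawing are what the linked code adds on top. This is an illustrative sketch, not the actual implementation:

```python
from collections import Counter
import re


def word_frequencies(text, max_words=50):
    """Count word occurrences -- the counts a word cloud scales font sizes by."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(max_words)


freqs = word_frequencies("the quick brown fox jumps over the lazy dog the fox")
# 'the' appears three times, 'fox' twice, so they would be drawn largest
```

A real word cloud then greedily places each word at a font size proportional to its count, checking for collisions with already-placed words.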
Recently we added another method for kernel approximation, the Nyström method, to scikit-learn, which will be featured in the upcoming 0.13 release. Kernel approximations were my first somewhat bigger contribution to scikit-learn and I have been thinking about them for a while. To dive into kernel approximations, first recall the kernel trick.
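The core of the Nyström method fits in a few lines of NumPy: sample m landmark points, then approximate the full n-by-n kernel matrix K by K_nm W⁺ K_nmᵀ, where K_nm is the kernel between all points and the landmarks and W is the kernel among the landmarks. A minimal sketch (data, gamma, and the number of landmarks are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 5)


def rbf_kernel(A, B, gamma=0.5):
    # squared Euclidean distances between all rows of A and all rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)


# exact kernel matrix: O(n^2) entries, the thing we want to avoid computing
K = rbf_kernel(X, X)

# Nystroem: pick m landmark points and build a rank-m approximation
m = 30
idx = rng.choice(len(X), m, replace=False)
K_nm = rbf_kernel(X, X[idx])        # n x m: all points vs. landmarks
W = rbf_kernel(X[idx], X[idx])      # m x m: landmarks vs. landmarks
K_approx = K_nm @ np.linalg.pinv(W) @ K_nm.T

err = np.abs(K - K_approx).max()
```

On the landmark points themselves the approximation is exact; elsewhere it is a low-rank interpolation, and the error shrinks as m grows.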
[update] This post is a bit old, but many people still seem interested. So just a short update: nowadays I would use Python and scikit-learn to do this. Here is an example of how to do cross-validation for SVMs in scikit-learn. Scikit-learn even downloads MNIST for you. [/update] MNIST is, for better or worse, one of the standard benchmarks for machine learning and is also widely used in the neural networks community as a toy vision problem. Just for the unlikely case that anyone is not familiar with it: it is a dataset of handwritten digits, 0-9, in black on white background. It looks something like this: There are 60000 training and 10000 test images, each 28x28 grayscale. There are roughly the same number of examples of each category in the test and training datasets. I used it in some papers myself, even though there are some reasons why it is a little weird. Some not-so-obvious (or maybe they are) facts: - The images actually contain a 20x20 patch of digits, centered in the 28x28 frame.
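The cross-validation workflow mentioned in the update can be sketched with scikit-learn's model-selection helpers. This uses the small built-in digits dataset as a quick stand-in for MNIST, and the subset size and hyperparameters are illustrative assumptions, not tuned values:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# small built-in digit dataset (8x8 images), a fast stand-in for MNIST here
X, y = load_digits(return_X_y=True)

# 3-fold cross-validation of an RBF-kernel SVM on a subset, for speed;
# gamma and C are illustrative, not tuned
scores = cross_val_score(SVC(kernel="rbf", gamma=0.001, C=10.0),
                         X[:500], y[:500], cv=3)
print(scores.mean())
```

For MNIST itself the same code applies; only the data loading (and the runtime) changes.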
Ever wanted to create an empty Python function in-line? No? Me neither. But my coworker does... We found the answer: <pre>x = lambda: None</pre> If you ever needed this, please tell me ;) [Edit] Apparently some people do need it :) Btw, most people seem to try lambda: pass first. But you have to remember: the thing on the right side of the colon is the return value of the function, not the body! So an expression is needed, not a statement. [/Edit]
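A minimal illustration of that last point: `None` is an expression, so it can follow the colon, while `pass` is a statement and cannot.

```python
# the in-line empty function: a lambda whose body is the expression None
noop = lambda: None
result = noop()  # calling it does nothing and returns None

# `lambda: pass` is a SyntaxError, because `pass` is a statement.
# The statement-based equivalent needs a full def:
def noop_def():
    pass
```

Both `noop()` and `noop_def()` return `None`; the lambda version is just the only way to get there in a single expression.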