As you hopefully have heard, we at scikit-learn are doing a user survey (which is still open by the way).
One of the requests there was to provide some sort of flow chart on how to do machine learning.
As this is clearly impossible, I went to work straight away.
This is the result:
Clarification: By ensemble classifiers and ensemble regressors I mean random forests, extremely randomized trees, gradient boosted trees, and the soon-to-come weight-boosted trees (AdaBoost).
Needless to say, this sheet is completely authoritative.
Last week I was at PyCon DE, the German Python conference. After hacking on scikit-learn a lot last week, I decided to do something different on my way back, something I had planned for quite a while:
doing a Wordle-like word cloud.
I know, word clouds are a bit out of style, but I kind of like them anyway. My motivation for thinking about word clouds was that they could be combined with topic models to give somewhat more interesting visualizations.
So I looked around to find a nice open-source implementation of word clouds ... only to find none. (This was a while ago; maybe it has changed since.)
While I was bored on the train last week, I came up with this code.
A little today-themed taste:
Recently we added another method for kernel approximation, the Nyström method,
to scikit-learn, which will be featured in the upcoming 0.13 release.
Kernel approximations were my first somewhat bigger contribution to
scikit-learn and I have been thinking about them for a while.
To dive into kernel approximations, first recall the kernel trick.
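As a quick reminder of what the kernel trick buys you, here is a minimal sketch in plain Python. It assumes a homogeneous degree-2 polynomial kernel on 2-D inputs, which is just a convenient toy case and not one of the kernels discussed in the post. The names `poly_kernel` and `explicit_features` are made up for this illustration. The point is that the kernel value equals an inner product in a larger feature space, without ever building that space.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


def explicit_features(x):
    """Explicit feature map for 2-D input matching poly_kernel.

    phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2), so that
    phi(x) . phi(y) = (x . y)^2.
    """
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x = (1.0, 2.0)
y = (3.0, -1.0)

# Implicit: evaluate the kernel directly on the inputs.
k_implicit = poly_kernel(x, y)

# Explicit: map both points to feature space, then take the inner product.
k_explicit = dot(explicit_features(x), explicit_features(y))

print(k_implicit, k_explicit)
```

Kernel approximation goes the other way round: it builds an explicit (approximate) feature map so that a linear method on the mapped data behaves roughly like the kernel method.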
[update]This post is a bit old, but many people still seem interested. So just a short update:
Nowadays I would use Python and scikit-learn to do this. Here is an example of how to do cross-validation for SVMs in scikit-learn. Scikit-learn even downloads MNIST for you. [/update]
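The example linked in the update is not reproduced here. As a rough sketch of what cross-validating an SVM in scikit-learn can look like, the snippet below uses the small digits dataset that ships with scikit-learn as a stand-in for MNIST, so it runs without a download. The parameter grid and the 3-fold setting are assumptions chosen to keep it fast, not recommended values.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 8x8 digit images bundled with scikit-learn (a small stand-in for MNIST).
X, y = load_digits(return_X_y=True)

# Illustrative grid only; real experiments usually search a wider range.
param_grid = {"C": [1, 10], "gamma": [0.001, 0.01]}

# 3-fold cross-validation over every (C, gamma) combination.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

`search.best_score_` is the mean cross-validated accuracy of the best parameter combination; it is not a test-set score, so a held-out set is still needed for a final estimate.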
MNIST is, for better or worse, one of the standard benchmarks for machine learning and is also widely used in the neural networks community as a toy vision problem.
Just for the unlikely case that anyone is not familiar with it:
It is a dataset of handwritten digits, 0-9, in black on white background.
It looks something like this:
There are 60000 training and 10000 test images, each 28x28 grayscale.
Each category has roughly the same number of examples in the test and training datasets.
I used it in some papers myself even though there are some reasons why it is a little weird.
Some not-so-obvious (or maybe they are) facts are:
- The images actually contain a 20x20 patch of digit and were padded to …
TL;DR: You probably shouldn’t be citing the "No Free Lunch" Theorem by Wolpert. If you’ve cited it somewhere, you might have used it to support the wrong conclusion. What it actually (vaguely) says is “You can’t learn from data without making assumptions”.
The paper on the “No Free Lunch Theorem”, actually called "The Lack of A Priori Distinctions Between Learning Algorithms", is one of those papers that are often cited and rarely read. I hear many people in the ML community refer to it when supporting the claim that “one model can’t be the best at everything” or “one model won’t always be better than another model”.
The point of this post is to convince you that this is not what the paper or theorem says (at least not the one by Wolpert that is usually cited), and that you should not cite this theorem in this context; and also that commonly cited versions of the "No Free Lunch" Theorem are not actually true.
Multiple Theorems, one Name
The first problem is that ther…