Machine Learning Cheat Sheet (for scikit-learn)

January 25, 2013

As you hopefully have heard, we at scikit-learn are doing a user survey (which is still open by the way).
One of the requests there was to provide some sort of flow chart on how to do machine learning.

As this is clearly impossible, I went to work straight away.

This is the result:

[edit2]
Clarification: by ensemble classifiers and ensemble regressors I mean random forests, extremely randomized trees, gradient boosted trees, and the soon-to-come weight-boosted trees (AdaBoost).
[/edit2]
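For orientation, here is a minimal sketch of how those four ensemble families map onto classifier classes in `sklearn.ensemble`. The synthetic dataset and the hyperparameters are assumptions made for illustration only; they are not taken from the chart.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

# Synthetic data, purely for illustration (an assumption, not from the post).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# One estimator per ensemble family named in the clarification.
ensembles = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "extremely randomized trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "gradient boosted trees": GradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

train_accuracy = {}
for name, model in ensembles.items():
    model.fit(X, y)
    # Accuracy on the training data only; this says nothing about generalization.
    train_accuracy[name] = model.score(X, y)
    print(f"{name}: training accuracy {train_accuracy[name]:.2f}")
```

Each classifier has a regressor counterpart in the same module (for example `RandomForestRegressor`).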

Needless to say, this sheet is completely authoritative.

Thanks to Rob Zinkov for pointing out an error in one yes/no decision.

More seriously: this is actually my workflow / train of thought whenever I try to solve a new problem. Basically, start simple. If that doesn't work out, try something more complicated.
The chart above includes the intersection of all algorithms that are in scikit-learn and the ones that I find most useful in practice.
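The "start simple, then escalate" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the synthetic data, the two candidate models, and the `GOOD_ENOUGH` threshold are all hypothetical and would depend on the actual problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for a new problem (assumption).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Candidates ordered from simple to more complicated.
candidates = [
    ("linear SVC", LinearSVC(max_iter=10000)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]

GOOD_ENOUGH = 0.90  # hypothetical threshold; pick one that fits your problem

scores = {}
chosen = None
for name, model in candidates:
    # Mean accuracy over 5 cross-validation folds.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy {scores[name]:.3f}")
    if scores[name] >= GOOD_ENOUGH:
        chosen = name
        break  # the simpler model is good enough, stop here

print("chosen:", chosen)
```

If no candidate reaches the threshold, `chosen` stays `None`, which is the cue to revisit the data and preprocessing rather than to keep adding models.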

Except that I always start out with "just looking". To make any of the algorithms actually work, you need to do the right preprocessing of your data, which is much more of an art than picking the right algorithm, imho.
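As one small example of what preprocessing can mean in practice, the sketch below compares an RBF-kernel `SVC` on raw features with the same model behind a `StandardScaler` in a pipeline. The data and the artificial rescaling of the columns are assumptions chosen to make feature scale matter; real data needs its own judgment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (assumption), then put the columns on very different scales.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = X * np.logspace(0, 4, X.shape[1])

# Same model, with and without standardization.
raw_model = SVC()
scaled_model = make_pipeline(StandardScaler(), SVC())

raw_score = cross_val_score(raw_model, X, y, cv=5).mean()
scaled_score = cross_val_score(scaled_model, X, y, cv=5).mean()

print(f"raw features:    {raw_score:.3f}")
print(f"scaled features: {scaled_score:.3f}")
```

Putting the scaler inside the pipeline means it is fit only on the training folds during cross-validation, so the comparison does not leak information from the held-out data.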

Anyhow, enjoy ;)

[edit3]
You can find the SVG and dia file I used here. I doubt this qualifies as a creative work, but to make sure, I put this under the CC0 license, which translates to "public domain" in the US.
[/edit3]
[edit]
Some people commented that structured prediction is not included in the chart. There is SVMstruct, which is a great library and has interfaces to many languages, but it is only free for non-commercial use.
There is also the library I'm working on, pystruct, which I will write about on another day ;)

The chart is not really comprehensive, as I focused on scikit-learn. Otherwise I certainly would have included neural networks ;)
[/edit]

Comments

Linas, January 26, 2013 at 5:43 AM
"looking for structure" leads to "tough luck". I notice that program learning doesn't appear anywhere on the chart. Am I missing something?

(Disclaimer: I work on a program learning system, to learn structure.)

Gaël, January 26, 2013 at 11:39 AM
Fucking awesome. We need this in the scikit-learn documentation. This is an SVG edited with Inkscape, right?

3Nigma, January 26, 2013 at 2:20 PM
Sweet! ... Comprehensive indeed. Thanks!

Anonymous, January 27, 2013 at 7:26 AM
Is this based on eh... data?

Anonymous, January 27, 2013 at 11:37 AM
This is nice. For ease of reading, it would have been better to be consistent in the choice of ">" or "<".

cast42, January 27, 2013 at 2:17 PM
Why is the random forest technique not in this graph?

Unknown, January 27, 2013 at 2:48 PM
Great! Very helpful.

Thank you.

x, January 27, 2013 at 4:35 PM
Thanks for posting this! As someone with a programming background but not really a machine learning background this is quite helpful.

Unknown, January 27, 2013 at 6:08 PM
Great effort, thanks for sharing with all.

Unknown, January 27, 2013 at 8:54 PM
Very nice. imho, there should be some consideration for the dimensionality and sparsity of the data, not just number of samples. In many (non-kernel based) algorithms, such as least squares regression, dimensionality is the limiting factor, and you only need to be able to make a single pass over the data.

Celeste, January 28, 2013 at 11:27 AM
Couldn't DBSCAN be used within clustering --> unknown number of categories --> under 10K samples? Also, how do you know the order of magnitude of the sample size needed in all of these cases?
Thanks :)

Anonymous, January 29, 2013 at 4:39 PM
If you were going to include neural networks on your chart, about where would you put them?

Anonymous, January 30, 2013 at 1:40 AM
Hi,
Cool chart and info! Which tool did you use to create the chart?

Thanks!!

Anonymous, January 30, 2013 at 7:53 PM
Awesome...two paws up!

Anonymous, February 6, 2013 at 2:09 PM
Now make this into a (meta-)algorithm

Dijun Luo, February 15, 2013 at 5:48 PM
This is very interesting. But when an algorithm does not work, we can also try different ways to normalize the data before switching to another algorithm. Sometimes normalization helps a lot.

Jesse, February 20, 2013 at 10:03 PM
This is totally useful for those of us who are getting started with kaggle.com competitions, thanks for making it!

Το Μπλε Παπούτσι, March 9, 2013 at 10:56 AM
Great info!! :)

I have a problem though, despite looking at the picture...

I want to use any algorithm from Weka on the following problem, but I do not know how I should preprocess my data, or which algorithm would work well.

I have some data on houses, such as their size (in square meters), whether they use air conditioning, and how many residents live in them; I have their electricity consumption as well. I want to train a machine learning algorithm on this dataset in order to create a model that estimates a house's consumption.

I tried many different algorithms (using Weka), but I did not get good results. I was told that SVMs could solve this problem with the right preprocessing. However, I did not get good results with them either.

Can anyone help me with how I should approach this problem? I am really stuck.

Thanks in advance

Unknown, April 17, 2013 at 5:57 AM
I guess being bayesian is the tough luck case.. :P

Anonymous, April 22, 2013 at 1:40 AM
Hi,

I have a very large and rich dataset from an e-commerce site I run, and I want to group my clients by how frequently they buy and discover groups of preferences among them. I have some categories of products, and clients buy in one or more categories.

So far, most machine learning algorithms I have found are explained in mathematical depth. But I am having trouble transforming my data into something to feed to these algorithms.

In your post you said that preprocessing data is an art. Where can I read something useful to guide me through that?

thanks in advance,
Bruno

Anonymous, May 17, 2013 at 11:18 PM
This is awesome and helpful.

By the way, what do you mean by "just looking", and why does it lead to dimensionality reduction? Does it mean that you only want to get an overview of the data?

Anonymous, June 6, 2013 at 2:15 PM
In the lower left 'clustering' blob, the rightmost '<10K' question has 'yes' pointing to "Meanshift VBGMM", and 'no' to "Tough luck". Shouldn't that be the other way around? :)

Unknown, June 13, 2013 at 2:14 PM
Hello, this was very, very helpful. Could you please explain what you mean by the right preprocessing of data? I am currently working on text classification using linear regression, and I have 20 categories. I am confused about what my dataset should look like. Can you please explain that part? Thanks in advance.

Conrad Lee, June 13, 2013 at 4:13 PM
MeanShift should scale well beyond 10k samples. I've used it on datasets with millions of samples (you need to use bin_seeding). Disclaimer: I sped up the MeanShift implementation a few years ago, so I'm slightly offended to see you label it as only appropriate for <10k samples :-)
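
A minimal sketch of the `bin_seeding` option mentioned here, using scikit-learn's `MeanShift`. The blob data, the sample count, and the bandwidth quantile are assumptions for illustration; no timing or scaling claims are made beyond what the comment states.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic 2-D blobs with more than 10k samples (assumption).
X, _ = make_blobs(n_samples=20000, centers=3, cluster_std=0.6, random_state=0)

# Estimate the kernel bandwidth on a small subsample to keep this step cheap.
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500, random_state=0)

# bin_seeding=True seeds the search from a coarse grid of occupied bins
# instead of starting one kernel at every sample.
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = ms.fit_predict(X)

n_clusters = len(ms.cluster_centers_)
print("estimated number of clusters:", n_clusters)
```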

Kaos Bola, July 9, 2013 at 6:19 AM
Thanks for posting this! As someone with a programming background but not really a machine learning background this is quite helpful.

Unknown, August 28, 2013 at 8:42 PM
Great post. What are the advantages to trying Linear SVC before Naive Bayes when working with Text Data?

Anonymous, November 16, 2013 at 5:34 PM
Awesome

Anonymous, September 1, 2014 at 11:57 AM
This looks very similar to the map described on dlib C++ Machine Learning library page. http://dlib.net/ml_guide.svg

Evan, January 5, 2015 at 11:44 AM
It would have been better if you had drawn this with a flow chart tool like Creately for the tutorial. It's OK though! Thanks for sharing.

Evan, February 8, 2016 at 6:06 AM
Just found this, please contribute these diagrams to creately diagram community.

Unknown, November 3, 2017 at 1:25 PM
Every time I want to use mglearn.discrete_scatter(X[:, 0], X[:, 1], y) to plot my own dataset, which is made up of 17 features and a target variable taking values 0 and 1, I keep getting this error:
line 42, in mglearn.discrete_scatter(X[:, 0], X[:, 1], y_train)
File "C:\Users\DeJavu\Anaconda2\lib\site-packages\pandas\core\frame.py", line 2062, in __getitem__
return self._getitem_column(key)
File "C:\Users\DeJavu\Anaconda2\lib\site-packages\pandas\core\frame.py", line 2069, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\DeJavu\Anaconda2\lib\site-packages\pandas\core\generic.py", line 1532, in _get_item_cache
res = cache.get(item)
TypeError: unhashable type
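
The traceback goes through `pandas/core/frame.py`, which suggests that `X` is a pandas DataFrame rather than a NumPy array. On a DataFrame, `X[:, 0]` is treated as a column lookup with the key `(slice, 0)` instead of positional indexing, and that lookup fails. A minimal sketch of a workaround, assuming a hypothetical 17-feature DataFrame as a stand-in for the data described in the comment:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the commenter's data: 17 features, binary target.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(20, 17), columns=[f"f{i}" for i in range(17)])
y = pd.Series(rng.randint(0, 2, size=20))

# NumPy-style indexing on a DataFrame is a column lookup and raises an error
# (the exact exception type depends on the pandas and Python versions).
try:
    X[:, 0]
except Exception as exc:
    print("DataFrame indexing failed with", type(exc).__name__)

# Option 1: positional selection with .iloc, then convert to NumPy.
first = X.iloc[:, 0].to_numpy()
second = X.iloc[:, 1].to_numpy()

# Option 2: convert the whole frame once, then index NumPy-style.
X_array = X.to_numpy()
y_array = y.to_numpy()
print(X_array[:, 0].shape, X_array[:, 1].shape, y_array.shape)
```

With the arrays in hand, the call from the comment would presumably become `mglearn.discrete_scatter(X_array[:, 0], X_array[:, 1], y_array)`.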