One of the requests there was to provide some sort of flow chart on how to do machine learning.

As this is clearly impossible, I went to work straight away.

This is the result:

[edit2]

Clarification: by ensemble classifiers and ensemble regressors I mean **random forests**, **extremely randomized trees**, **gradient boosted trees**, and the soon-to-come weight boosted trees (AdaBoost).

[/edit2]
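To make the "ensemble" box a bit more concrete, here is a minimal sketch that cross-validates the four tree ensembles named above. The synthetic dataset, the number of trees and the number of folds are assumptions chosen only for illustration; AdaBoost has since been added to scikit-learn as `AdaBoostClassifier`, so it is included too.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score

# Synthetic data, purely for illustration (not from the post).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# The estimators behind "ensemble classifiers" in the chart.
ensembles = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "extremely randomized trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "gradient boosted trees": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Mean 3-fold cross-validated accuracy for each ensemble.
scores = {}
for name, clf in ensembles.items():
    scores[name] = cross_val_score(clf, X, y, cv=3).mean()
    print(f"{name}: {scores[name]:.3f}")
```

For regression, the corresponding `...Regressor` classes can be swapped in the same way.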

Needless to say, this sheet is completely authoritative.

Thanks to Rob Zinkov for pointing out an error in one yes/no decision.

More seriously: this is actually my work flow / train of thoughts whenever I try to solve a new problem. Basically, start simple first. If this doesn't work out, try something more complicated.

The chart above includes the intersection of all algorithms that are in scikit-learn and the ones that I find most useful in practice.

Only that I **always** start out with "just looking". To make any of the algorithms actually work, you need to do the *right* preprocessing of your data, which is much more of an art than picking the right algorithm imho.

Anyhow, enjoy ;)

[edit3]

You can find the SVG and dia file I used here. I doubt this qualifies as a creative work, but to make sure, I put this under the CC0 license, which translates to "public domain" in the US.

[/edit3]

[edit]

As some people commented about structured prediction not being included in the chart: There is SVMstruct, which is a great library and has interfaces to many languages, but is only free for non-commercial use.

There is also the library I'm working on, pystruct, which I will write about on another day ;)

The chart is not really comprehensive, as I focused on scikit-learn. Otherwise I certainly would have included neural networks ;)

[/edit]

"looking for structure" leads to "tough luck". I notice that program learning doesn't appear anywhere on the chart. Am I missing something?

(Disclaimer: I work on a program learning system, to learn structure.)

I don't really know about program learning. The reason that "looking for structure" leads to "tough luck" is 1) I only listed algorithms that are implemented in sklearn 2) if you need a flow-chart to know what to do, I don't think you will be successful in working with structured models ;)

Fucking awesome. We need this in the scikit-learn documentation. This is an SVG edited with Inkscape, right?

I started out with dia and then did some modifications in Inkscape. You can find both files here: https://gist.github.com/4642976 :)

Hi, please give me any simple problem for doing a Ph.D. in SVM.

Sweet! ... Comprehensive indeed. Thanks!

Is this based on eh... data?

Which part do you mean? Most arrows tell you which class of algorithms to use. Mean shift doesn't take the number of clusters as a parameter. That is a fact and doesn't need data.

That stochastic and online algorithms scale to more data is based on runtime analysis. The numbers that I chose for branching are just rules of thumb and naturally depend on the problem at hand.

The sequence of algorithms (Linear SVM -> KNN -> SVM) is based on runtime complexity, the simplicity of the algorithm (you can read that as VC dimension if you want), the robustness and the ease of use.
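The "start simple, then escalate" sequence can be sketched roughly as follows. This is only an illustration under assumptions: the synthetic dataset, the standardization step and the hyper-parameter values are placeholders, not settings taken from the chart.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic data, purely for illustration.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

# Candidates in the order of the chart: linear SVM, then KNN, then kernel SVM.
candidates = [
    ("LinearSVC", make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))),
    ("KNeighborsClassifier", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ("SVC (RBF kernel)", make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))),
]

# Evaluate each candidate with 5-fold cross-validation, simplest first.
results = []
for name, model in candidates:
    score = cross_val_score(model, X, y, cv=5).mean()
    results.append((name, score))
    print(f"{name}: {score:.3f}")
```

In practice you would stop at the first model that is good enough for the problem instead of always running all three.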

So as far as I can see, the most arbitrary choice I made was which algorithms to include. As I said, that was governed by the availability in scikit-learn and my personal experience.

If you want more data, maybe look at the top ten algorithms in data mining (http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf). They mostly select the same algorithms as I did, in the areas that both they and I covered.

This is nice. For ease of reading, it would have been better to be consistent in choice of ">" or "<"

True. Well, I did that for all the important ones downstream, but missed the >50 one :-/

Why is the random forest technique not in this graph?

It is, only not very explicitly. I'll correct this. The ensemble classifiers and ensemble regressors refer to random forests, extremely randomized trees and gradient boosted trees (and weight boosted trees as soon as they are added).

Great! Very helpful.

Thank you.

Thanks for posting this! As someone with a programming background but not really a machine learning background this is quite helpful.

Great effort, thanks for sharing with all.

Very nice. imho, there should be some consideration for the dimensionality and sparsity of the data, not just number of samples. In many (non-kernel based) algorithms, such as least squares regression, dimensionality is the limiting factor, and you only need to be able to make a single pass over the data.

You are right, there are many other factors to consider. Clearly this is just a very rough guide.

In my personal experience, I have never had a problem with large numbers of features. But that obviously depends on your setting. I worked with image data, where there are usually <10k features, and text data, which is very sparse.

Couldn't DBSCAN be used within clustering --> unknown number of categories --> under 10K samples? Also, how do you know the order of magnitude of the sample size needed in all of these cases?

Thanks :)

Yes, DBSCAN could be used there. I am not very experienced with that algorithm, though, and therefore it would not be my first choice - maybe it should be ;)

There is no way to tell how many samples you need.

The sample counts are estimates of where the algorithms are computationally feasible. If you try to run a kernel SVM on 500k samples, you will probably wait a week (if you don't run out of memory first) and then see that you didn't choose the right bandwidth gamma.

The <50 is a pretty arbitrary lower limit. Someone on reddit suggests "call a statistician" instead of getting more data, which definitely is an alternative. This is not really the regime for which these algorithms were designed.
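When a kernel SVM becomes too slow, one route is kernel approximation followed by a linear model trained with SGD. The sketch below is a rough illustration under assumptions: the synthetic data, the `gamma` value and the number of random features are placeholders that would need tuning on a real problem.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
X, y = make_classification(
    n_samples=10000, n_features=20, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Approximate an RBF kernel with random Fourier features, then fit a
# linear classifier with SGD, which scales roughly linearly in the samples.
model = make_pipeline(
    StandardScaler(),
    RBFSampler(gamma=0.05, n_components=300, random_state=0),
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

The bandwidth `gamma` still has to be chosen, but each trial is much cheaper than refitting an exact kernel SVM.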

Actually DBSCAN with a good index (k-d-tree, r-tree) should scale well into the millions of objects.

If you were going to include neural networks on your chart, about where would you put them?

My first impulse would be to put them into the SVC/SVR and ensemble method boxes for classification and regression. But actually they scale even to very large datasets.

Arguably they are a bit harder to get right than SVMs and random forests. So maybe there would be arrows going from SVMs and kernel approximation going to NNs.

Hi,

Cool chart and info! Which tool did you use to create the chart?

Thanks!!

Dia and Inkscape, though I wasn't very happy with that.

Awesome...two paws up!

Thank you.

Now make this into a (meta-)algorithm

This is very interesting. But when an algorithm does not work, we can also try different ways to normalize the data before switching to another algorithm. Sometimes normalization helps a lot.
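A small sketch of the normalization point: the same kernel SVM is evaluated on raw data and with two different scalers. The setup is an assumption made for illustration; in particular, one pure-noise feature on a much larger scale is appended on purpose so that the effect is visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Synthetic data, purely for illustration.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# Append an uninformative feature whose scale dwarfs all the others.
rng = np.random.RandomState(0)
X = np.hstack([X, 1000.0 * rng.normal(size=(X.shape[0], 1))])

# Same classifier, three different preprocessing choices.
variants = {
    "raw": SVC(gamma="scale"),
    "standardized": make_pipeline(StandardScaler(), SVC(gamma="scale")),
    "min-max scaled": make_pipeline(MinMaxScaler(), SVC(gamma="scale")),
}

scores = {}
for name, model in variants.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information leaks from the held-out fold.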

This is totally useful for those of us who are getting started with kaggle.com competitions, thanks for making it!

Great info!! :)

I have a problem though, despite looking at the picture...

I want to use an algorithm from weka on the following problem, but I do not know how I should preprocess my data, or which algorithm would work well.

I have some data about houses, such as their size (in square meters), whether they use air conditioning and how many residents live in them, and I have their electricity consumption as well. I want to train a machine learning algorithm on this dataset in order to create a model that estimates a house's consumption.

I tried many different algorithms (using weka), but I did not get good results. I was told that SVMs could solve this problem with the right preprocessing. However, I did not get good results with them either.

Can anyone help me with how I should approach this problem? I am really stuck.

Thanks in advance

I guess being bayesian is the tough luck case.. :P

Hi,

I have a very large and rich dataset from an e-commerce site I run, and I wanted to group my clients into frequent buyers and discover groups of preferences among them. I have some categories of products and they buy in one or more categories.

So far, most machine learning algorithms are explained in depth mathematically. But I am having trouble transforming my data into something to feed these algorithms.

In your post you said that preprocessing data is an art. Where can I read something useful to guide me through that?

thanks in advance,

Bruno

Hi Bruno. Unfortunately I don't know of any good resources on preprocessing. My best guess would be "Machine learning for Hackers", as this is more geared towards the practitioner. I haven't read it, though, and it is for R.

I heard rumors of some machine learning guru working on a scikit-learn book, that would also be more on the practical side. Not sure if this would come out this year, though.

For the moment, my best advice would be to get a machine learner / data scientist to work with you ;)

Cheers,

Andy

This is awesome and helpful.

By the way, what do you mean by "just looking", and why does it lead to dimensionality reduction? Does it mean that you only want to have an overview of the data?

Basically I use dimensionality reduction and manifold learning mainly for visualization. I project down to 2 dimensions and then do a scatter plot.
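As a rough illustration of "project down to 2 dimensions and then do a scatter plot", the sketch below uses PCA on the digits dataset that ships with scikit-learn. Both choices are assumptions made for the example; a manifold learning method could be swapped in the same way. The `scatter_plot` helper is hypothetical and only imports matplotlib when it is called.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small bundled dataset with 64 features per sample.
digits = load_digits()

# Project the data down to 2 dimensions.
X_2d = PCA(n_components=2).fit_transform(digits.data)
print(X_2d.shape)


def scatter_plot(points, labels):
    """Scatter plot of 2-D points, colored by label (needs matplotlib)."""
    import matplotlib.pyplot as plt

    plt.scatter(points[:, 0], points[:, 1], c=labels, s=8, cmap="tab10")
    plt.xlabel("component 1")
    plt.ylabel("component 2")
    plt.colorbar(label="digit")
    plt.show()
```

Calling `scatter_plot(X_2d, digits.target)` then shows whether the classes form visible groups before any model is trained.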

In the lower left 'clustering' blob, the rightmost '<10K' question has 'yes' pointing to "Meanshift VBGMM", and 'no' to "Tough luck". Shouldn't that be the other way around? :)

Errr... no? With few examples you can do something, with more it is harder (though DBSCAN might do it; it is not in sklearn currently, though).

Hello, this was very, very helpful. Can you please explain what you mean by the right preprocessing of data? I am currently working on text classification using linear regression and have 20 categories. I am confused about how my dataset should look. Can you please explain that part? Thanks in advance.

MeanShift should scale well beyond 10k samples. I've used it on datasets with millions of samples (you need to use bin_seeding). Disclaimer: I sped up the meanshift implementation a few years ago, so I'm slightly offended to see you label it as only appropriate for <10k samples :-)

Oh, I'm sorry. Maybe I used the old implementation :-/ We should have an example of large-scale clustering in sklearn! If I find time, I'll play around with it.
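A small sketch of the `bin_seeding` option mentioned above. The blob data, the `quantile` value and the subsample size used for the bandwidth estimate are assumptions for illustration, and the example is far smaller than the millions of samples discussed in the comment.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Three well-separated blobs, purely for illustration.
X, _ = make_blobs(
    n_samples=3000,
    centers=[[0, 0], [5, 5], [0, 5]],
    cluster_std=0.6,
    random_state=0,
)

# Estimate the bandwidth on a subsample to keep this step cheap.
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500, random_state=0)

# bin_seeding=True seeds the search from a coarse grid of occupied bins
# instead of from every data point, which reduces the number of seeds.
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = ms.fit_predict(X)

n_clusters = len(ms.cluster_centers_)
print(f"bandwidth: {bandwidth:.2f}, clusters found: {n_clusters}")
```

Note that the number of clusters is an output here, not a parameter; only the bandwidth has to be chosen or estimated.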


Great post. What are the advantages to trying Linear SVC before Naive Bayes when working with Text Data?

Thanks :)

You got me, there is not really a reason. Probably the take-away message is that I just try Linear SVC before thinking about the problem at all ;)

Naive Bayes might be a more natural first choice, but would make the graph slightly more complicated (it would have Linear SVC in two places). I would say I prefer algorithms that are less sensitive to hyper-parameters, but I have too little experience with NB to know if it is more sensitive to alpha than LinearSVC is to C.
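One way to probe the sensitivity question empirically is to cross-validate both models over a small grid of `alpha` and `C` values and compare how much the scores move. The sketch below shows only the mechanics: the twelve-document toy corpus and the grids are made up for illustration, and a corpus this small says nothing reliable about either model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus, invented for this example: 0 = sports, 1 = cooking.
docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "fans cheered as the team lifted the trophy",
    "the goalkeeper saved a penalty in the final",
    "the league season starts with an away match",
    "simmer the sauce with garlic and tomatoes",
    "bake the bread until the crust is golden",
    "chop the onions and fry them in butter",
    "season the soup with salt and pepper",
    "whisk the eggs and fold in the flour",
    "roast the vegetables with olive oil",
]
labels = [0] * 6 + [1] * 6

# One grid search per model; step names come from make_pipeline.
searches = {
    "MultinomialNB": GridSearchCV(
        make_pipeline(TfidfVectorizer(), MultinomialNB()),
        {"multinomialnb__alpha": [0.01, 0.1, 1.0]},
        cv=3,
    ),
    "LinearSVC": GridSearchCV(
        make_pipeline(TfidfVectorizer(), LinearSVC()),
        {"linearsvc__C": [0.01, 0.1, 1.0, 10.0]},
        cv=3,
    ),
}

# Mean cross-validated score for every grid point of each model.
sensitivity = {}
for name, search in searches.items():
    search.fit(docs, labels)
    sensitivity[name] = search.cv_results_["mean_test_score"]
    print(name, search.best_params_, sensitivity[name])
```

A wide spread of scores across the grid would suggest the model is sensitive to that hyper-parameter on the data at hand.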

Awesome

This looks very similar to the map described on the dlib C++ Machine Learning library page: http://dlib.net/ml_guide.svg

Yeah, Daniel said he liked the idea so he created one for his lib, too :)

I don't know about you guys, but this is the most complex flowchart diagram I've ever seen. However, I did not see that it was drawn with flowchart symbols. Still, it's legit.

It would have been better if you had drawn this with a flow chart tool like creately for the tutorial. It's ok though! Thanks for sharing..

It is drawn with a flow chart tool, dia ;)
