Posts

Don't fund Software that doesn't exist

I’ve been happy to see an increase in funding for open source software across research areas and funding bodies. However, I have observed that a majority of funding from, say, the NSF goes either to projects that do not exist yet, where the grant is supposed to create a new project, or to projects that are developed and used within a single research lab. I think this top-down approach to creating software comes from a misunderstanding of the existing open source software that is used in science.

This post collects thoughts on the effectiveness of current grant-based funding and how to improve it from the perspective of the grant-makers. Instead of the current approach of funding new projects, I would recommend funding existing open source software, ideally software that is widely used and underfunded. The story of the underfunded but critically important open source software (which I’ll refer to as infrastructure software) should be an old tale by now. If this is news to yo…

Don't cite the No Free Lunch Theorem

tl;dr: You probably shouldn’t be citing the "No Free Lunch" theorem by Wolpert. If you’ve cited it somewhere, you might have used it to support the wrong conclusion. What it actually (vaguely) says is: “You can’t learn from data without making assumptions.”

The paper behind the “No Free Lunch Theorem”, actually titled "The Lack of A Priori Distinctions Between Learning Algorithms", is one of those papers that are often cited and rarely read. I hear many people in the ML community refer to it to support claims like “one model can’t be the best at everything” or “one model won’t always be better than another model”. The point of this post is to convince you that this is not what the paper or theorem says (at least not the one by Wolpert that is usually cited), that you should not cite the theorem in this context, and that commonly cited versions of the "No Free Lunch" theorem are not actually true.
Multiple Theorems, one Name

The first problem is that ther…

Off-topic: speed reading like spritz

As the title suggests, this is a non-machine-learning, non-vision, non-python post *gasp*.
Some people in my network posted about Spritz, a startup that recently came out of stealth mode. They have a pretty cool app for speed reading. See this Huffington Post article for a quick demo and explanation.
They say they are still in development, so the app is not available for the public.

The app seems pretty neat but also pretty easy to replicate. I said as much, and people came back with "they probably do a lot of natural language processing and parse the sentence to align it to the proper character" and similar remarks.
So I reverse engineered it. By which I mean opening the demo GIFs in Gimp and counting letters. And, surprise surprise: they just count letters. The letter they highlight (at least in the demo) depends only on the length of the word.
The magic formula is
highlight = 1 + ceil(word_length / 4). They might also be timing the transitions differently, haven't re…
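For the curious, the counting rule above takes only a couple of lines of Python. Note that `highlight_index` is my name for it, and the clamp for one-letter words is my own addition (the demo never shows the formula going out of range):

```python
import math

def highlight_index(word):
    """1-based index of the highlighted letter, per the formula above.

    The min() clamp keeps the index inside the word for one-letter
    words; the post itself only states the bare formula.
    """
    return min(len(word), 1 + math.ceil(len(word) / 4))

# a 4-letter word highlights the 2nd letter,
# a 7-letter word the 3rd
```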

Scikit-learn sprint and 0.14 release candidate (Update: binaries available :)

Yesterday a week-long scikit-learn coding sprint in Paris ended.
And let me just say: a week is pretty long for a sprint; I think most of us were pretty exhausted by the end. But we put together a release candidate for 0.14, which Gael Varoquaux tagged last night.

You can install it via:
pip install -U https://github.com/scikit-learn/scikit-learn/archive/0.14a1.zip

There are also tarballs on GitHub and binaries on SourceForge.

If you want the most current version, you can check out the release branch on GitHub:
https://github.com/scikit-learn/scikit-learn/tree/0.14.X

The full list of changes can be found in what's new.

The purpose of the release candidate is to give users a chance to give us feedback before the final release. So please try it out and report back if you run into any issues.

ICML 2013 Reading List

ICML has been over for two weeks now, but I still wanted to write about my reading list, as there were some quite interesting papers (the proceedings are here). Also, I haven't blogged in ages, for which I really have no excuse ;)

There are three topics I am particularly interested in that got a lot of attention at this year's ICML: neural networks, feature expansion and kernel approximation, and structured prediction.

pystruct: more structured prediction with python

Some time ago I wrote about pystruct, a structured learning project I have been working on for a while.
After a period of inactivity, it has come quite a long way in the last couple of weeks as I picked up work on structured SVMs again. So here is a quick update on what you can do with it.

To the best of my knowledge, this is the only tool with ready-to-use functionality for learning structural SVMs (or max-margin CRFs) on loopy graphs, even though this is pretty standard in the (computer vision) literature.

Machine Learning Cheat Sheet (for scikit-learn)

As you hopefully have heard, we at scikit-learn are doing a user survey (which is still open by the way).
One of the requests there was to provide some sort of flow chart on how to do machine learning.

As this is clearly impossible, I went to work straight away.

This is the result:



[edit2]
Clarification: by ensemble classifiers and ensemble regressors I mean random forests, extremely randomized trees, gradient boosted trees, and the soon-to-come weight-boosted trees (AdaBoost).
[/edit2]
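Since the flow chart points at these estimators, here is a minimal sketch showing that the ones already in scikit-learn all share the same fit/score interface (toy data via make_classification; hyperparameters are left at their defaults):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier,
                              ExtraTreesClassifier,
                              GradientBoostingClassifier)

# toy problem just to demonstrate the shared interface
X, y = make_classification(n_samples=200, random_state=0)

for Est in (RandomForestClassifier, ExtraTreesClassifier,
            GradientBoostingClassifier):
    est = Est(random_state=0).fit(X, y)
    print(Est.__name__, est.score(X, y))
```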


Needless to say, this sheet is completely authoritative.