Don't cite the No Free Lunch Theorem

Tldr; You probably shouldn’t be citing the "No Free Lunch" Theorem by Wolpert. If you’ve cited it somewhere, you might have used it to support the wrong conclusion. What it actually (vaguely) says is “You can’t learn from data without making assumptions”.

The paper on the “No Free Lunch Theorem”, actually called "The Lack of A Priori Distinctions Between Learning Algorithms" is one of these papers that are often cited and rarely read, and I hear many people in the ML community refer to it when supporting the claim that “one model can’t be the best at everything” or “one model won’t always be better than another model”. The point of this post is to convince you that this is not what the paper or theorem says (at least not the one usually cited by Wolpert), and you should not cite this theorem in this context; and also that common versions cited of the "No Free Lunch" Theorem are not actually true.

Multiple Theorems, one Name

The first problem is that there are at least two theorems with the name “no free lunch” that I know about. One by Wolpert, first published in The Lack of A Priori Distinctions Between Learning Algorithms (there's actually multiple but they go in the same direction), and one in Understanding Machine Learning By Shalev-Shwarz and Ben-David (a very excellent book!). Wolpert also published a “No Free Lunch in Optimization”, but I'm only concerned with the theorem for supervised learning. The Theorem in Understanding Machine Learning is actually quite different, and I’ll discuss it below. I think the goal of Shalev-Shwarz and Ben-David was to formalize the folk wisdom of the “no free lunch” theorem in a different way than Wolpert did, and their theorem actually says “One model can’t always win”, though in a very specific way (and if you cite this one, there’s other caveats, see below). In a way, the theorem they present is much clearer in what we actually learn from it. But I’m not sure giving it a name that was already taken was a good idea.

So what does it say?

The main statement of the original paper can be summarized as “Under the assumptions of the theorem, any two prediction functions have the same ability to generalize”. There are two crucial parts to this: the assumptions and the conclusions. Let’s start with trying to understand the conclusion. It is often read to mean something like “Gradient Boosting can’t always win”. Instead, what it actually says is that “Gradient boosting is as good as always predicting the most frequent class”. Or, “Neural networks are as good as predicting the least frequent class.” Clearly these statements don’t correspond to our actual experience in machine learning practice. Always predicting the least frequent class is clearly a terrible strategy. But according to the theorem, it’s as good as the best state-of-the-art model you can find with respect to generalization properties. So what’s happening here?

The Assumptions

The key to understanding the theorem is in understanding the assumptions of the theorem. The theorem does not use one of the most commonly used assumptions in machine learning theory, which is that the data is drawn i.i.d. from some given distribution. Instead, Wolpert assumes the data is a finite set, and training and test set are disjoint and drawn from a discrete distribution. This does sound reasonable; in practice our data is always finite, and we want to generalize to new data that we haven’t seen before. Making these assumptions allow Wolpert to average over all possible datasets. The statement of the theorem is comparing two algorithms over all possible datasets generated using these assumptions.
While these assumptions sound reasonable for doing machine learning, they really are not. What these assumptions are saying is that
a) the test data and the training data are statistically independent, i.e. the test data has nothing to do with the training data, and
b) the labels have nothing to do with the features (because we are averaging over all possible labelings). Phrasing it this way, these assumptions are clearly not good for doing any predictive modelling.

So what does it mean?

Now that we revisited both conclusion and assumptions, let’s try to summarize, or maybe rephrase the No Free Lunch theorem by Wolpert. The conclusion includes “any model is as good as predicting the minority class”. What that really says is that “learning is impossible”. Given our understanding of the assumptions, the full statement of the Theorem is something like: “If the training set has nothing to do with the test set, and the features have nothing to do with the labels, learning is impossible”. This statement makes intuitive sense, but it is very far from what I hear commonly associated with the theorem. The most sensible reading of the theorem is “You need to make assumptions in order for learning to be possible”. However, what it really shows is that if you make the specific assumption of the theorem, learning is not possible. So if you want to claim generally that “learning requires assumptions”, I don’t think you should cite this paper.

What (I think) Wolpert’s point was

I think the point of the paper is to challenge the i.i.d. assumption. Wolpert gives (good) reasons why he thinks it’s not appropriate, and why machine learning theory should explore other frameworks. In particular, there is a good case to be made for modeling datasets as being finite. If this is the case, then the i.i.d. assumption would allow points to be both in the training and the test set. Clearly that’s not the point of generalization. So Wolpert requires training and test set to be disjoint, which also makes sense. However, the consequence is that training and test set (and their labels) are completely independent now, which is strange. I don’t know if he actually thought this was a good framework for machine learning. I assume his motivation was to challenge the field to revisit assumptions and find alternatives to the i.i.d. assumption that more closely resemble machine learning practice. Many years later, it seems to me there was an unfortunate lack of follow up from the rest of the community, and a clear misunderstanding of the theorem by many machine learning practitioners.

The other “No Free Lunch Theorem”

As I mentioned, there’s another “No Free Lunch Theorem”. It’s quite different in that it does use an i.i.d. assumption for evaluating a model. In other ways it’s quite similar in that it makes use of the fact that without additional assumptions, if you see part of the data, the remaining part could have any possible labeling. More concretely, the theorem says “For any predictive algorithm there is a dataset on which it fails, i.e. a dataset on which a different learner would perform better”. However, this does not prevent statements like “Algorithm X is always better than Algorithm Y”, because the algorithm that does better is not realizable (it's the algorithm that produces the true answer for this dataset without looking at the data). Under this framework I’m sure you could easily prove that in an imbalanced dataset, it’s better (in expectation) to predict the majority class than to predict the minority class, a statement that is not true in Wolpert’s framework.

What to Cite

I think there are very few cases when citing Wolpert supports whatever point you’re making. If your point is “No model can always be best”, I would suggest citing Shalev-Shwartz and Ben-David. If your point is “Learning is impossible without proper assumptions”, you might cite the whole chapter by Shalev-Shwartz and Ben-David. I’m not sure there’s a good way to make this statement in a formal way. You could cite Wolpert if you really want to, but I think that might be more confusing than helpful. If your point is “The i.i.d. assumption is weird in the presence of finite data”, definitely cite Wolpert!
Finally, if your point is “Gradient boosting can’t always be better than neural networks because of the No Free Lunch Theorem”, then, as far as I know, you’re wrong, and there is nothing that would prohibit a statement of this form. I don’t believe that there’s many strict “always worse” or “always better” relationships between commonly used ML algorithms, but I’m also not aware of any theoretical reason that would prevent such statements to exist (in a framework where learning is possible). And as I said above, predicting the majority class is always better than predicting the minority class, for a reasonable definition of “always”.

Learning More

If you’re interested in learning more about Machine Learning theory, I think the book by Shalev-Shwartz and Ben-David is actually quite excellent. I also really enjoyed “Foundations of Machine Learning” by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar. Both books are available as PDFs on the author websites I linked to. I’m by no means a theory person, but I think having some background on machine learning theory helps provide a framework in reasoning about algorithms. Of course you can also check out Wolpert’s paper, though I think that’s mostly interesting if you want to learn why he doesn’t like the i.i.d. Assumption - so it’s more about philosophy of machine learning theory than about standard machine learning theory.


Popular posts from this blog

Machine Learning Cheat Sheet (for scikit-learn)

A Wordcloud in Python

MNIST for ever....