Another look at MNIST
I'm a bit obsessed with MNIST.
Mainly because I think it should not be used in any papers any more - it is weird for a lot of reasons.
When preparing the workshop we held yesterday, I noticed one that I wasn't aware of yet: most of the 1-vs-1 subproblems are really easy!
Basically all pairs of numbers can be separated perfectly using a linear classifier!
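If you want to check this yourself, a rough sketch along the following lines should do it. This is not the code from the workshop; it assumes scikit-learn, pulls MNIST via fetch_openml, and the LinearSVC settings are just a guess:

```python
# Rough sketch (not the original workshop code): train a linear classifier
# on every pair of digits and report the training accuracy. Perfect (or
# near-perfect) training accuracy indicates the pair is linearly separable.
# Assumes scikit-learn; fetch_openml downloads MNIST on first use.
from itertools import combinations

from sklearn.datasets import fetch_openml
from sklearn.svm import LinearSVC

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X / 255.0, y.astype(int)

for a, b in combinations(range(10), 2):
    mask = (y == a) | (y == b)
    # Weak regularization (large C), since we only care about separability.
    clf = LinearSVC(C=100.0, max_iter=10000).fit(X[mask], y[mask])
    print(f"{a} vs {b}: training accuracy {clf.score(X[mask], y[mask]):.4f}")
```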
And even if you just do a PCA to two dimensions, they can pretty much still be linearly separated! It doesn't get much easier than that. This makes me even more sceptical about "feature learning" results on this dataset.
To illustrate my point, here are all pairwise PCA projections. The image is pretty huge. Otherwise you wouldn't be able to make out individual data points.
You can generate it using this very simple gist.
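Roughly, the idea is the following (this is a sketch of the approach, not the actual gist; it assumes scikit-learn and matplotlib, and the grid layout is arbitrary):

```python
# Sketch of the pairwise-PCA figure (not the original gist): for every pair
# of digits, fit a 2D PCA on just those two classes and scatter-plot the
# projection, colored by label.
from itertools import combinations

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
y = y.astype(int)

pairs = list(combinations(range(10), 2))  # 45 pairs
fig, axes = plt.subplots(9, 5, figsize=(25, 45))
for (a, b), ax in zip(pairs, axes.ravel()):
    mask = (y == a) | (y == b)
    Xp = PCA(n_components=2).fit_transform(X[mask])
    ax.scatter(Xp[:, 0], Xp[:, 1], c=(y[mask] == a), s=1, cmap="coolwarm")
    ax.set_title(f"{a} vs {b}")
    ax.set_xticks([])
    ax.set_yticks([])
plt.savefig("mnist_pairwise_pca.png", dpi=100)
```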
There are some classes that are not obviously separated: 3 vs 5, 4 vs 9, 5 vs 8 and 7 vs 9. But keep in mind, this is just a PCA to two dimensions. It doesn't mean that they couldn't be separated linearly in the original space.
Interestingly, the "1"s are very easy to identify; even with seven and nine there is basically no way to confuse them. The ones have a somewhat peculiar shape, though. It would be fun to see what a tour along the "bow" (see img at [2, 2]) would look like.
Manifold-people should be delighted ;)
I think this plot emphasizes again: look at your data!
I hope you enjoyed this perspective.
Nice work and observation.
It happens in classification practice that different algorithms often yield similar accuracy measures. Most of the time the algorithm doesn't matter as much.
I know it's a quick hack, but a much harder problem on MNIST is to separate two classes: digits 0-4 versus digits 5-9. Just give it a try!
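For reference, a minimal sketch of that experiment could look like this (my own guess at the setup, using scikit-learn; not the commenter's code):

```python
# Sketch of the harder binary task suggested above: label digits 0-4 as one
# class and 5-9 as the other, then see how far a linear model gets.
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
y_bin = (y.astype(int) >= 5).astype(int)  # 0 for digits 0-4, 1 for 5-9

X_train, X_test, y_train, y_test = train_test_split(
    X / 255.0, y_bin, test_size=10000, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("0-4 vs 5-9 test accuracy:", clf.score(X_test, y_test))
```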
The same test was done on USPS by Chapelle, in the now-classic paper "Training a Support Vector Machine in the Primal".
Check this! http://arxiv.org/abs/1301.3342 (Barnes-Hut SNE, by Laurens van der Maaten): all 70,000 MNIST digits with non-linear dimensionality reduction, among other datasets!
What we see from here is that _most_ of the samples are easily separable, but not all of them. It is very common in practice that, given some realistic dataset, you can easily reach a 90% recognition rate with the simplest algorithm you have, but every additional percent is very difficult to achieve.
I think it's a valid point. MNIST is really not that challenging compared to a "generic" classification problem, because the classes form nice clusters. I am looking for a problem where the classes cover more convex regions, and that is probably what happens when you aggregate 0-4 and 5-9 as pointed out before.
One important thing to notice here is that you re-calculated the PCA for each pair. So you found the single optimal direction to classify between each pair. But what that means in the original domain is that you need all the 10*9 = 90 optimal features to make your classification. And indeed, 90 components seems to be the point where you start to "explain" less than 90% of the variance. Having even more possible negative samples would also probably make the problem a little more challenging, e.g. mixing in some letters or CIFAR images. It's more noise to filter out...
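The explained-variance claim is easy to check with a quick sketch like this (again assuming scikit-learn; this is not code from the post or the comment):

```python
# Sketch: cumulative explained variance of a PCA fit on MNIST, to see how
# many components are needed to cover 90% of the variance.
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
pca = PCA(n_components=200).fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
print("components needed for 90% variance:", np.searchsorted(cum, 0.90) + 1)
```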