NIPS 2010 - Single Layer Networks in Unsupervised Feature Learning: The Deep Learning Killer [Edit: now available online!]

The paper "Single Layer Networks in Unsupervised Feature Learning" by Coates, Lee, and Ng is, in my opinion, one of the most interesting at this year's NIPS.
It's now available online! (pdf)
It follows a very simple idea: compare "shallow" unsupervised feature extraction methods on image data, using classification performance as the benchmark.
The datasets used are NORB and CIFAR-10, two of the most widely used datasets in the deep learning community.
Filters are learned on image patches, and features are then computed in a very simple pyramid over the image.
These are then classified using a linear SVM. The approaches compared are:
  • K-Means
  • soft K-Means
  • Sparse Autoencoder
  • Sparse RBM
Here, soft K-Means is an ad-hoc method that the authors devised as a natural extension of K-Means: a local coding based on the
k nearest centroids of a point. Cross-validation was performed to find the best patch size, number of features, and distance between sample points for features.
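The feature-encoding step can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: `kmeans_features` is a hypothetical helper, the hard variant is a one-hot assignment to the nearest centroid, and the soft variant uses a "triangle"-style activation (a feature fires only for centroids closer than the average distance), which is one way to realize the local coding described above.

```python
import numpy as np

def kmeans_features(patches, centroids, hard=True):
    """Encode patches (n, d) against a learned K-Means dictionary (k, d).

    hard=True  -> one-hot assignment to the nearest centroid.
    hard=False -> "triangle"-style soft activation: each feature is
                  max(0, mean_distance - distance), so only centroids
                  closer than average fire, giving a sparse code.
    """
    # pairwise Euclidean distances, shape (n_patches, n_centroids)
    dists = np.linalg.norm(patches[:, None, :] - centroids[None, :, :], axis=2)
    if hard:
        feats = np.zeros_like(dists)
        feats[np.arange(len(patches)), dists.argmin(axis=1)] = 1.0
        return feats
    mu = dists.mean(axis=1, keepdims=True)  # average distance per patch
    return np.maximum(0.0, mu - dists)      # clip: distant centroids -> 0
```

With a large dictionary, the soft variant produces sparse, mostly zero feature vectors that can then be pooled over the image and handed to the linear SVM.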
This does not seem so exciting so far. What is exciting are the results:
Not only does K-Means beat the other two feature extraction techniques; it also advances well beyond the state of the art on both datasets.
Results are reported with 1600 features for the algorithms mentioned above; "Soft K-Means 4000" denotes a run with 4000 features.
For example, results on CIFAR-10 are as follows:

These results are very surprising, to say the least, since a lot of effort went into designing LCC, the convolutional RBM, and the mc-RBM. The latter two are both deep architectures that were quite probably optimized for this dataset, and the convolutional RBM in particular comes from the same group as this work.

Other findings are that denser sampling works better, with features computed at every position performing best, and that performance increases with bigger filter sizes.

I talked to Honglak Lee, who was presenting the poster, about these results. He agreed that they are a blow to the head for the deep learning community: carefully designed and trained deep architectures are outperformed by simple, shallow ones.
When I asked him about future directions for deep learning, Lee said that the field should focus more on larger images and more complicated datasets.
I am not quite sure how deep architectures will cope with larger images, but I am quite sure that deep learning has to switch application domains if it wants to compete with other methods. On the other hand, there is a lot more competition on realistic image data than on these datasets, which were specifically designed for deep learning methods.

I would like to thank the authors for this great contribution. This is the sanity check of deep methods that has been missing for too long. But sadly, the deep methods did not pass.


  1. I don't know that it's quite as big a blow as it sounds.

    Hinton has stated on numerous occasions that adding a new [appropriately sized] layer is guaranteed to improve a lower bound on the /reconstruction/ error, so I'd have expected more information to be available deeper in the network -- though perhaps much of that information is stored in the weights and has been abstracted out, which may be the key to why they encountered these results.

    In the real world, I don't look at a car and consciously or subconsciously examine every visible part to verify that it truly is a car. I've learned to recognize the general shape of a car, so if I'm given a vague outline of one I'll still recognize it as such in spite of the lack of wheels, driver, or other details. Despite the missing details, I'm still able to perform the classification task.

    Now, I've wondered for a while why a deep vs. shallow (or first-layer) representation would make a difference on a classification task. We're not talking about extracting a boatload of information and then performing a complex task with it, but rather funneling the extracted information into a far simpler task -- into which bin does this piece of data belong?

    The 'how' is obviously not a simple task; I mean simple comparatively speaking, relative to far more complex tasks.

    Using the human brain as an example: given the visual and audio inputs I've received, generate impulses that cause my arms to catch a ball that's been hit to the outfield (or some other complex task).

    A deep network allows the system to learn representations that build on each other; e.g., instead of neurons that just say "I can recognize blotches in these locations", deeper-layer neurons combine those blotches to form or recognize more complex representations of the world as the model sees it.

    That isn't a model for "Which bin should I put this picture in?", it's a model for far more complex tasks; it just so happens you can also use it for classification tasks.

    In his Google tech talk, Michael Merzenich gives an intriguing description of how babies learn, and it isn't at all unlike the process deep networks go through to be trained -- though the human brain is likely more appropriately modeled as an RNN than an FFNN.

    A baby spends a significant amount of time just learning representations for the world they live in, then begins correlating the pieces of information they've learned to parse in order to understand how things in their world interact and relate to each other (e.g., when mommy puts a spoon in my mouth, I will taste something I like).

    As I understand them, deeper layers give the network the ability to abstract away from the details and learn more complex combinations. I can see why the classification task worked well using a shallow representation; in a very real sense there's more information available at that layer, for the simple reason that the information hasn't been abstracted away at that point yet.


  2. I did not interpret this work as an attack on the idea of deep learning. It showed that you can do very well with this kind of architecture even if you don't go deep. However, that doesn't mean that you can't do even better if you find a way to extend it to a deep architecture.

    I was personally more impressed by the fact that these useful features came out of such a beautiful, simple-to-implement, intuitive, biologically plausible algorithm that requires almost no parameters -- and that the training is extremely fast too, relatively speaking. These facts alone show that this kind of approach has merit.

    By the way, I would like to point out that I have started a google site + group for people who want to discuss this work and the surrounding issues further. Some of us have also conducted a few experiments on potential ways of extending this work, and shared some of the code. You can find it here:

  3. I didn't see it as an attack; I'm more or less trying to argue that all hope is not lost. I'd much rather have shorter training times with improved success rates.

    I guess my not-long-winded version is: just because classification [seems to be] better performed with a shallow representation doesn't mean deep learning isn't useful.


  4. @Andrej:
    The title definitely mirrors my reception of the work, not the authors' intention. Since they have very popular ongoing work on deep systems, they certainly don't want to attack them.
    But still this is rather surprising and also a little disappointing for many deep learning people.
    I also find the beauty and simplicity of this approach intriguing.
    The lesson that Honglak Lee took from this work (at least the one he shared with me) is that the standard problems in the deep learning community are not that well suited to deep approaches. I think that is also what Brian has been arguing.
    It is definitely promising to combine deep architectures with this new feature extraction approach. But it's quite uncertain if they will add much on this kind of data and task.

    Thanks for sharing your site!


  5. I gave this some more thought, and I'm thinking it might be interesting to try another experiment that uses latent representations from all layers instead of just shallow or deep layers.

    Each layer represents an abstraction of the original information. I think it's also safe to say each deeper layer is a further level of abstraction.

    Therefore, each layer provides a different insight into the raw data that, while captured in the other latent representations, is uniquely expressed in each layer.
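    A rough sketch of the proposed experiment, with made-up names and randomly initialized layers standing in for a trained network: the point is only that every layer's activations, not just the shallowest or deepest, get concatenated into one feature vector for the classifier.

```python
import numpy as np

def all_layer_features(x, weight_mats):
    """Hypothetical sketch: push the input through a stack of encoder
    layers and concatenate EVERY layer's activations into one feature
    vector, instead of keeping only the first or last representation."""
    feats = []
    h = x
    for W in weight_mats:
        h = np.tanh(h @ W)   # one encoder layer (placeholder nonlinearity)
        feats.append(h)      # keep this level of abstraction too
    return np.concatenate(feats, axis=-1)
```

    The classifier then sees each level of abstraction side by side and can weight them itself.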


  6. I think Honglak's take-away message makes a lot of sense. It is definitely my impression also that a deep architecture will become more important when dealing with issues on a more real-world scale -- issues that our brain has to deal with.

    I'm curious about what you think better datasets would look like, where deep methods should clearly have an advantage when done right. I thought about it for a while, and I have an opinion on where things should go, but I think maybe it's too crazy, and I'm not sure if it would be well received :D

    This is really great discussion though, and I would love to discuss more of this kind of stuff on the website I linked. Maybe if we can continue it there then others can choose to chip in too. It would also be nice to have a more central location for this kind of discussion, instead of having it spread around the internet :( I'll re-pose the question there.

  7. A couple of years ago I was doing work on the Walsh Hadamard transform, combining it with random permutations to convert data into a 'Gaussian state'. I formulated a type of neural net based on the concept, but I got really bored with it and dropped it for a while. Now maybe I will look at that whole area again.
    I should say that the code contains some loose ideas that I am playing with. I might decide for or against some of those ideas later.

  8. The Walsh Hadamard transform maps a point to a sequency pattern. It is self-inverse, so it also maps a sequency pattern back to a point. The WHT is computed using patterns of addition and subtraction, which makes it very fast. A paper by Wallace shows that the central limit theorem applies to the output of a WHT -- a fact I independently rediscovered around 2002/2003. I also showed that the WHT can be combined with random permutations to convert arbitrary numerical data into data with a Gaussian distribution, and I created a pure linear-algebra neural net based on that. Its exact learning capacity is 1 memory per weight vector. However, in higher-dimensional spaces any similarity at all between 2 vectors is extremely unusual and basically cannot happen by chance. Hence, even when the 1-memory-per-weight-vector limit is exceeded, the output of the neural net I created is still very much closer to the target vector than could possibly happen by chance. On that basis I think this neural net should be investigated further.
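    For what it's worth, here is a minimal sketch of the fast Walsh-Hadamard transform and the permute-then-transform construction described above; the function names are mine, and the orthonormal 1/sqrt(n) scaling is one convention that makes the transform self-inverse.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform with orthonormal scaling.

    With the 1/sqrt(n) factor the transform is its own inverse:
    fwht(fwht(x)) == x. Input length must be a power of 2.
    """
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    assert n & (n - 1) == 0, "length must be a power of 2"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # butterfly: pairwise sums...
            x[i + h:i + 2 * h] = a - b  # ...and differences
        h *= 2
    return x / np.sqrt(n)

def gaussianize(x, rng):
    """Randomly permute, then transform: each output coordinate becomes a
    +/-1-weighted sum of all inputs, so for generic data the outputs are
    approximately Gaussian (the central-limit-theorem effect noted above)."""
    return fwht(np.asarray(x, dtype=float)[rng.permutation(len(x))])
```

    Because each output is a +/-1-weighted sum of every input, permuting first scrambles any structure in the input ordering, which is what pushes the output distribution toward Gaussian.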

