NIPS 2010 - Label Embedding Trees for Large Multi-Class Tasks

Label Embedding Trees for Large Multi-Class Tasks by Samy Bengio, Jason Weston and David Grangier is a paper about coping with classification in the presence of huge amounts of data and very many classes. Of course the ImageNet challenge comes to mind. But what they actually did is classification on all of ImageNet, which is about 16,000 classes and over a million instances.
This work focuses on very fast recall but is very expensive to train - apparently the model was trained on a Google cluster.

This paper explores two ideas: hierarchical classification and low dimensional embeddings.
For the hierarchical classification part, a tree of label sets is built, where the leaves are single classes.
Starting from all classes at the root, the set of classes is recursively split into subsets of similar classes, which are represented by new nodes. This is done by training one-vs-rest classifiers and inspecting the resulting confusion matrix: classes that are easily confused are put into the same node.
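The splitting step can be sketched with off-the-shelf spectral clustering on the confusion matrix; the paper optimizes its own relaxed partitioning objective, so this is only a rough stand-in, and the function name and toy matrix below are mine:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def split_classes(confusion, n_children=2, seed=0):
    """Partition a set of classes into child nodes by clustering the
    confusion matrix: classes that are often confused with each other
    end up in the same child. A sketch of the idea, not the paper's
    exact objective."""
    # Symmetrize so "i confused with j" counts in both directions.
    affinity = confusion + confusion.T
    np.fill_diagonal(affinity, 0.0)
    labels = SpectralClustering(
        n_clusters=n_children,
        affinity="precomputed",
        random_state=seed,
    ).fit_predict(affinity)
    return [np.flatnonzero(labels == k) for k in range(n_children)]

# Toy confusion matrix over 4 classes: {0, 1} and {2, 3} confuse
# each other, so the split should recover those two groups.
C = np.array([[50, 10, 1, 0],
              [12, 48, 0, 1],
              [0, 1, 47, 13],
              [1, 0, 11, 49]], dtype=float)
groups = split_classes(C)
```

In deeper levels the same routine would be applied to the submatrix of the confusion matrix restricted to the classes at that node.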

At test time, only about log(n) classifiers, each over only a few classes, have to be evaluated.
One obvious downside of the procedure is that it assumes it is feasible to train one-vs-rest classifiers for all classes. [edit] It seems I misunderstood the procedure: training all one-vs-rest classifiers is needed only once, for the root. [/edit] If you have a cluster - no problem. If you don't, this is not for you.
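The test-time descent itself is cheap in any case. A minimal sketch of walking such a tree (the node structure and the toy tree are my own illustration, not the paper's code):

```python
import numpy as np

class LabelTreeNode:
    """One node of a label tree: either a leaf holding a single class,
    or an internal node with a small linear classifier that routes an
    input to one of its children (hypothetical structure)."""
    def __init__(self, label=None, children=None, weights=None):
        self.label = label          # class id at a leaf, else None
        self.children = children or []
        self.weights = weights      # one weight row per child

def predict(node, x):
    # Walk from the root to a leaf; each internal node computes only a
    # handful of scores, so a balanced tree over n classes needs about
    # log(n) classifier evaluations in total.
    while node.children:
        scores = node.weights @ x
        node = node.children[int(np.argmax(scores))]
    return node.label

# Toy two-level tree over 4 classes living in the quadrants of R^2:
# the root routes on the sign of x[0], the children on the sign of x[1].
leaves = [LabelTreeNode(label=i) for i in range(4)]
left = LabelTreeNode(children=leaves[:2],
                     weights=np.array([[0.0, 1.0], [0.0, -1.0]]))
right = LabelTreeNode(children=leaves[2:],
                      weights=np.array([[0.0, 1.0], [0.0, -1.0]]))
root = LabelTreeNode(children=[left, right],
                     weights=np.array([[1.0, 0.0], [-1.0, 0.0]]))
```

For example, `predict(root, np.array([1.0, 1.0]))` descends left then up, reaching class 0 after evaluating only two tiny classifiers.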

The other part of the paper is about low-dimensional embeddings for classification. Here both the image features and the labels are jointly embedded into a low-dimensional space using a linear transformation.
Classification is then done by finding the label nearest to a given embedded image. The idea is straightforward and makes recall very fast.
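The prediction step might look like the following, where `W` embeds the image features and the rows of `V` are the embedded labels; both are assumed to have been learned jointly beforehand, and everything here is an illustrative toy rather than the paper's training code:

```python
import numpy as np

def embed_predict(x, W, V):
    """Embed the image features x with W and return the index of the
    nearest embedded label (the rows of V)."""
    z = W @ x                                  # d_features -> d_embed
    dists = np.linalg.norm(V - z, axis=1)      # distance to each label
    return int(np.argmin(dists))

# Toy setup: 2-dimensional embedding space, three labels.
W = np.eye(2)                                  # identity "embedding"
V = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
pred = embed_predict(np.array([0.9, 0.1]), W, V)
```

With many classes the argmin over labels can itself be accelerated with approximate nearest-neighbour search, which is where the very fast recall comes from.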

The paper contains evaluations of the tree structure with a simple classifier, of the embedding method alone, and of a combination that uses the embedding as the classifier inside the tree structure.

The paper also features an experiment where the actual category tree structure of ImageNet is used instead of the learned tree. The results are quite bad. This is to be expected, since the semantic class hierarchy has little to do with the visual appearance of objects, and so very tough decisions are made very early.

I very much like the idea of learning a tree structure over the labels of datasets with many classes, but as I said before, this approach is very impractical if you have limited resources - like me, with only about 25 CPUs and GPUs.


  1. 2 remarks:
    - I think the one-vs-rest classifiers are trained only once. Afterwards you can use a submatrix of the original confusion matrix for every internal node in subsequent splits.

    - with 25 CPUs and GPUs, you can deal with even bigger problems (datasets) than those described in the paper if you parallelize your computations ...

  2. Thanks for your comment. I took a look at the paper and you are right about computing the confusion matrix only once. When I wrote the entry, I had only talked to Samy Bengio at the poster.
    Even though we talked for quite a while, it seems I have misunderstood the approach. Thanks for pointing that out.

  3. Hi,

    Thanks for your informative blog post.

    Actually I am also working on label tree embedding methods nowadays. Have you implemented it by any chance? If yes, would you consider sharing it?

    Thanks again,

