Label Embedding Trees for Large Multi-Class Tasks by Samy Bengio, Jason Weston and David Grangier is a paper about coping with classification in the presence of huge amounts of data and many, many classes. Of course the ImageNet challenge comes to mind. But what they actually did is classification on all of ImageNet, which is about 16,000 classes and over a million instances.
This work focuses on very fast recall but is very expensive to train - apparently the model was trained on a Google cluster.
This paper explores two ideas: hierarchical classification and low dimensional embeddings.
For the hierarchical classification part, a tree of label sets is built where the leaves are single classes.
Starting from all classes at the root, the set of classes is split into subsets of similar classes, which are represented by new nodes. This is done by training one-vs-rest classifiers and inspecting the confusion matrix: classes that are easy to confuse are put into the same node.
At test time, only log(n) many classifiers, each over only a few classes, have to be evaluated.
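To make the splitting step concrete, here is a minimal sketch of partitioning a label set by its confusion matrix. The paper uses spectral clustering; this toy version implements a single 2-way spectral cut with plain numpy (the confusion matrix and labels are made-up examples, not from the paper):

```python
import numpy as np

def split_labels(confusion, labels):
    """Split a label set into two groups so that easily confused
    classes land in the same group (one 2-way spectral cut)."""
    # Affinity is high where classes are often confused with each other.
    A = confusion + confusion.T
    # Unnormalized graph Laplacian of the affinity graph.
    L = np.diag(A.sum(axis=1)) - A
    # The Fiedler vector (eigenvector of the 2nd smallest eigenvalue)
    # gives a 2-way partition by sign.
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    left = [l for l, v in zip(labels, fiedler) if v < 0]
    right = [l for l, v in zip(labels, fiedler) if v >= 0]
    return left, right

# Toy confusion matrix: classes 0/1 are confusable, and so are 2/3.
C = np.array([[50., 10.,  1.,  0.],
              [12., 48.,  0.,  1.],
              [ 0.,  1., 47., 13.],
              [ 1.,  0., 11., 49.]])
print(split_labels(C, [0, 1, 2, 3]))
```

Recursing this split on each resulting group would yield the label tree; at test time one then walks from the root down, evaluating only the classifier of the current node.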
One obvious downside of the procedure is that it assumes it is feasible to train a one-vs-rest classifier for every class. [edit] It seems I misunderstood the procedure: training all one-vs-rest classifiers is needed only at the root. [/edit] If you have a cluster - no problem. If you don't, this is not for you.
The other part of the paper is about low-dimensional embeddings for classification. Here both the image features and the labels are jointly embedded into a low-dimensional space using linear transformations.
Classification is then done by finding the label embedding nearest to a given embedded image. The idea is straightforward and makes recall very fast.
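A minimal sketch of that recall step, with random matrices standing in for the learned embeddings (in the paper both maps are trained jointly; the dimensions and names here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_labels = 10, 3, 5           # feature dim, embedding dim, #classes
V = rng.normal(size=(k, d))          # image embedding map (stand-in for learned)
W = rng.normal(size=(k, n_labels))   # one embedded point per label (columns)

def classify(x):
    """Embed the image and return the index of the nearest label embedding."""
    z = V @ x                                        # embed image: O(k*d)
    dists = np.linalg.norm(W - z[:, None], axis=0)   # distance to each label
    return int(np.argmin(dists))

x = rng.normal(size=d)
print(classify(x))
```

Recall cost is one k-by-d matrix-vector product plus a nearest-neighbour search over the label embeddings, which is why this is so fast compared to evaluating thousands of full classifiers.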
The paper contains evaluations of the tree structure with a simple classifier, the embedding method alone, and a combination that uses the embedding method as the classifier inside the tree structure.
The paper also features an experiment where the actual category tree of ImageNet is used instead of the learned tree. The results are quite bad. This is to be expected, since the semantic class hierarchy does not have much to do with the visual appearance of objects, and so very tough decisions are made very early.
I very much like the idea of learning a tree structure over labels on datasets with many classes, but as I said before, this approach is very impractical if you have limited resources - like me, with just about 25 CPUs and GPUs each.