ICCV! In Barcelona! Regrettably, I had to stay home in cold Bonn.
Today, I went through the accepted papers, and one of the many I found interesting was "Ask the locals: multi-way local pooling for image recognition" by Y-Lan Boureau, Nicolas Le Roux, Francis Bach, Jean Ponce and Yann LeCun.
Many big names on this one :)
In this work the authors highlight a feature of many recent coding algorithms for visual descriptors: locality in the feature space.
They formulate the encoding as a max pooling operation that is local both in the image and in feature space, using a coarse k-means clustering of the features (which are histograms of sparse codes, if I understood correctly).
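To make the idea concrete for myself, here is a minimal NumPy sketch of pooling that is local in both the image and feature space. This is my own illustration, not the authors' code. The names `local_max_pool`, `descriptors`, `codes`, `cell_ids` and `centers` are all made up for this example, and I am assuming that the cluster centers come from some coarse k-means run beforehand, that each descriptor already has a non-negative code, and that each descriptor has been assigned to a spatial cell.

```python
import numpy as np

def local_max_pool(descriptors, codes, cell_ids, centers, n_cells):
    """Max-pool codes separately for every (spatial cell, feature cluster) pair.

    descriptors: (n, d) array used only to pick the feature-space bin
    codes:       (n, c) array of non-negative codes to be pooled (assumption)
    cell_ids:    (n,) integer spatial cell index of each descriptor
    centers:     (k, d) cluster centers, e.g. from a coarse k-means (assumption)
    """
    # Squared Euclidean distance of every descriptor to every center.
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    # Hard assignment: each descriptor falls into the bin of its nearest center.
    bins = dists.argmin(axis=1)

    k, code_dim = len(centers), codes.shape[1]
    # Empty (cell, bin) pairs stay at zero, which is why codes are assumed >= 0.
    pooled = np.zeros((n_cells, k, code_dim))
    for c in range(n_cells):
        for j in range(k):
            mask = (cell_ids == c) & (bins == j)
            if mask.any():
                pooled[c, j] = codes[mask].max(axis=0)
    # Concatenate all per-bin pooled vectors into one image-level feature.
    return pooled.reshape(-1)
```

With k feature-space bins, the final vector is k times longer than with plain spatial max pooling, which may be part of why a small dictionary can still do well, though that is my reading rather than something I checked.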
The paper reports very good results on Caltech 101 and 256, and the scenes dataset. In particular, good results are achieved with quite small dictionaries, i.e. of size 256.
My colleague Hannes pointed out that the feature space binning is basically a layer of an RBF network, which is not mentioned in the paper at all.
Well, RBF networks make softer decisions, but use the same basic idea.
Maybe we should start using them again...
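To illustrate the comparison, here is a small sketch of hard feature space binning next to RBF-style soft activations. Again, this is only my own toy version: the function names, the Gaussian form with a single width parameter `gamma`, and the normalization over centers are assumptions for the example, not anything taken from the paper.

```python
import numpy as np

def sq_dists(X, centers):
    # Squared Euclidean distance of every row of X to every center.
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)

def hard_bins(X, centers):
    """One-hot assignment of each point to its nearest center (hard binning)."""
    d = sq_dists(X, centers)
    onehot = np.zeros_like(d)
    onehot[np.arange(len(X)), d.argmin(axis=1)] = 1.0
    return onehot

def rbf_activations(X, centers, gamma=1.0):
    """Gaussian RBF activations, normalized to sum to one per point (assumption)."""
    a = np.exp(-gamma * sq_dists(X, centers))
    return a / a.sum(axis=1, keepdims=True)
```

For a large `gamma` the soft activations get very close to the one-hot assignment, so in this toy setup the hard binning looks like a limiting case of the RBF layer.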
However, there is a clear take-home message from this paper: don't pool stuff that is different just because it is close together in the image.
This point was made before in the literature, but definitely not as strongly.