It's

**explicit feature maps**. So what is this about?

We all know and love the kernel trick: have some vector representation X of your data and some algorithm (like SVMs) that rely only on dot products in the input space. Replace the dot products by some positive definite kernel and by Mercer's Theorem there is some Hilbert space of functions Y and some mapping

phi: X -> Y such that your kernelized algorithm works as if you mapped your input to phi.

This extremely powerfull method is now used everywhere, most of all for SVMs.

But it has several significant drawbacks in this case:

- Training using a non-linear kernel is a lot slower [O(n^3)] than training a linear SVM [O(n)].
- Kernel SVMs take a lot of memory to store, since they are "non-parametric" in a sense. In addition to the Lagrange-Multipliers alpha, all the support vectors have to be stored.
- Recall is painfully slow: The kernel has to be evaluated between each testing instance and all support vectors. This means recall time is linear in the number of support vectors - which is (sort of) linear in the number of training examples!

For this reasons, many state-of-the-art detection systems (which have to handle many bounding boxes and therefore need fast recall) rely on linear SVMs - which seems to work quite well with HoG.

But we don't want to be restricted to linear (in the input space) classifiers.

So what do we do?

The answer seems to be: make phi explicit! Then we can just use standard linear SVMs on Y, using phi(x) as our input representation.

At first sight, this seems somewhat counter-intuitive - at least to me.

Weren't we so proud that we could handle infinite dimensional features using kernels? And now you want to make this explicit?

But let's back up a little bit, to the beginning of this story.

I feel it starts with "Classification using Intersection Kernel Support Vector Machines is Efficient" by Maji, Berg and Malik.

This work stems from the need for fast recall for detection. But Maji et al did not want to be restricted to using a linear kernel. So they thought up a trick to speed up intersection kernel evaluation:

- They decompose the decision function into a sum of function on the coordinates. This is possible, since the intersection kernel is
**additive**, meaning that the results for individual components are just summed up. - For each component, the decision function is piecewise linear (it changes only when the min(x,s_i) between x and the support vectors s_i change).

Using a fixed approximation of the piecewise linear function in 2., the authors even reached O(n).

In Max-Margin Additive Classifiers for Detection, the autors went a step further and used their insight into additive kernels to give explicit embeddings for the intersection kernel based on discretizations.

The next big step was done by Andrea Vedaldi and Andrew Zisserman in "Efficient Additive Kernels via Explicit Feature Maps".

Here, Vedaldi and Zisserman give a general construction for explicit embeddings of additive kernels. The dimensionality of the embedding can be chosen arbitrary and they give an error bound for a given dimensionality.

They apply their theoretical insights to approximating the chi2 kernel with very encouraging results.

The latest work in this line seems to be "Large-Scale Image Categorization with Explicit Data Embedding" by Florent Perronnin, Jorge Sanchez and Yan Liu.

One of the goals of their work is to extend explicit embeddings from additive to multiplicative kernels.

I have to admit that I did not read this work in detail but I think most of the ideas used were already present in the literature:

Using random projections and kernel PCA.

Kernel PCA gives an easy way to find an embedding on which the decision function for a given kernel SVM is linear. It has the significant downside that first the kernel and the Kernel PCA for the training set have to be computed.

Perronnin and his coauthors propose a new method for additive kernels that speeds up this process.

While the authors are able to work with exponential kernels on massive datasets, they conclude that the small improvement over additive kernels is outweighted by the large computational cost of approximating the kernels.

So is approximating non-linear kernels the way to go? It seems some researchers think it is...

But wasn't the point of using the kernels that they are easy to compute? So maybe we should instead look for clever embeddings that work well in practice and stop caring about kernels at all ?

We'll see...

Oh and it seems I didn't really cling as much to Flerent Perronsin's talk as the title suggests. Well at least his talk was what got me interested in these ideas and I think I covered at least partially what he said.

"So is approximating non-linear kernels the way to go? It seems some researchers think it is...

ReplyDeleteBut wasn't the point of using the kernels that they are easy to compute? So maybe we should instead look for clever embeddings that work well in practice and stop caring about kernels at all ?

We'll see."

This is exactly the point I was thinking about after his lecture. Why not look directly for ways to efficiently extract features from a compressed representation?

Well, he also presented the Fisher kernel, which is a more ad-hoc embedding that works well in practice.

ReplyDeletePerronnin's research focus seems to shift in this direction so maybe he'll agree with you there.

On the other hand, being able to model arbitrary decision boundaries as the rbf kernel can is kind of a nice feature ;)