Some tips and facts that I took from the summer school. They are pretty
random but may be useful for practitioners of vision.
Many may seem obvious - but I just didn't see it before ...
Gist doesn't work on cropped or rotated images. Since it does a kind of whole image template matching, this is pretty clear. And maybe it shouldn't - the scene layout is changed after all.
For doing BoW Cordelia Schmid suggests (and I guess uses) 6x6 patches and a pyramid with scale factor 1.2. Scaling is done by Gaussian convolution.
Ponce uses 10x10 patches to do sparse coding.
If you combine multiple features using MKL or by just adding up kernels (which
is the same as concatenating features), normalize each feature by its variance and
then search for a joint gamma. This heuristic gets you out of doing grid search
over a huge space!
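A minimal sketch of this heuristic, under a few assumptions of mine: the kernels are RBF kernels, "normalize by its variance" means rescaling each feature block so that its total variance is 1, and the kernels are simply summed. The names normalize_block and combined_kernel are made up for illustration.

```python
import numpy as np

def normalize_block(X):
    """Rescale one feature block so its total variance (summed over dimensions) is 1."""
    total_var = X.var(axis=0).sum()
    return X / np.sqrt(total_var)

def rbf_kernel(X, gamma):
    """Plain RBF kernel matrix exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def combined_kernel(blocks, gamma):
    """Sum of per-feature RBF kernels that all share a single gamma."""
    return sum(rbf_kernel(normalize_block(X), gamma) for X in blocks)
```

After the normalization all blocks live on a comparable scale, so only one gamma has to be searched instead of one per feature.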
Cordelia Schmid thinks that "clever clusters" don't help much in doing BoW.
She thinks it's more important to work on the method - for example by
using spatial pyramid kernels - than trying to find "good" clusters.
Since they only serve as a descriptor afterwards, even discriminatively
trained clusters don't help much at a larger scale.
Pyramid matching kernels don't improve upon BoW on Pascal. There is no
fixed layout that could be matched.
MKL means group sparsity. Since there is an L2 penalty per kernel, the only difference MKL makes is to put an additional group L1 penalty on the features.
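To make the group sparsity statement concrete, here is the comparison written out, as I understand the standard formulation (the notation is mine: w_k is the weight block for kernel k, phi_k its feature map, and the loss is the hinge loss). An SVM on the concatenated features penalizes the sum of squared block norms, while MKL penalizes the squared sum of block norms, which is an L1 norm over the groups:

```latex
\text{SVM on concatenated features:}\quad
\min_{w,b}\ \frac{1}{2}\sum_{k}\lVert w_k\rVert_2^2
  + C\sum_i \max\Bigl(0,\ 1 - y_i\bigl(\textstyle\sum_k \langle w_k,\phi_k(x_i)\rangle + b\bigr)\Bigr)

\text{MKL:}\quad
\min_{w,b}\ \frac{1}{2}\Bigl(\sum_{k}\lVert w_k\rVert_2\Bigr)^{2}
  + C\sum_i \max\Bigl(0,\ 1 - y_i\bigl(\textstyle\sum_k \langle w_k,\phi_k(x_i)\rangle + b\bigr)\Bigr)
```

The L1 norm over the block norms is what drives whole blocks w_k, and therefore whole kernels, to zero.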
MKL can be used if you have many features and you don't know which are good - or if you have many kernel parameters. In principle one can just combine different
kernels with MKL and the "best" will be used.
Norms that induce stronger sparsity than L1 - like L0.5 - are not often used
since they don't lead to convex optimizations.
If you want to train a sparse dictionary for denoising, use random filters
as an initialization to train on a big dataset of patches.
Then use this dictionary as an initialization to train on the image
you actually want to denoise.
To rebalance an unbalanced dataset, set C_pos = 1 / num_pos, C_neg = 1 / num_neg.
This corresponds (nearly) to replicating examples of the smaller class until both classes have equal weight.
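As a small sketch, assuming labels in {+1, -1} (the helper name balanced_costs is made up), the per-class costs can be computed like this:

```python
def balanced_costs(labels):
    """Return (C_pos, C_neg) = (1 / num_pos, 1 / num_neg) for labels in {+1, -1}."""
    num_pos = sum(1 for y in labels if y > 0)
    num_neg = len(labels) - num_pos
    return 1.0 / num_pos, 1.0 / num_neg

# Example: 10 positives, 90 negatives.
labels = [1] * 10 + [-1] * 90
C_pos, C_neg = balanced_costs(labels)
# Both classes now carry the same total weight in the loss:
# num_pos * C_pos == num_neg * C_neg == 1
```

Most SVM packages accept such per-class costs in some form; how they are passed depends on the package.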
If you are doing kernel PCA, don't forget to center the kernel first!
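Centering the kernel means centering the data in feature space, which can be done on the kernel matrix alone. A minimal sketch for the training kernel (the function name center_kernel is mine):

```python
import numpy as np

def center_kernel(K):
    """Center an (n, n) kernel matrix in feature space.

    Computes K - 1_n K - K 1_n + 1_n K 1_n, where 1_n is the (n, n)
    matrix with all entries equal to 1 / n.
    """
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n
```

For a linear kernel this gives the same result as subtracting the mean from the features before computing the kernel, which is a handy sanity check.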
If you have a lot of data, dense sampling for BoW always works better than keypoints.
When training a sliding window detector, always use bootstrapping to get hard negatives.
For training CRFs on images, the pairwise features are usually a lot simpler than the unary potentials. Therefore it pays to first train the unary potentials, fix them, and then train the pairwise potentials. (Christoph Lampert told me that training the unary potentials again afterwards doesn't really help and that it is better to have strong classifiers like SVMs than to adjust to the pairwise potentials.)
Don't use MNIST or Caltech in your work. These datasets have no point any more.
If you do classification, do at least Pascal, better ImageNet - or another one
of the big natural image datasets. It doesn't look sensible to try and learn
"chair" from Pascal.
Use SGD only for empirical risk minimization, not when you want a good optimization.
When doing KMeans, the inverse of the Hessian is given by the inverse of the number
of points belonging to a cluster. Using this as a learning rate and SGD makes
KMeans on large datasets very fast.
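A rough sketch of what such an online KMeans could look like, assuming random data points as initial centers and one point per update (the function name sgd_kmeans and its parameters are my own choices):

```python
import numpy as np

def sgd_kmeans(X, k, n_epochs=1, seed=0):
    """Online KMeans: per-cluster learning rate 1 / (points assigned so far)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize the centers with k distinct random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            x = X[i]
            # Find the closest center for this single point.
            j = np.argmin(((centers - x) ** 2).sum(axis=1))
            counts[j] += 1
            # Gradient step with learning rate 1 / counts[j].
            centers[j] += (x - centers[j]) / counts[j]
    return centers
```

With this learning rate each center is just the running mean of the points assigned to it so far, so no step size has to be tuned.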
The screeching of an analog modem before connecting is a calibration procedure using online learning.
In MNIST (which you should not use), the training data comes from 600 writers, the test
data from 100 different writers!
When using BoW, the Hellinger kernel usually works better than a linear kernel.
In other words: take the square roots of the histograms before training a linear classifier.
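A minimal sketch of the square-root trick, assuming each row is one non-negative BoW histogram and that the histograms are L1-normalized first (the normalization and the name hellinger_map are my additions):

```python
import numpy as np

def hellinger_map(H):
    """L1-normalize each histogram (row) and take element-wise square roots."""
    H = np.asarray(H, dtype=float)
    H = H / H.sum(axis=1, keepdims=True)
    return np.sqrt(H)

# A linear kernel on the mapped histograms is the Hellinger kernel
# sum_i sqrt(h_i * h'_i) on the normalized histograms.
H = np.array([[1, 3], [2, 2]])
Z = hellinger_map(H)
K = Z @ Z.T
```

Any linear classifier trained on Z then behaves as if it used the Hellinger kernel on H.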
You can do max-pooling on signed features by doubling the feature size and representing
each coordinate by a tuple max(0,x), max(0,-x).
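A short sketch of this trick, assuming the local descriptors of one image are stacked as rows of an array (the function names are made up):

```python
import numpy as np

def split_signs(X):
    """Map each coordinate x to the pair (max(0, x), max(0, -x)), doubling the size."""
    return np.concatenate([np.maximum(X, 0), np.maximum(-X, 0)], axis=-1)

def max_pool_signed(X):
    """Max-pool signed descriptors X of shape (n_descriptors, d) into a vector of size 2 * d."""
    return split_signs(X).max(axis=0)

X = np.array([[1.0, -3.0],
              [-2.0, 0.5]])
pooled = max_pool_signed(X)  # first d entries: positive parts, last d: negative parts
```

This way a strong negative response is not lost during pooling, which would happen with a plain max over the signed values.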
In TV and cinema, 35% of all pixels belong to people. On YouTube, it's 40%.