### [CVML] Random Facts and Advice

Some tips and facts that I took from the summer school. They are pretty random, but may be useful for practitioners of vision.

Many may seem obvious - but I just hadn't seen them before ...

Gist doesn't work on cropped or rotated images. Since it does a kind of whole-image template matching, this is pretty clear. And maybe it shouldn't work - the scene layout has changed, after all.

For doing BoW, Cordelia Schmid suggests (and I guess uses) 6x6 patches and a pyramid with a scale factor of 1.2. Scaling is done by Gaussian convolution.

Ponce uses 10x10 patches to do sparse coding.

If you combine multiple features using MKL or by just adding up kernels (which is the same as concatenating features), normalize each feature by its variance and then search for a joint gamma. This heuristic gets you out of doing a grid search over a huge space!
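In code, the heuristic might look like this (a sketch with made-up data; scaling each feature block to unit total variance is my interpretation of "normalize each feature by its variance"):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical feature blocks on very different scales (made-up data).
feats_a = 100.0 * rng.normal(size=(50, 128))
feats_b = 0.01 * rng.normal(size=(50, 32))

def unit_variance(X):
    # Rescale a feature block to unit total variance, so no single block
    # dominates when kernels are summed (or features concatenated).
    return X / np.sqrt(X.var(axis=0).sum())

combined = np.hstack([unit_variance(feats_a), unit_variance(feats_b)])
# A single RBF gamma can now be searched for `combined`, instead of a
# grid search over one gamma (and one weight) per feature block.
```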

Cordelia Schmid thinks that "clever clusters" don't help much in doing BoW. She thinks it's more important to work on the method - for example by using spatial pyramid kernels - than to try to find "good" clusters. Since the clusters only serve as a descriptor afterwards, even discriminatively trained clusters don't help much at a larger scale.

Pyramid matching kernels don't improve upon BoW on Pascal: there is no fixed layout that could be matched.

MKL means group sparsity. Since there is an L2 penalty per kernel, the only difference with MKL is that it puts an additional group L1 penalty on the features.

MKL can be used if you have many features and don't know which are good - or if you have many kernel parameters. In principle, one can just combine different kernels with MKL and the "best" ones will be used.

Norms that induce stronger sparsity than L1 - like L0.5 - are not often used, since they don't lead to convex optimization problems.

If you want to train a sparse dictionary for denoising, use random filters as an initialization to train on a big dataset of patches. Then use this dictionary as an initialization to train on the image you actually want to denoise.
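A minimal numpy sketch of the two-stage idea, using plain ISTA for the sparse coding step and a least-squares dictionary update (all sizes and data here are made up; a real denoiser would use actual image patches and a tuned solver):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(D):
    # Keep dictionary atoms (columns) at unit norm.
    return D / np.linalg.norm(D, axis=0, keepdims=True)

def sparse_code(X, D, lam, n_iter=30):
    # ISTA: minimize ||X - D A||_F^2 / 2 + lam * ||A||_1 over the codes A.
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        G = A - D.T @ (D @ A - X) / L      # gradient step
        A = np.sign(G) * np.maximum(np.abs(G) - lam / L, 0.0)  # soft-threshold
    return A

def learn_dictionary(X, D, lam=0.1, n_iter=5):
    # Alternate sparse coding with a regularized least-squares dictionary update.
    for _ in range(n_iter):
        A = sparse_code(X, D, lam)
        D = normalize(X @ A.T @ np.linalg.pinv(A @ A.T + 1e-6 * np.eye(A.shape[0])))
    return D

patch_dim, n_atoms = 100, 64                    # e.g. 10x10 patches
D = normalize(rng.normal(size=(patch_dim, n_atoms)))  # random filters as init

generic_patches = rng.normal(size=(patch_dim, 1000))  # stand-in for a big patch set
D = learn_dictionary(generic_patches, D)              # stage 1: generic patches

image_patches = rng.normal(size=(patch_dim, 200))     # stand-in for the target image
D = learn_dictionary(image_patches, D)                # stage 2: fine-tune on it
```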

To rebalance an unbalanced dataset, set C_pos = 1 / num_pos and C_neg = 1 / num_neg. This corresponds (nearly) to duplicating the examples of the smaller class until the classes are balanced.
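Spelled out on toy labels (the per-class C values are what SVM packages with per-class weights, e.g. liblinear-style implementations, would consume):

```python
import numpy as np

y = np.array([1] * 10 + [-1] * 990)      # toy, heavily unbalanced labels
num_pos = np.sum(y == 1)
num_neg = np.sum(y == -1)

# Per-class regularization constants as suggested above.
C_pos = 1.0 / num_pos
C_neg = 1.0 / num_neg

# Both classes now carry equal total weight - (nearly) the same effect
# as duplicating the minority examples until the classes are balanced.
total_pos_weight = C_pos * num_pos
total_neg_weight = C_neg * num_neg
```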

If you are doing kernel PCA, don't forget to center the kernel first!
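Centering the kernel means centering in feature space; a small sketch with a linear kernel as a stand-in (for the linear kernel, the result equals the kernel of the mean-subtracted data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
K = X @ X.T                               # linear kernel as a stand-in

def center_kernel(K):
    # Feature-space centering: K_c = (I - 1/n) K (I - 1/n), i.e. subtract
    # the row and column means and add back the grand mean.
    n = K.shape[0]
    J = np.full((n, n), 1.0 / n)
    return K - J @ K - K @ J + J @ K @ J

Kc = center_kernel(K)
```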

If you have a lot of data, dense sampling for BoW always works better than keypoints.

When training a sliding window detector, always use bootstrapping to get hard negatives.

For training CRFs on images, the pairwise features are usually a lot simpler than the unary potentials. Therefore it pays to first train the unary potentials, fix them, and then train the pairwise potentials. (Christoph Lampert told me that training the unary potentials again afterwards doesn't really help, and that it is better to have strong classifiers like SVMs than to adjust them to the pairwise potentials.)

Don't use MNIST or Caltech in your work. These datasets have no point any more.

If you do classification, use at least Pascal, better ImageNet - or another one of the big natural image datasets. It doesn't seem sensible to try to learn "chair" from Pascal.

Use SGD only for empirical risk minimization, not when you want an accurate optimum of the training objective.

When doing KMeans, the inverse of the Hessian is given by the inverse of the number of points belonging to a cluster. Using this as a learning rate with SGD makes KMeans on large datasets very fast.
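A one-pass sketch of this online KMeans on toy data (cluster locations and sizes are made up): with step size 1/count, each center is exactly the running mean of the points assigned to it so far.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated toy clusters around (0, 0) and (5, 5).
X = np.vstack([rng.normal(size=(1000, 2)),
               rng.normal(size=(1000, 2)) + 5.0])
rng.shuffle(X)

k = 2
centers = X[:k].copy()
counts = np.zeros(k)

for x in X:
    j = np.argmin(((centers - x) ** 2).sum(axis=1))   # nearest center
    counts[j] += 1
    # 1 / counts[j] is the inverse Hessian of the KMeans objective for
    # cluster j; with this step size, centers[j] stays the running mean
    # of all points assigned to cluster j so far.
    centers[j] += (x - centers[j]) / counts[j]
```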

The screeching of an analog modem before connecting is a calibration procedure using online learning.

In MNIST (which you should not use), the training data comes from 600 writers, while the test data comes from 100 different writers!

When using BoW, the Hellinger kernel usually works better than a linear kernel. In other words: take the square roots of the histograms before training with a linear kernel.
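The square-root trick on toy count histograms (the inner products of the square-rooted, L1-normalized histograms are exactly the Hellinger kernel values):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.integers(0, 20, size=(10, 50)).astype(float)  # toy BoW count histograms

# L1-normalize, then take square roots: a linear kernel on the result is
# exactly the Hellinger (Bhattacharyya) kernel on the normalized histograms.
H = H / H.sum(axis=1, keepdims=True)
H_sqrt = np.sqrt(H)

K = H_sqrt @ H_sqrt.T
```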

You can do max-pooling on signed features by doubling the feature size and representing each coordinate x by the tuple (max(0, x), max(0, -x)).
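A tiny example of this sign-splitting trick:

```python
import numpy as np

def split_signs(X):
    # Represent each signed coordinate x by the pair (max(0, x), max(0, -x)),
    # so max-pooling keeps strong negative responses instead of dropping them.
    return np.concatenate([np.maximum(0.0, X), np.maximum(0.0, -X)], axis=-1)

# Max-pool over a set of local descriptors (rows); the feature size doubles.
X = np.array([[1.0, -3.0],
              [2.0,  0.5]])
pooled = split_signs(X).max(axis=0)            # -> [2.0, 0.5, 0.0, 3.0]
```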

In TV and cinema, 35% of all pixels belong to people. On YouTube, it's 40%.

