Tuesday, July 2, 2013

ICML 2013 Reading List

ICML has been over for two weeks now, but I still wanted to write about my reading list, as there were some quite interesting papers (the proceedings are here). Also, I haven't blogged in ages, for which I really have no excuse ;)

There are three topics that I am particularly interested in and that got a lot of attention at this year's ICML: neural networks, feature expansion and kernel approximation, and structured prediction.

But first:

Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures

James Bergstra, Daniel Yamins, David Cox

This is the newest in a series of papers by James Bergstra on hyperparameter optimization. I quite enjoy his work, and his hyperopt software is in active use in my lab. In computer vision applications in particular, there is so much engineering that it is very hard to separate research contributions from engineering contributions. This paper shows 1) how important engineering is and 2) how far automation of the engineering part can really go.

Neural Networks

Now, let's come to what is perhaps the most unlikely candidate, neural networks.
They have gained a lot of attention in the more machine-learny circles in the last couple of years. Still, I was a bit surprised how many papers, in particular very empirical ones, made it into ICML.

Regularization of Neural Networks using DropConnect

Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, Rob Fergus

One of the zoo of follow-ups on the drop-out work by Hinton, this paper suggests randomly setting weights to zero instead of hidden unit activations. It achieves better accuracy and is more efficient than drop-out.
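
To make the difference concrete, here is a minimal NumPy sketch of the two masking schemes for a single ReLU layer. It is my own illustration, not the authors' code: the function names, the `keep_prob` parameter and the use of one mask for the whole batch are simplifying assumptions, and the rescaling and inference-time approximation from the papers are left out.

```python
import numpy as np

def dropout_layer(x, W, keep_prob, rng):
    # Drop-out: compute the activations, then zero a random subset of them.
    h = np.maximum(x @ W, 0.0)
    mask = rng.random(h.shape) < keep_prob
    return h * mask

def dropconnect_layer(x, W, keep_prob, rng):
    # DropConnect: zero a random subset of individual weights instead,
    # then compute the activations from the thinned weight matrix.
    mask = rng.random(W.shape) < keep_prob
    return np.maximum(x @ (W * mask), 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # 4 examples, 3 inputs
W = rng.normal(size=(3, 5))   # 5 hidden units
h_dropout = dropout_layer(x, W, keep_prob=0.5, rng=rng)
h_dropconnect = dropconnect_layer(x, W, keep_prob=0.5, rng=rng)
```

With `keep_prob=1.0` both functions reduce to the plain layer, which is an easy sanity check.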

Maxout Networks

Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, Yoshua Bengio

One of the most impressive follow-ups on the drop-out work, this paper demonstrates how to combine drop-out with a maximum nonlinearity.
That's right. The only nonlinearity is the maximum over a group of hidden units.
I feel this is pretty innovative and the results speak for themselves.
The authors argue that the max non-linearity allows the network to learn a piecewise linear approximation of any convex activation function. Unfortunately, it is not really clear from the paper how much of the performance can be attributed to the max non-linearity, as there are no results without max-out.
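
A max-out unit is easy to write down. The sketch below is my own minimal NumPy version, not the authors' implementation; the array layout (`W` of shape inputs x units x pieces) is an assumption made for illustration. The toy example hand-sets two linear pieces so that a single unit reproduces the convex function |x|.

```python
import numpy as np

def maxout(x, W, b):
    # x: (n, d) inputs; W: (d, m, k) weights; b: (m, k) biases.
    # Each of the m units takes the max over its k linear pieces.
    z = np.einsum("nd,dmk->nmk", x, W) + b
    return z.max(axis=-1)

# One unit with two pieces, x and -x, gives max(x, -x) = |x|.
W_abs = np.array([[[1.0, -1.0]]])   # shape (1, 1, 2)
b_abs = np.zeros((1, 2))
x = np.array([[-2.0], [0.5], [3.0]])
y = maxout(x, W_abs, b_abs)
```

In a real network the pieces are learned, of course; the point is only that the max over linear functions is the whole non-linearity.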

On the importance of initialization and momentum in deep learning

Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton

This work investigates the relation between momentum and Nesterov's accelerated gradient. It argues that, together with the right initialization, learning with momentum can yield much better models.
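
The difference between the two updates fits in a few lines. This is a sketch under my own naming (`lr` for the learning rate, `mu` for the momentum coefficient, `grad` for a gradient function), using a toy quadratic rather than a neural network: classical momentum evaluates the gradient at the current parameters, while Nesterov's variant evaluates it after a partial step along the velocity.

```python
def momentum_step(theta, v, grad, lr, mu):
    # Classical momentum: gradient at the current point.
    v = mu * v - lr * grad(theta)
    return theta + v, v

def nesterov_step(theta, v, grad, lr, mu):
    # Nesterov: gradient at the look-ahead point theta + mu * v.
    v = mu * v - lr * grad(theta + mu * v)
    return theta + v, v

def grad(theta):
    # Gradient of the toy objective f(theta) = 0.5 * theta ** 2.
    return theta

theta_cm, v_cm = 1.0, 0.0
theta_nag, v_nag = 1.0, 0.0
for _ in range(2):
    theta_cm, v_cm = momentum_step(theta_cm, v_cm, grad, lr=0.1, mu=0.9)
    theta_nag, v_nag = nesterov_step(theta_nag, v_nag, grad, lr=0.1, mu=0.9)
```

The first step is identical because the velocity starts at zero; from the second step on, the two trajectories differ.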

Kernel Approximation and Feature Extraction

Alex Gittens, Michael Mahoney
This work compares sampling-based and projection-based methods for low-rank approximations. I haven't looked into the details yet, but I'm a big fan of the Nystroem method for kernel approximations, so I will definitely see what's in there.
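
For readers who haven't seen it, here is a minimal NumPy sketch of the basic Nystroem idea with uniformly sampled landmark points. It is the textbook version, not any of the specific schemes compared in the paper, and the RBF kernel, `gamma` and the function names are my own choices for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Pairwise RBF kernel exp(-gamma * ||a - b||^2).
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystroem(X, n_landmarks, gamma, rng):
    # Sample landmarks uniformly without replacement.
    idx = rng.choice(len(X), size=n_landmarks, replace=False)
    C = rbf_kernel(X, X[idx], gamma)   # (n, m): all points vs. landmarks
    W = C[idx]                         # (m, m): landmarks vs. landmarks
    # Low-rank approximation K ~ C W^+ C^T.
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K_exact = rbf_kernel(X, X, gamma=1.0)
K_approx = nystroem(X, n_landmarks=10, gamma=1.0, rng=rng)
rel_error = np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact)
```

Using all points as landmarks recovers the exact kernel matrix; with fewer landmarks you trade accuracy for only ever computing an n x m slice of it.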

Krishnakumar Balasubramanian, Kai Yu, Guy Lebanon
The authors propose a new sparse coding framework using non-parametric kernel smoothing. They provide generalization bounds for sparse dictionary learning and demonstrate benefits compared to standard sparse coding and Locally Linear Coding.

Structured Prediction

Learning Convex QP Relaxations for Structured Prediction

Jeremy Jancsary, Sebastian Nowozin, Carsten Rother
This is quite exciting work by the folks from MSRC whom I met during my internship. They propose to use a QP relaxation for learning structured prediction. Basically, they parametrize the problem in a way that inference via the QP relaxation is always convex, and learn this restricted family. I have only skimmed it so far ;)

This is a continuation of the authors' work on dense random fields for semantic image segmentation. It is another example of "learning for inference". In their previous work, it was shown that mean-field inference can be implemented efficiently by convolutions in certain cases. Here, the authors show how it is possible to directly minimize the loss of the prediction produced by mean-field inference.

There are several more papers on optimization for inference and/or learning, but I can't possibly list them all. There are also some interesting theory papers, for example on random forests.
Also, I want to mention a paper by a friend, Cho: Simple Sparsification Improves Sparse Denoising Autoencoders in Denoising Highly Corrupted, where he matches state-of-the-art denoising algorithms using auto-encoders.

That should be enough, otherwise you could just look at the proceedings ;)


  1. Thanks, Andy for mentioning my paper :)

    It seems like I'll have to go over the list of papers at ICML 2013 again. Although I was there myself, the sheer number of talks and posters was a bit too overwhelming.

    1. It is really overwhelming. I wish I was there.
      I went over the proceedings three times and still found new interesting stuff I had overlooked in the first two reads.

  2. The last word of Cho's paper title is missing, though

  3. I've referenced the Gittens et al. paper multiple times in my dissertation. Great work.