This is not really my area so while I really liked his very enthusiastic talk, I won't say much about it ;)
What I liked about it most was how it was motivated. From a computer vision perspective, I felt human action recognition was somewhat peripheral to current research - surely with many interesting applications but not central to the field.
However, Ivan Laptev had two arguments for human action recognition being a centerpiece of visual understanding:
- Most of the data out there - in particular videos - show people:
35% of the pixels in TV and movies belong to people, and 40% on YouTube.
Laptev concludes that video analysis is human action analysis.
- The semantics of objects can often be inferred from humans interacting with them. Instead of the classical "chair" example, Laptev showed a "luggage train":
the thing in the airport you pick up your luggage from. Even if it's your first
time at an airport and you never heard the word "luggage train" before, it is
easy for you to grasp the concept once you see other people interact with it.
I found this a very interesting point.
Laptev talked mainly about the history of action recognition, different tasks and the difficulties they have collecting datasets like "Hollywood 2".
He also explained two algorithms from his group.
The first was conceptually very simple: Use bag of words on HoG-like features in spacetime.
By spacetime I mean the 3D space consisting of the image pixels and the video frames. Gradient histograms can easily be extended to this grid (though one has to take care of the scale in the time direction). One can use spacetime Harris corners as interest points but again dense sampling
seems to work better.
Apparently the results of this approach are very good, which is pretty sweet given that they didn't really have to do anything problem-specific.
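To make the first method a bit more concrete, here is a minimal sketch of what "bag of words on HoG-like features in spacetime" could look like with dense sampling. This is my own toy version, not the code from Laptev's group: the function names, the cell size, the `time_scale` parameter (a crude way to handle the scale in the time direction) and the tiny k-means used for the codebook are all assumptions made for illustration.

```python
import numpy as np

def spacetime_descriptors(video, cell=(4, 8, 8), n_bins=8, time_scale=1.0):
    """Dense HoG-like descriptors on a (T, H, W) grayscale video volume.

    Hypothetical helper: one orientation histogram per non-overlapping
    spacetime cell, weighted by the 3D gradient magnitude.
    """
    video = np.asarray(video, dtype=float)
    gt, gy, gx = np.gradient(video)   # derivatives along time, y and x
    gt = time_scale * gt              # assumed rescaling of the time axis
    magnitude = np.sqrt(gx**2 + gy**2 + gt**2)
    orientation = np.arctan2(gy, gx) % (2 * np.pi)
    bins = (orientation / (2 * np.pi) * n_bins).astype(int)
    bins = np.minimum(bins, n_bins - 1)   # guard against rounding up to n_bins

    ct, cy, cx = cell
    n_t, n_y, n_x = video.shape
    descriptors = []
    for t in range(0, n_t - ct + 1, ct):
        for y in range(0, n_y - cy + 1, cy):
            for x in range(0, n_x - cx + 1, cx):
                b = bins[t:t + ct, y:y + cy, x:x + cx].ravel()
                m = magnitude[t:t + ct, y:y + cy, x:x + cx].ravel()
                hist = np.bincount(b, weights=m, minlength=n_bins)
                norm = np.linalg.norm(hist)
                descriptors.append(hist / norm if norm > 0 else hist)
    return np.array(descriptors)

def assign(descriptors, centers):
    """Index of the nearest codebook entry for every descriptor."""
    dist = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def build_codebook(descriptors, k=16, n_iter=10, seed=0):
    """Very small k-means (Lloyd iterations) to get k visual words."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=k, replace=False)
    centers = descriptors[start]      # fancy indexing returns a copy
    for _ in range(n_iter):
        labels = assign(descriptors, centers)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

def bag_of_words(descriptors, centers):
    """Normalized histogram of visual-word counts for one video."""
    labels = assign(descriptors, centers)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Toy usage on random data standing in for a real clip.
video = np.random.default_rng(0).random((8, 32, 32))
desc = spacetime_descriptors(video)
codebook = build_codebook(desc, k=4)
video_feature = bag_of_words(desc, codebook)
```

In a real pipeline the codebook would be learned on descriptors from many training videos, and the resulting histograms would go into a standard classifier.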
The other method was more specific to videos. There is a task where an action is captured by cameras from different views and one has to transfer a model learned on one view to another view.
The views in training and testing are disjoint and might be very different, for example front and overhead views of a person.
One intuitive approach would be to use a stick-figure model of the person and try to infer the 3D position of the limbs. Laptev's group wanted to do something less hand-designed and more robust than this. What they came up with is temporal self-similarity. This is basically a distance matrix between tracked patches (or the whole image) over time.
The idea behind this temporal self-similarity is that things that are similar in 3D over time remain similar in any 2D projection, while things that are not similar in 3D are probably not similar in a 2D projection.
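As I understood it, the self-similarity matrix itself is easy to compute. The following is a minimal sketch, assuming each frame (or tracked patch) has already been turned into a feature vector and that plain Euclidean distance is used; the function name and the choice of distance are my assumptions, not necessarily what Laptev's group does.

```python
import numpy as np

def self_similarity_matrix(features):
    """Pairwise distances between per-frame features.

    features: array of shape (T, ...) with one entry per frame, e.g.
    flattened images or descriptors of a tracked patch (assumed input).
    Returns a (T, T) matrix whose entry (i, j) is the Euclidean distance
    between frame i and frame j.
    """
    features = np.asarray(features, dtype=float)
    features = features.reshape(len(features), -1)   # flatten each frame
    diff = features[:, None, :] - features[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Toy usage: a point moving on a circle with a period of 10 frames.
t = np.arange(20)
trajectory = np.stack([np.cos(2 * np.pi * t / 10),
                       np.sin(2 * np.pi * t / 10)], axis=1)
ssm = self_similarity_matrix(trajectory)
```

For a periodic motion like the toy one, frames one period apart have a distance close to zero, so the matrix shows stripes parallel to the diagonal. The matrix (or descriptors computed from it) is then the feature that gets compared across views. Note that this broadcasted version needs memory proportional to T x T x D, so for whole images one would compute it row by row.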
While I feel the task to be solved is somewhat artificial, I like the idea of using self-similarity as a feature. I wonder if that could be used somewhere else....