Sunday, July 31, 2011

[CVML] Martial Hebert: Using Geometric information in recognition and scene analysis

The last talk of the CVML summer school (yes, I'm starting from the back ;) was by Martial Hebert about using scene geometry for recognition and segmentation.
His work focuses mainly on pictures of man-made environments but is not exclusive to them. Hebert's talk was two hours long and spanned much of his past and recent work, not all of which I can repeat here. I will focus on some main messages and things I took from his presentation.

One of the first works Hebert talked about was classifying regions of an image into "surface orientation" categories. These categories are roughly "ground plane", "sky", "vertical facing right", "vertical facing left" and "vertical facing camera". I had seen several works in this direction but never found them very interesting. "Why this task?" is what I was always asking myself - I never quite understood the motivation.

In his talk, Hebert made the motivation very clear: This is not really the task we are interested in. It is a subtask that may help.
For example, look at object class segmentation. A widely used approach is CRFs with pair-wise potentials to enforce smoothness. Usually smoothness is enforced between neighbouring regions in an image. But it makes a lot more sense to enforce smoothness between neighbouring objects in the 3D world.

If it is possible to obtain information about the spatial layout of regions, it might be possible to reason better about which regions should semantically be grouped together.
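
To make the difference between image-space and world-space smoothness concrete, here is a minimal sketch. It is not Hebert's model: the Potts penalty, the three regions, their costs and both edge lists are made up for illustration, and the 3D adjacency is simply assumed to be known.

```python
def potts_energy(labels, unary, edges, weight=1.0):
    """Sum of per-region costs plus a fixed penalty for every edge
    whose two regions carry different labels (a Potts pairwise term)."""
    data_term = sum(unary[i][labels[i]] for i in range(len(labels)))
    smooth_term = sum(weight for i, j in edges if labels[i] != labels[j])
    return data_term + smooth_term


# Hypothetical scene with three regions and two classes.
# Region 1 is a distant wall that merely touches regions 0 and 2 in the
# image; regions 0 and 2 are parts of the same object in the 3D world.
unary = [[0.0, 1.0], [1.0, 0.0], [0.4, 0.6]]  # cost of class 0 / class 1
image_edges = [(0, 1), (1, 2)]  # neighbours in the 2D image
world_edges = [(0, 2)]          # neighbours in the 3D scene (assumed given)

labels = [0, 1, 0]
print("image-space energy:", potts_energy(labels, unary, image_edges))
print("world-space energy:", potts_energy(labels, unary, world_edges))
```

With image-space edges, the labelling is penalised twice for the wall differing from the object in front of it. With world-space edges, only the two pieces of the same object are asked to agree, so the same labelling pays no smoothness penalty.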
To make reasoning like this possible, Hebert stressed three points:
  • The problem of scene interpretation is hard - too hard to solve in one go.
  • Different clues can help improve each other, iteratively constructing a consistent interpretation.
  • Hard decisions should be postponed to the end of the reasoning process.
For example, the output of the surface orientation algorithm shouldn't be a hard assignment but rather a confidence map. This can then be used as an additional clue for another algorithm like occlusion detection.

Detected occlusion can then be used to refine the surface orientation maps.
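
As a toy illustration of two cues refining each other without committing early, consider the sketch below. The update rules are my own invention for this example and are not taken from the talk or the papers: each region has a confidence of being "vertical", each boundary has a confidence of being an occlusion, and the names `refine`, `rate` and `rounds` are hypothetical.

```python
def refine(surface, edges, occlusion, rounds=5, rate=0.5):
    """Alternately update soft surface and occlusion confidences.
    Nothing is thresholded inside the loop."""
    surface = list(surface)
    occlusion = list(occlusion)
    for _ in range(rounds):
        # Occlusion cue: a boundary looks more like an occlusion when the
        # two regions disagree about their orientation.
        for k, (i, j) in enumerate(edges):
            evidence = abs(surface[i] - surface[j])
            occlusion[k] = (1 - rate) * occlusion[k] + rate * evidence
        # Surface cue: pull regions together across boundaries that are
        # probably not occlusions (assumes few edges per region, so the
        # values stay within [0, 1]).
        updated = list(surface)
        for k, (i, j) in enumerate(edges):
            coupling = rate * (1 - occlusion[k])
            updated[i] += coupling * (surface[j] - surface[i]) / 2
            updated[j] += coupling * (surface[i] - surface[j]) / 2
        surface = updated
    return surface, occlusion


# Three regions in a row; confidence that each one is "vertical".
surface0 = [0.9, 0.8, 0.1]
edges = [(0, 1), (1, 2)]
occlusion0 = [0.5, 0.5]  # no initial opinion about either boundary

surface, occlusion = refine(surface0, edges, occlusion0)
hard_labels = [s > 0.5 for s in surface]  # the only hard decision, at the end
print(surface, occlusion, hard_labels)
```

The point is only the structure: both outputs stay soft, each one feeds the other, and the thresholding happens once after the loop.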

Other possible sub-tasks are viewpoint estimation and object detection.

All of these subtasks can help to refine each other, leading to a consistent and certain interpretation.
Hebert emphasised:
  • Never commit to a single segmentation/quantization
  • Never commit to a single interpretation
These iterative algorithms produced quite interesting results on predicting occlusions and scene structure. I didn't read the papers, but on some of the images shown during the presentation the results were quite surprising.

In Hebert's view, the model assumptions about surface orientation and occlusion are still quite weak, so he created models that include stronger assumptions about 3D structure - for example that heavy things cannot rest on light things - and volumetric constraints, in particular for indoor scenes (where many objects can be approximated by boxes).

One thing about this work that you probably noticed is that it is a lot more model-driven than most of the work in object class segmentation - well, this is scene understanding, not just classification.
Hebert proposes to combine statistics (=machine learning) and reasoning (as in AI) to obtain better models for scenes.

I feel it is a good idea to make use of our knowledge about the world to understand scenes - but there are certain things I don't like so much about Hebert's approaches:
  • They make strong assumptions about the kind of images that are used.
    Most scenes are simple indoor or street scenes with a very clear layout and very orthogonal directions.
  • The way prior knowledge is included seems pretty ad-hoc to me. There are quite impressive results on predicting occlusions but what sub-tasks are considered and how they influence each other seemed a little arbitrary to me.
  • The models are very complex (Hebert often referred to lots of gory details that he kept from us) and I didn't see an overall design principle
    that may help in creating similar algorithms.
  • Lots of supervision is necessary. For example, the segment labels that need to be provided include surface direction, "weight", class and possibly volumetric information.
Somehow the strong supervision that goes beyond the task at hand reminds me of the "poselet" work of Malik's group.

Well, I still don't really know what to make of this work but at least now I understand the motivation ;)
