His work focuses mainly on pictures of man-made environments but is not exclusive to them. Hebert's talk was two hours long and covered much of his
past and recent work, which I cannot repeat in full here. I will focus on some of the
main messages and things I took away from his presentation.
One of the first works Hebert talked about was classifying regions of an image into "surface orientation" categories. These categories are roughly "ground plane", "sky", "vertical facing right", "vertical facing left", and "vertical facing camera". I had seen several works in this direction but never found them very interesting. "Why this task?" was what I kept asking myself - I never quite understood the motivation.
In his talk, Hebert made the motivation very clear: This is not really the task we are interested in. It is a subtask that may help.
For example, look at object class segmentation. A widely used approach is CRFs with pairwise potentials to enforce smoothness. Usually smoothness is enforced between neighbouring regions in an image, but it makes a lot more sense to enforce smoothness between neighbouring objects in the 3D world.
If it is possible to obtain information about the spatial layout of regions, it might be possible to reason better about which regions belong together semantically.
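To make the pairwise-smoothness idea concrete, here is a minimal sketch of the kind of energy such a CRF assigns to a labeling: a unary cost per pixel plus a Potts penalty whenever 4-connected neighbours disagree. The array shapes, the `beta` weight, and the toy data are my own illustrative choices, not taken from Hebert's work:

```python
import numpy as np

def crf_energy(unary, labels, beta=1.0):
    """Energy of a labeling under a grid CRF with Potts pairwise potentials.

    unary:  (H, W, K) array of per-pixel costs for each of K labels
    labels: (H, W) integer array, the proposed segmentation
    beta:   smoothness weight (illustrative parameter)
    """
    H, W = labels.shape
    # Unary term: cost of each pixel taking its assigned label.
    rows, cols = np.indices((H, W))
    energy = unary[rows, cols, labels].sum()
    # Pairwise Potts term: penalize label disagreement between
    # 4-connected neighbours (smoothness in image space).
    energy += beta * (labels[:, :-1] != labels[:, 1:]).sum()
    energy += beta * (labels[:-1, :] != labels[1:, :]).sum()
    return energy

# Toy example: 2 labels on a 3x3 image.
unary = np.zeros((3, 3, 2))
unary[:, :2, 1] = 5.0   # left two columns prefer label 0
unary[:, 2, 0] = 5.0    # right column prefers label 1
smooth = np.array([[0, 0, 1], [0, 0, 1], [0, 0, 1]])
print(crf_energy(unary, smooth))  # → 3.0 (only the one label boundary is penalized)
```

Minimizing this energy trades off per-pixel evidence against smoothness - but, as Hebert points out, the neighbourhood here is purely 2D image adjacency, which is exactly what 3D layout information could improve.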
To make reasoning like this possible, Hebert stressed three points:
- The problem of scene interpretation is hard - too hard to solve in one go.
- Different clues can help improve each other, iteratively constructing a consistent interpretation.
- Hard decisions should be postponed to the end of the reasoning process.
Occlusion detection is one such sub-task: detected occlusions can be used to refine the surface orientation maps.
Other possible sub-tasks are viewpoint estimation and object detection.
All of these sub-tasks can help refine each other, leading to a consistent and more certain interpretation.
In particular:
- Never commit to a single segmentation/quantization.
- Never commit to a single interpretation.
One line of work he showed jointly estimates occlusions and scene structure. I didn't read the papers, but on some of the images shown during the presentation the results were quite surprising.
In Hebert's view, the model assumptions about surface orientation and occlusion are still quite weak, so he created models including stronger assumptions about 3D structure - for example that heavy things
cannot rest on light things - and volumetric constraints, in particular for indoor scenes (where many objects can be approximated by boxes).
One thing you probably noticed about this work is that it is a lot more model-driven than most work in object class segmentation - but then, this is scene understanding, not just classification.
Hebert proposes to combine statistics (=machine learning) and reasoning (as in AI) to obtain better models for scenes.
I feel it is a good idea to make use of our knowledge about the world to understand scenes - but there are certain things I don't like so much about Hebert's approaches:
- They make strong assumptions about the kind of images used.
Most scenes are simple indoor or street scenes with a very clear layout and strongly orthogonal directions.
- The way prior knowledge is included seems pretty ad hoc to me. There are quite impressive results on predicting occlusions, but which sub-tasks are considered and how they influence each other seemed a little arbitrary to me.
- The models are very complex (Hebert often referred to lots of gory details that he kept from us) and I didn't see an overall design principle
that might help in creating similar algorithms.
- Lots of supervision is necessary. For example, the segment labels that need to be provided include surface orientation, "weight", class, and possibly volumetric information.
Well, I still don't really know what to make of this work, but at least now I understand the motivation ;)