There was an interesting talk by Jitendra Malik about "Rich Representations for Learning Visual Recognition" and thereafter a panel discussion with
Jitendra Malik, Yann LeCun, Geoff Hinton, Tomaso Poggio, Kai Yu, Yoshua Bengio and Andrew Ng.
Many "deep" topics were touched on, but there are a couple of ideas that I found most noteworthy.
The first is Malik's idea of "hyper supervision". This is the exact opposite of weak supervision: the training examples are labeled very precisely and with lots of extra information. This makes it possible to find more interesting intermediate representations, and it gives the learning algorithm more to work with.
In his introduction he said:
"Learning object recognition from bounding boxes is like learning language from a list of sentences."
If I understand his ideas correctly, he thinks that it is necessary to have additional cues - like 3D information, tracking and temporal consistency - to really learn the nature of objects.
Malik illustrated the idea using his work on poselets.
The other thing that caught my attention was that Yann LeCun, one of the founding fathers of neural network learning on images, said that he thinks it is necessary to make use of more structured models, "not only linear combinations" of features, if one wants to make progress on image tasks.
One idea that found general agreement was that deep learning has focused too much on classification up till now. While most work in computer vision today is actually in multiple instance learning and segmentation, deep learning still concentrates on scene classification with few classes and many examples.
See also my post on the poster by Coates, Lee and Ng on a similar theme.