Tuesday, October 25, 2011

Random Ramblings on ImageNet

After looking through ImageNet for a little while now, I found some things that I did not really expect. So here are some properties of ImageNet that I found interesting (even though some of them might be obvious).
But first, a quick recap on what ImageNet is:
It's a hand-annotated dataset consisting of 10 million images in 10 thousand object classes. The images were collected using search engines and Flickr. Classes correspond to "synsets" in WordNet. A synset is a collection of semantically equivalent nouns.
For example, there is a synset called 'n04037443' (this is the WNID, the WordNet ID that ImageNet uses to name synsets), which corresponds to the nouns 'racer, race car, racing car' and is described as 'a fast car that competes in races'.
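As a small illustration, an identifier like 'n04037443' can be split into its two parts. This sketch assumes the usual WordNet convention of a part-of-speech letter ('n' for noun) followed by an 8-digit offset; the function name `parse_synset_id` is made up for this example.

```python
import re


def parse_synset_id(synset_id):
    """Split an ID such as 'n04037443' into (part_of_speech, offset).

    Assumption: the ID is one lowercase letter followed by exactly 8 digits.
    """
    match = re.fullmatch(r"([a-z])([0-9]{8})", synset_id)
    if match is None:
        raise ValueError("unexpected synset id: %r" % synset_id)
    return match.group(1), int(match.group(2))


print(parse_synset_id("n04037443"))  # ('n', 4037443)
```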

The synsets in WordNet have an additional hierarchical structure, given by a directed graph. Moving down the graph leads from more general concepts to more specific ones. For example, 'mammal' is above 'canine', which has many kinds of dogs as children.
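To make the "general above specific" idea concrete, here is a minimal sketch with a toy hierarchy. The node names and the single-parent `PARENT` map are my own simplification; in the real graph a synset can have more than one parent.

```python
# Toy "child -> parent" map. Names are illustrative only, and every node
# has at most one parent here, which is a simplification.
PARENT = {
    "canine": "mammal",
    "dog": "canine",
    "wolf": "canine",
    "poodle": "dog",
}


def ancestors(node, parent=PARENT):
    """Return the more general concepts above `node`, nearest first."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


print(ancestors("poodle"))  # ['dog', 'canine', 'mammal']
```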

The hierarchy is very fine-grained, in particular for plant and animal species, and many categories are hard for a human non-expert to distinguish.

When I talk about ImageNet in the following, I'm talking about the dataset used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2011), which is run alongside the Pascal VOC competition. It is basically a subset of the whole ImageNet, consisting of 1.2M training images in 1000 categories and 50K validation images.

  1. One-sided labeling: If an image belongs to a certain category, it means an object of that category is contained in the image. It does not mean that it is the prominent object, and other object classes might be present.
    For example, most images in the 'n01440764' synset, which is "tench, Tinca tinca", apparently some kind of game fish, show humans holding the fish. Still, the image is labeled as "tench" and a bounding box only exists for the fish.
  2. Many classes are indistinguishable from others by non-experts.
    This is true for many animal classes. But a bit more surprising to me was that I was not able to recognize cabs as such. Depending on the country they might look quite different to what you're used to.
  3. Label noise: There are many mislabeled positive examples. It's hard to say exactly how many, but when clicking through the images, I found some.
  4. Nodes in the tree do not represent "is a" relationships.
    This I found rather surprising. When I scanned all the nodes below "car", I was expecting to find images of cars. One of the child nodes is "ambulance", but ambulances can also be helicopters. So there is a significant number of helicopters below the "car" node.
    There was also a cab that was more of a trike.
  5. Some classes are based on text on the objects. Again, I looked at "ambulance" and "taxi". There are partial views that show just a small part of the object, where it is impossible to tell whether the object in the image is a car or a fridge. It does have "electric taxi" written on it, though.
As you might have noticed, I looked at a lot of cars today. Maybe I'll look at birds tomorrow and tell you more about them ;)
Anyway, I am excited to hear how the people in the competition handled the one-sided label problem, and I hope my comments gave some insight into this dataset.
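One possible way to be lenient about one-sided labels when scoring a classifier is to allow several guesses per image and count the image as correct if its single label is among them. The sketch below only illustrates that idea with made-up labels and a hypothetical `top_k_error` function; it is not taken from the competition's evaluation code.

```python
def top_k_error(true_labels, ranked_guesses, k=5):
    """Fraction of images whose single label is missing from the first k guesses."""
    if not true_labels or len(true_labels) != len(ranked_guesses):
        raise ValueError("need a non-empty list with one list of guesses per image")
    misses = sum(
        1
        for label, guesses in zip(true_labels, ranked_guesses)
        if label not in guesses[:k]
    )
    return misses / len(true_labels)


# Hypothetical example: the first image shows a person holding a tench,
# so a classifier might rank "person" above the labeled class.
truth = ["tench", "cab"]
guesses = [["person", "tench", "reel"], ["ambulance", "racer", "minivan"]]
print(top_k_error(truth, guesses, k=1))  # 1.0
print(top_k_error(truth, guesses, k=3))  # 0.5
```

With a single guess, the image of the person holding the fish counts as an error even though "person" is arguably correct; with three guesses it no longer does.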

My description of the hierarchy above is a bit vague. The readme of the ImageNet challenge reads as follows:

There are three types of image data for this competition: training
data from ImageNet (TRAINING), validation data specific to this
competition (VALIDATION), and test data specific to this competition
(TEST).  There is no overlap in the three sources of data: TRAINING,
VALIDATION, and TEST.  All three sets of data contain images of 1000
categories of objects.  The categories correspond 1-1 to a set of 1000
synsets (sets of synonymous nouns) in WordNet.  An image is in a
particular category X, where X is a noun synset, if the image contains
an X. See [1] for more details of the collection and
labeling strategy.

The 1000 synsets are selected such that there is no overlap between
synsets, for any synsets i and j, i is not an ancestor of j in the
WordNet hierarchy. We call these synsets "low level synsets".

Those 1000 synsets are part of the larger ImageNet hierarchy and we
can consider the subset of ImageNet containing the 1000 low level
synsets and all of their ancestors. There are 905 such ancestor
synsets, which we refer to as "high level synsets". In this hierarchy,
all the low level synsets are "leaf" nodes and the high level synsets
are "internal" nodes.

Note that the low level synsets may have children in ImageNet, but for
ILSVRC 2011 we do not consider their child subcategories. The
hierarchy here can be thought of as a "trimmed" version of the
complete ImageNet hierarchy.

Also note that for this competition, all ground truth labels are low
level synsets and entries must predict labels corresponding to one of
the 1000 low level synsets.  Predicting high level synsets is not
considered. There are no additional training images for high level
synsets.
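The relationship between low level and high level synsets described in the readme can be sketched on a toy hierarchy. Everything here is an assumption for illustration: the `PARENT` map, the node names, the helper functions, and the single-parent structure are mine, not part of the challenge data.

```python
# Toy "child -> parent" map standing in for the trimmed hierarchy.
PARENT = {
    "racer": "car",
    "cab": "car",
    "ambulance": "car",
    "car": "vehicle",
    "tench": "fish",
}
LOW_LEVEL = {"racer", "cab", "ambulance", "tench"}


def ancestors(node, parent):
    """Return the set of all nodes above `node`."""
    found = set()
    while node in parent:
        node = parent[node]
        found.add(node)
    return found


def high_level_synsets(low_level, parent):
    """Union of the ancestors of all low level synsets ("internal" nodes)."""
    result = set()
    for synset in low_level:
        result |= ancestors(synset, parent)
    return result


def has_overlap(low_level, parent):
    """True if some low level synset is an ancestor of another one."""
    return any(ancestors(s, parent) & low_level for s in low_level)


print(sorted(high_level_synsets(LOW_LEVEL, PARENT)))  # ['car', 'fish', 'vehicle']
print(has_overlap(LOW_LEVEL, PARENT))                 # False
print(has_overlap(LOW_LEVEL | {"car"}, PARENT))       # True
```

Adding "car" to the low level set breaks the "no overlap" condition, because "car" is an ancestor of "racer", "cab" and "ambulance".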


PS: I found a picture that's not a natural image but a painting:
image 10119 in synset n02071294 ^^


  1. A quick question. Does an image in the training data have multiple labels? If so, is there information to find out all labels of an image? I assume some kind of image ID is required to do this. I would like to know if it is available.

  2. Actually, each image has only a single training label. This is supposed to be the "dominant" object in the image. This is true for many classes, for example tools that are usually segmented.
    For the fish I mentioned, the image is only labeled as this fish and not as also containing a human.