We used ImageNet as a training
set to classify images from Wikipedia.
In a first proof-of-concept experiment, described in our short paper at the Vision and Language Workshop 2012, we used the subset of synsets from the ILSVRC2010 challenge, which contains a total of 1,043,415 images in 1,000 categories.
As a baseline, we used a bag-of-visual-words approach. We took the implementation provided by ImageNet, which extracts dense SIFT features at multiple scales and pools them by hard voting against a visual vocabulary built with k-means (K=1,000). Each image is therefore represented by a 1,000-dimensional feature vector.
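The pooling step above can be sketched as follows. This is a minimal NumPy illustration, not the ImageNet implementation itself: it assumes descriptors are already extracted (real dense SIFT gives 128-dimensional descriptors), uses a toy k-means in place of the actual vocabulary-building code, and omits the multi-scale sampling.

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=10, seed=0):
    """Toy k-means: cluster local descriptors into k visual words.
    (Stand-in for the real vocabulary construction with K=1,000.)"""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    for _ in range(iters):
        # Hard-assign each descriptor to its nearest center.
        dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bovw_histogram(descriptors, centers):
    """Hard-voting pooling: each descriptor votes for its nearest visual
    word, producing one k-dimensional feature vector per image."""
    dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    votes = dists.argmin(axis=1)
    hist = np.bincount(votes, minlength=len(centers)).astype(float)
    return hist / hist.sum()  # L1-normalise so images are comparable
```

With K=1,000 words, `bovw_histogram` yields exactly the 1,000-dimensional feature vector described above; the toy sizes here are only for illustration.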
We tried three different classifiers:
Please follow this link for classification results using the four methods above. Note that the images are clickable: they link to the original Wikipedia page for the test images or to the detected synset. The images shown in the test results are only representative of their synset; they are not the nearest images to the test image.
This link shows the six nearest synsets for each image and, for each synset, the nearest image in the training set.
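One way the nearest-synset ranking could be computed is sketched below. This is an assumption, not the page's documented method: it supposes each synset is summarised by the mean BoVW histogram of its training images and ranks synsets by Euclidean distance to the test image's histogram (the actual distance measure is not specified above).

```python
import numpy as np

def nearest_synsets(query_hist, synset_mean_hists, n=6):
    """Return indices of the n synsets whose mean training histogram is
    closest (Euclidean) to the query image's BoVW histogram.
    Assumed metric; the page does not state which distance was used."""
    dists = np.linalg.norm(synset_mean_hists - query_hist, axis=1)
    return np.argsort(dists)[:n]
```

For the per-synset nearest training image, the same distance would simply be applied over the individual training histograms within each returned synset.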