Team 6: Visual words: Text analysis concepts for computer vision

Wednesday, August 5, 2009 - 11:40am - 12:00pm
EE/CS 3-180
Brendt Wohlberg (Los Alamos National Laboratory)


[Figure: Illustration of SIFT features computed using the VLFeat library]

Project Description:

Large collections of image and video data are becoming
increasingly common in a diverse range of applications, including
consumer multimedia (e.g. flickr and YouTube), satellite imagery,
video surveillance, and medical imaging. One of the most significant
problems in exploiting such collections is the retrieval of relevant
content, since the collections are often of sufficient size to make a
manual search impossible. These problems are addressed in computer
vision research areas such as content-based image retrieval,
automatic image tagging, semantic video indexing, and
object detection. A sample of the exciting work being done in these
areas can be obtained by visiting the websites of leading research
groups such as Caltech Computational Vision, the Carnegie Mellon
Advanced Multimedia Processing Lab, LEAR, MIT CSAIL Vision Research,
the Oxford Visual Geometry Group, and WILLOW.

One of the most promising ideas in this area is that of visual
words, constructed by quantizing invariant image features such as
those generated by SIFT. These visual word representations allow
text document analysis techniques (Latent Semantic Analysis, for
example) to be applied to computer vision problems. An interesting
example is the use of Probabilistic Latent Semantic Analysis or
Latent Dirichlet Allocation to learn to recognize categories of
objects (e.g. car, person, tree) within an image, using a training
set that is only labeled to indicate the object categories present
in each image, with no indication of the location of the objects in
the image. In this project we will explore the concept of visual
words, understand their properties and relationship with text words,
and consider interesting extensions and new applications.
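The basic pipeline can be sketched in a few lines: cluster a pool of local descriptors to form a visual vocabulary, then represent each image as a histogram over those visual words. The following is a minimal NumPy sketch; it uses randomly generated vectors in place of real 128-dimensional SIFT descriptors and a bare-bones k-means rather than a vision library, so it illustrates the idea only.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Minimal k-means: returns k cluster centroids of the rows of X."""
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned descriptors
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def bow_histogram(desc, vocab):
    """Quantize descriptors to visual words, return a normalized histogram."""
    d = np.linalg.norm(desc[:, None, :] - vocab[None, :, :], axis=2)
    words = d.argmin(axis=1)
    h = np.bincount(words, minlength=len(vocab)).astype(float)
    return h / h.sum()

# synthetic stand-ins for SIFT descriptors pooled from a training corpus
train_desc = rng.normal(size=(500, 128))
vocab = kmeans(train_desc, k=50)   # the "visual vocabulary"

# each image then becomes a length-50 distribution over visual words
img_desc = rng.normal(size=(80, 128))
hist = bow_histogram(img_desc, vocab)
```

The resulting histograms play the role of term-frequency vectors in a term-document matrix, which is what allows LSA-style methods to be applied unchanged.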


References:

[1] Lowe, David G., Distinctive Image Features from
Scale-Invariant Keypoints, International Journal of Computer
Vision, vol. 60, no. 2, pp. 91-110, 2004.
doi: 10.1023/b:visi.0000029664.99615.94

[2] Leung, Thomas K. and Malik, Jitendra, Representing and
Recognizing the Visual Appearance of Materials using
Three-dimensional Textons, International Journal of Computer
Vision, vol. 43, no. 1, pp. 29-44, 2001.
doi: 10.1023/a:1011126920638

[3] Liu, David and Chen, Tsuhan, DISCOV: A Framework for
Discovering Objects in Video, IEEE Transactions on Multimedia,
vol. 10, no. 2, pp. 200-208, 2008.
doi: 10.1109/tmm.2007.911781

[4] Fergus, Rob, Perona, Pietro and Zisserman, Andrew, Weakly
Supervised Scale-Invariant Learning of Models for Visual
Recognition, International Journal of Computer Vision, vol. 71,
no. 3, pp. 273-303, 2007.
doi: 10.1007/s11263-006-8707-x

[5] Philbin, James, Chum, Ondřej, Isard, Michael, Sivic, Josef
and Zisserman, Andrew, Object retrieval with large vocabularies
and fast spatial matching, Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2007.
doi: 10.1109/CVPR.2007.383172

[6] Yang, Jun, Jiang, Yu-Gang, Hauptmann, Alexander G. and Ngo,
Chong-Wah, Evaluating bag-of-visual-words representations in
scene classification, Proceedings of the international workshop
on multimedia information retrieval (MIR '07), pp. 197-206, 2007.
doi: 10.1145/1290082.1290111

[7] Yuan, Junsong, Wu, Ying and Yang, Ming, Discovery of
Collocation Patterns: from Visual Words to Visual Phrases, IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1-8,
2007. doi: 10.1109/cvpr.2007.383222

[8] Fergus, Rob, Fei-Fei, Li, Perona, Pietro and Zisserman,
Andrew, Learning object categories from Google's image search,
IEEE International Conference on Computer Vision (ICCV), vol. 2,
pp. 1816-1823, 2005. doi: 10.1109/iccv.2005.142

[9] Quelhas, P., Monay, F., Odobez, J.-M., Gatica-Perez, D.,
Tuytelaars, T. and Van Gool, L., Modeling scenes with local
descriptors and latent aspects, IEEE International Conference on
Computer Vision, pp. 883-890, 2005.

[10] Sivic, Josef, Russell, Bryan C., Efros, Alexei A.,
Zisserman, Andrew and Freeman, William T., Discovering objects
and their location in images, IEEE International Conference on
Computer Vision, pp. 370-377, 2005. doi: 10.1109/iccv.2005.77


Prerequisites:

A strong computational background is essential, preferably with
significant experience in Matlab programming. (While experience
with other programming languages such as C, C++, or Python may be
useful, Matlab is likely to be the common language when individual
team members' contributions need to be integrated into a joint code.)

Some background in areas such as image/signal processing,
optimization theory, or statistical inference would be highly
desirable.
