May 22 - 26, 2006
We address the problem of human motion recognition in this paper. The goal
of human motion recognition is to recognize the type of motion recorded in a
video clip, which consists of a set of temporally ordered frames. By
defining a Mercer kernel between two video clips directly, we propose in
this paper a recognition strategy that can incorporate both the information
of each individual frame and the temporal ordering between frames. Combining
the proposed kernel with the support vector machine, which is one of the
most effective classification paradigms, the resulting recognition strategy
exhibits excellent performance over real data sets.
Joint work of Dongwei Cao, Osama T Masoud, Daniel Boley, and Nikolaos Papanikolopoulos
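As a rough illustration of such a clip-level kernel, the sketch below combines per-frame similarity with the temporal ordering of frames. It is a hypothetical construction in the spirit of the abstract, not the authors' exact kernel: the RBF frame kernel and the Gaussian weighting of relative temporal positions are illustrative choices.

```python
import numpy as np

def frame_kernel(x, y, gamma=0.5):
    """RBF kernel between two per-frame feature vectors (an illustrative choice)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def clip_kernel(X, Y, gamma=0.5, tau=0.1):
    """A kernel between two clips X (m x d) and Y (n x d): a weighted average
    of all pairwise frame kernels, where the weight favors pairs of frames
    at similar normalized temporal positions."""
    m, n = len(X), len(Y)
    total = 0.0
    for i in range(m):
        for j in range(n):
            # temporal weight: frames at similar relative positions count more
            dt = i / max(m - 1, 1) - j / max(n - 1, 1)
            total += np.exp(-dt ** 2 / tau) * frame_kernel(X[i], Y[j], gamma)
    return total / (m * n)
```

The Gram matrix of `clip_kernel` over the training clips can then be handed to a support vector machine with a precomputed kernel.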
Joint work with Björn Ommer.
This contribution proposes a compositional approach to visual
object categorization of scenes. Compositions are learned from the
CalTech 101 database and form intermediate abstractions of images that
are semantically situated between low-level representations and the
categorization. Salient regions, which are described by localized feature
histograms, are detected as image parts. Subsequently, compositions
are formed as bags of parts with a locality constraint. Image categorization
is finally achieved by applying coupled probabilistic kernel classifiers
to the bag of compositions representation of a scene. In contrast to the
discriminative training of the categorizer, intermediate compositions are
learned in a generative manner yielding relevant part agglomerations,
i.e. groupings which are frequently appearing in the dataset while
supporting the discrimination between sets of categories.
Consequently, compositionality simplifies the learning of a complex
model for complete scenes by splitting it up into simpler,
sharable compositions. The architecture is evaluated on the highly
challenging CalTech 101 database, which exhibits large intra-category variations.
Our compositional approach yields a significant enhancement over
a baseline model with the same feature representation but without
the intermediate compositions. It shows competitive retrieval rates in the range of 52.2±2.6% on the CalTech 101 database.
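A minimal sketch of forming compositions as bags of parts with a locality constraint: the fixed-radius grouping rule and the `form_compositions` helper below are hypothetical simplifications; the generatively learned groupings of the actual system are not reproduced here.

```python
import numpy as np

def form_compositions(positions, part_hists, radius=0.2):
    """Group each salient part with all parts within `radius` of it, and
    represent the composition by the normalized sum of the localized
    feature histograms of its members."""
    positions = np.asarray(positions, dtype=float)
    part_hists = np.asarray(part_hists, dtype=float)
    comps = []
    for p in positions:
        near = np.linalg.norm(positions - p, axis=1) <= radius
        h = part_hists[near].sum(axis=0)
        comps.append(h / h.sum())   # parts assumed to carry nonzero mass
    return np.array(comps)
```

The resulting bag of compositions per image is what the coupled probabilistic kernel classifiers would then operate on.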
Visual clues for the shapes of objects and their positions
involve the interaction of:
geometric features of the objects, the shade/shadow regions
on the objects (and specularity), and the (apparent) contours
resulting from viewer direction. Visual recognition often also
involves the use of small movements in the viewer direction to decide
among ambiguities in the local clues.
We give the classification of the local configurations
for the case of a fixed single light source and perfectly diffuse
objects (without specular effects) having generic local geometric
features (edges, creases, corners and marking curves).
The classification includes both the "stable views", in
which the configurations do not change under small viewer movement,
and the generic transitions in local configurations which a
viewer expects to see under larger viewer movement.
The classification of stable views reduces to an
"alphabet" of local curve configurations.
The classification is obtained by applying a rigorous
mathematical analysis using methods from singularity theory.
These results describe joint work of the presenter with
Peter Giblin and Gareth Haslinger.
In many applications the task is to reconstruct a segmented image
from indirectly measured data. Examples are medical applications,
e.g. creating tomographic images from electromagnetic or X-ray data,
or geophysical applications such as finding landmines or
characterizing a petroleum reservoir. We will focus here on the latter
application, even though our method is quite general. Typically,
in these applications first a pixel-based image is created using
standard reconstruction techniques. Then, segmentation methods are
applied to this image. The drawback of this two-step approach is that
most segmentation tools change the reconstructed images without taking
into account the original data. It would be desirable in practical
applications to construct a segmented image directly from the data. We
present here a novel reconstruction scheme which is able to achieve
this goal. It combines a level set technique for segmenting the images
during the reconstruction from the data and a pixel-based correction
scheme in each of the segmented regions. We show that this novel
technique is able to provide segmented images which minimize a least
squares data misfit cost functional.
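A minimal numerical sketch of the idea, assuming a linear forward operator `A` and a two-region image model: the smoothed Heaviside, the step sizes, and the restriction to two regions are simplifying assumptions for illustration, not the authors' full scheme.

```python
import numpy as np

def heaviside(phi, eps=1.0):
    # smoothed Heaviside function, standard in level-set methods
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))

def reconstruct(A, d, n, steps=300, lr=0.1, eps=1.0):
    """Fit a two-region image m = c1*H(phi) + c2*(1-H(phi)) to data d by
    gradient descent on the least-squares misfit ||A m - d||^2. The
    level-set function phi carries the segmentation; the region values
    c1, c2 play the role of the pixel-based correction in each region."""
    rng = np.random.RandomState(0)
    phi = 0.1 * rng.randn(n)
    c1, c2 = 1.0, 0.0
    for _ in range(steps):
        H = heaviside(phi, eps)
        m = c1 * H + c2 * (1.0 - H)
        g_m = A.T @ (A @ m - d)                     # gradient w.r.t. the image
        dH = eps / (np.pi * (eps ** 2 + phi ** 2))  # derivative of heaviside
        phi -= lr * g_m * (c1 - c2) * dH            # segmentation update
        c1 -= lr * np.dot(g_m, H) / max(H.sum(), 1e-9)
        c2 -= lr * np.dot(g_m, 1.0 - H) / max((1.0 - H).sum(), 1e-9)
    H = heaviside(phi, eps)
    return H > 0.5, c1 * H + c2 * (1.0 - H)         # segmentation, image
```

Both updates are descent directions for the same least-squares cost functional, so the segmentation and the region values are reconstructed jointly from the data rather than in two separate steps.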
This is a collaborative project of Universidad Carlos III de Madrid
and Repsol-YPF, Spain.
University partners: Rossmary Villegas, Oliver Dorn, Manuel
Kindelan, Miguel Moscoso (UC3M). Industrial partners:
Elena Izaguirre, Francisco Mustieles (Repsol-YPF, Spanish oil company).
Given a large dataset of images, we seek to automatically discover
visually similar object classes, together with their spatial extent
(segmentation). The hope is that we will be able to automatically
recover commonly occurring objects, such as cars, trees, buildings,
etc. Our approach is to first obtain multiple segmentations of each
image, and to make the assumption that each object instance is
correctly segmented by at least one segmentation. The problem is then
reduced to finding clusters of correctly segmented objects within this
large "segment soup," i.e. one of grouping in the space of candidate segments.
The main insight of the paper is that segments corresponding to
objects will be exactly the ones represented by coherent clusters,
whereas segments overlapping object boundaries will need to be
explained by a mixture of several clusters. To paraphrase Leo Tolstoy:
all good segments are alike, each bad segment is bad in its own way.
Joint work with Bryan Russell, Josef Sivic and Andrew Zisserman
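The "coherent cluster" intuition above can be sketched as follows; the tiny k-means and the `coherence` score are hypothetical stand-ins for the clustering actually used, chosen only to illustrate that good segments concentrate on one cluster while boundary-straddling segments spread over several.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Tiny k-means over segment descriptors (farthest-first init, Lloyd steps)."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def coherence(x, centers, beta=1.0):
    """How well a single cluster explains segment descriptor x: the largest
    posterior weight under a soft assignment to the cluster centers.
    Near 1: one coherent cluster; near 1/k: a mixture of clusters."""
    d = ((centers - x) ** 2).sum(-1)
    w = np.exp(-beta * (d - d.min()))   # subtract min for numerical stability
    return (w / w.sum()).max()
```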
In collaboration with Aleš Leonardis and Gregor Berginc.
With the growing interest in object categorization, various approaches
have emerged that perform well in this challenging task, yet
are inherently limited to only a moderate number of object classes.
In pursuit of a more general categorization system our framework
proposes a way to overcome the computational complexity
arising from the enormous number of different object categories
by exploiting the statistical properties of the highly structured
visual world. Our approach proposes a hierarchical acquisition
of generic parts of object structure, varying from simple to more
complex ones, which stem from the favorable statistics of
natural images. The parts recovered in the individual
layers of the hierarchy can be used in a top-down manner
resulting in a robust statistical engine that could be
efficiently used within many of the current
categorization systems. The proposed approach has been applied
to large image datasets yielding important statistical insights
into the generic parts of object structure.
We investigate the learning of the appearance of an object from a
single image of it. Instead of using a large number of pictures of
the object to recognize, we use a labeled reference database of
pictures of other objects to learn invariance to noise and
variations in pose and illumination. This acquired knowledge is then
used to predict if two pictures of new objects, which do not appear
on the training pictures, actually display the same object.
We propose a generic scheme called chopping to address this
task. It relies on hundreds of random binary splits of the training
set chosen to keep together the images of any given object. Those
splits are extended to the complete image space with a simple
learning algorithm. Given two images, the responses of the split
predictors are combined with a Bayesian rule into a posterior
probability of similarity.
Experiments with the COIL-100 database and with a database of 150
degraded LaTeX symbols compare our method to a classical learning
with several examples of the positive class and to a direct learning
of the similarity.
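A sketch of the chopping idea under simplifying assumptions: each split assigns every training class to one side at random, a least-squares linear predictor extends the split to the whole feature space, and agreements between split predictors are combined with Bayes' rule. The linear predictors and the assumed agreement probabilities `p_same`, `p_diff` are illustrative, not the authors' exact choices.

```python
import numpy as np

def train_chopping(X, y, n_splits=100, seed=0):
    """Build split predictors: each split gives every class a random side
    (+1 or -1); a least-squares linear model fits that binary labeling."""
    rng = np.random.RandomState(seed)
    classes = np.unique(y)
    Xb = np.hstack([X, np.ones((len(X), 1))])        # add a bias column
    W = []
    for _ in range(n_splits):
        side = {c: rng.randint(2) * 2 - 1 for c in classes}
        t = np.array([side[c] for c in y], dtype=float)
        w, *_ = np.linalg.lstsq(Xb, t, rcond=None)
        W.append(w)
    return np.array(W)

def similarity_posterior(W, x1, x2, p_same=0.9, p_diff=0.5, prior=0.5):
    """Bayes' rule over split agreements: under 'same object' a split agrees
    with probability p_same, under 'different' with p_diff (assumed values)."""
    f = lambda x: np.sign(W @ np.append(x, 1.0))
    agree = f(x1) == f(x2)
    n, k = len(agree), int(agree.sum())
    l_same = p_same ** k * (1 - p_same) ** (n - k)
    l_diff = p_diff ** k * (1 - p_diff) ** (n - k)
    return prior * l_same / (prior * l_same + (1 - prior) * l_diff)
```

Note that the two test images need not belong to any training class: only the split predictors, learned on the reference database, are reused.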
We propose an algorithmic solution for simultaneous detection of
landmarks in brain MRI. A landmark is a point of the image that
corresponds to a well-defined point in the anatomy and characterizes the
local geometry of the brain.
The location of the landmarks is identified with a unique deformation of
the underlying 3D space. We consider a set of non-rigid deformations
where the landmarks act as control points, extending the deformation to
the whole domain by spline interpolation.
We build a probabilistic model for the intensities of the MR image given
the landmark locations. We use a training set of hand-landmarked images
to estimate the parameters of the model. The resulting atlas is sharp in
the vicinity of the landmarks, where the deformation is given, and more
diffuse at greater distance from the control points.
In a new image, the landmark locations are estimated using a gradient
ascent algorithm on the likelihood function. It produces a partial
registration, which is more accurate in the vicinity of the landmarks.
We applied the algorithm to the localization of three anatomical landmarks.
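The extension of a deformation from landmark control points to the whole domain can be sketched with a radial-basis interpolant; the Gaussian RBF below is a stand-in for the spline interpolation of the abstract, and `extend_deformation` and its parameters are hypothetical.

```python
import numpy as np

def extend_deformation(landmarks, displacements, points, sigma=1.0):
    """Interpolate displacements known at landmark control points to
    arbitrary points, using Gaussian RBFs as the interpolation basis."""
    # kernel matrix between landmarks (small ridge for numerical stability)
    K = np.exp(-((landmarks[:, None] - landmarks[None]) ** 2).sum(-1)
               / (2 * sigma ** 2))
    coef = np.linalg.solve(K + 1e-8 * np.eye(len(landmarks)), displacements)
    # kernel between query points and landmarks
    Kp = np.exp(-((points[:, None] - landmarks[None]) ** 2).sum(-1)
                / (2 * sigma ** 2))
    return Kp @ coef
```

Moving the landmarks during gradient ascent thus deforms the whole domain smoothly, which is what makes the resulting registration most accurate near the landmarks.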
In vision, it is important to match regions or objects that are not
related by simple rigid or linear transformations. For example, looking
at a 3D surface from different viewpoints causes complex deformations in
its appearance. When the parts of objects articulate, their shape
changes non-linearly. When we compare different instances of objects
from the same class, their shape may vary in complex ways. To capture
deformations, we use a framework in which we embed images as 2D surfaces
in 3D. We then show that as we vary the parameters of this embedding,
geodesic distances on the surface become deformation-invariant. This
allows us to build deformation-invariant descriptors using geodesic
distance. For binary shapes, we develop related descriptors, using the
inner distance, which is invariant to articulations and captures the part
structure of objects. We evaluate these descriptors on a number of data
sets. In particular, the inner distance forms the basis of a shape
comparison method that we use to identify the species of plants from the
shape of their leaves. This is being used in a project, with the
Smithsonian Institute and Columbia University, to develop an Electronic
Field Guide that botanists can use to identify new species of plants.
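On a discrete binary mask, the inner distance can be sketched as a shortest path that stays inside the shape; the grid BFS below is a simplified illustration of the concept (the actual descriptors are built between sampled boundary points, not raw pixels).

```python
import numpy as np
from collections import deque

def inner_distance(mask, p, q):
    """Length of the shortest 4-connected path between p and q that stays
    inside the binary shape `mask`, found by breadth-first search. Unlike
    Euclidean distance, this path bends with articulated parts, which is
    what makes the inner distance articulation-invariant."""
    h, w = mask.shape
    dist = {p: 0}
    dq = deque([p])
    while dq:
        r, c = dq.popleft()
        if (r, c) == q:
            return dist[(r, c)]
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w and mask[nr, nc] \
                    and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                dq.append((nr, nc))
    return None  # q is not reachable inside the shape
```

For a U-shaped mask, the inner distance between the two arm tips is much larger than their Euclidean distance, capturing the part structure of the shape.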
We present a method for recognizing scene categories based on
approximate global geometric correspondence. This method works
by partitioning the image into increasingly fine sub-regions and
computing histograms of local features found inside each sub-region.
The resulting "spatial pyramid" is a simple and computationally
efficient extension of an orderless bag-of-features image
representation, and it shows significantly improved performance
on challenging scene categorization tasks. Specifically, our
proposed method exceeds most previously published methods on the
Caltech-101 database and achieves high accuracy on a large
database of fifteen natural scene categories. The spatial pyramid
framework also offers insights into the success of several recently
proposed image descriptions, including Torralba's "gist" and
Lowe's SIFT descriptors.
Hierarchical object recognition models employ a number of different
strategies to learn the feature detectors at the various levels:
fragment selection by mutual information in Ullman's model, fixed Gabor
wavelets and fragment selection in Poggio's model, supervised gradient
descent in LeCun's convolutional nets, layer-by-layer unsupervised
learning followed by supervised gradient descent in Hinton's
stacked restricted Boltzmann machine model.
We will present a new unsupervised algorithm for feature learning, and
compare classification performance obtained by training a
convolutional net in the conventional (supervised) way, with the same
convolutional net where the features have been initialized using the
new unsupervised method.
We present a novel unsupervised learning method for human action
categories. A video sequence is represented as a collection of
spatial-temporal words by extracting space-time interest points. The
algorithm learns the probability distributions of the spatial-temporal
words and intermediate topics corresponding to human action categories
automatically using a probabilistic Latent Semantic Analysis (pLSA)
model. The learned model is then used for human action categorization
and localization in a novel video. We test our algorithm on two
datasets: the KTH human action dataset and a recent dataset of figure
skating actions. Our results are on par or slightly better than the best
reported results. In addition, our algorithm can recognize and localize
multiple actions in long and complex video sequences containing multiple motions.
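A minimal pLSA fit by EM can be sketched as follows; this is the general model over a word-document count matrix, not the authors' implementation, and the documents here stand for video sequences while the words are the spatial-temporal codewords.

```python
import numpy as np

def plsa(N, k, iters=50, seed=0):
    """Fit pLSA by EM. N is a word x document count matrix, k the number of
    latent topics (action categories). Returns P(w|z) and P(z|d)."""
    rng = np.random.RandomState(seed)
    W, D = N.shape
    p_wz = rng.rand(W, k); p_wz /= p_wz.sum(0)
    p_zd = rng.rand(k, D); p_zd /= p_zd.sum(0)
    for _ in range(iters):
        # E-step: posterior P(z|w,d) for every (word, document) pair
        joint = p_wz[:, :, None] * p_zd[None, :, :]      # W x k x D
        p_zwd = joint / joint.sum(1, keepdims=True)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        nz = N[:, None, :] * p_zwd                       # W x k x D
        p_wz = nz.sum(2); p_wz /= p_wz.sum(0)
        p_zd = nz.sum(0); p_zd /= p_zd.sum(0)
    return p_wz, p_zd

def loglik(N, p_wz, p_zd):
    """Log-likelihood of the counts under the fitted model."""
    p = p_wz @ p_zd
    return float((N * np.log(p + 1e-12)).sum())
```

A new video is categorized by folding its word counts into the learned topics and picking the topic with the largest posterior weight.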
The classical model of visual processing in cortex is a hierarchy of
increasingly sophisticated representations, extending the Hubel and
Wiesel model of simple to complex cells in a natural way. Neuroscience
work in the Poggio Lab focuses on a computational model of object
recognition which is based on these principles and consistent with a
number of recent physiological and psychophysical experiments. The
goal of our work is to explain cognitive phenomena in terms of simple
and well-understood computational processes in a physiologically
plausible model. Examples of ongoing projects using our computational
model to motivate experiments and guide the analysis of experimental data include:
1) read-out and classification of IT neural data to
investigate the neural basis of object categorization (in
collaboration with the DiCarlo and Miller labs),
2) simulation of biophysically plausible mechanisms for the key operations
underlying object recognition in the model (in collaboration with
Christof Koch at Caltech and David Ferster at Northwestern),
3) analysis and prediction of neural data in intermediate visual areas,
4) comparison of model performance to humans as well as
state-of-the-art machine vision systems.
Joint work with Ulf Knoblich, Minjoon Kouh, Gabriel Kreiman, Ethan Meyers, and Thomas Serre.
Information about object images is transmitted from the primary visual
cortex to the inferotemporal (IT) cortex through multiple pre-striate
areas in macaque monkeys. To understand neural mechanisms for object
recognition, we investigate neuronal representation of object images in
area TE of macaque monkeys.
To achieve this goal, we combined single cellular recording techniques
and optical imaging techniques that enable us to visualize neural
activation at columnar levels. It is essential to use imaging techniques
because previously it has been shown that neurons in area TE respond to
geometrically less complex features rather than to more complex real object images.
Major conclusions are: (1) an object is represented by a combination of
cortical columns, each of which represents a visual feature, (2) a
specific combination for an object is made up of both active and inactive
feature columns, (3) the feature columns do not necessarily represent
spatially localized visual features, and (4) some visual features
represented by columns are more related to global features, such as
spatial arrangements of parts.
Despite several decades of research in the field of computer vision, there still exists no recognition system which is able to match the
visual performance of humans. The apparent ease with which visual tasks such as recognition and categorization are solved by humans is
testimony of a highly optimized visual system which not only exhibits excellent robustness and generalization capabilities but is in
addition highly flexible in learning and organizing new data. Using an integrative approach to the problem of object recognition we have
developed a framework that combines cognitive psychophysics, computer vision as well as machine learning. This framework is able to model
results from psychophysics and, in addition, delivers excellent recognition performance in computational recognition experiments.
Furthermore, the framework also interfaces well with advanced classification schemes from machine learning thus further broadening the
scope of application.
The framework is derived from and motivated by results from cognitive research - in particular on the dynamic aspect of recognition
processes, the contribution of which has up to now been largely neglected in theoretical modeling of recognition. Several recent
experiments in perceptual research have found the temporal dimension to play a large role in object recognition - both by being able to
mediate learning of object representations and by providing an integral part of the representation itself. In addition, we have conducted
experiments that shed light on how objects might be represented in the brain using local pictorial features and their spatial relations
thus forming a sparse and at the same time structured object representation. Based on these results, the computational implementation of
the framework provides spatiotemporal processing of visual input by means of a structured, appearance-based object representation.
We have shown that several experiments can successfully be modeled using a computational implementation of the framework, which
demonstrates the perceptual plausibility of spatio-temporal, local feature processing. In addition, based on the computational modeling
results, a number of performance predictions can be made that can be tested in further psychophysical experiments thus closing the loop
between experimental work and computational modeling.
In addition, we have demonstrated the benefits of using spatio-temporal information in several recognition experiments on both artificial
and real-world data in several computational recognition experiments (including experiments run on a robotic platform). In this context,
we have developed a novel method for combining spatio-temporal object representations based on local features and state-of-the-art
statistical learning methods (Support Vector Machines). Several recognition experiments have shown that by combining efficient
representations from computational vision with robust classification schemes from machine learning, excellent recognition performance can be achieved.
In this paper, we present an overview of the psychophysical and computational experiments and discuss the benefits and limitations of the proposed framework.
In a private vision quest meeting held in Maine many years ago,
Jennifer Mumford, wife of David Mumford, divided the audience into two groups.
People who faced the screen were asked to describe the contents of a
picture to those who looked away from the screen. The latter then
compared the pictures visualized in their minds based on what they heard
against the real picture on the screen. It was of no surprise that a
thousand English words were not enough for a picture. Simply put, we don't
have the right words to describe natural images yet.
If humans do not know in the first place, how do we program computers to
understand a large set of images?
Many people would argue, hey, shouldn't computers learn to understand
images by themselves in an unsupervised way, especially in a
discriminative task for image classification?
But, look, no parents in the world would send their kids to schools that
have no teachers, no generative syllabus to explain how the world works,
but only have endless quizzes and final exams, unless they cannot afford
any quality education for their kids.
In 2005, a number of vision people established a non-profit
organization, the Lotus Hill Institute in Ezhou, China, held a one-week workshop, and
discussed "image ground truth", sponsored by the NSFs of both the USA and
China. One primary objective of the Lotus Hill efforts is to map the
visual vocabulary in a huge set of natural images, and to develop a
common generative model for images. This work could be compared in spirit
to the Genome project in biology, though on a much smaller scale due to
the lack of funding. The institute currently has 37 employees, including
17 artists for image annotation. This poster will report some
examples of the parsed images. We call for broad collaborations and
contributions from the CVPR community.
Acknowledgement: David Mumford, Alan Yuille, Harry Shum, Fei-Fei Li,
David Martin, Charless Fowlkes, Yanxi Liu, ..., and many others; in China:
Xiaoming Fan, Zhenyu Yao, Liang Lin, Tianfu Wu, Feng Min, ... We welcome
more people to participate. The website of Lotus Hill Institute is at