Joint work with Jay Hegdé and Daniel Kersten
(University of Minnesota).
Although object segmentation seems easy for people under normal viewing
conditions, there are cases (such as camouflaged animals in natural
surroundings) in which object recognition remains adequate even though
segmentation is difficult. Machine vision algorithms often fail to detect objects
in these difficult tasks. To study the limits of object detection, we have
tested human observers on their ability to detect objects in clutter. We
used novel objects developed by Brady and Kersten called digital embryos in
a background of self-similar embryo clutter (JOV 2003). Although human
observers have demonstrated the abilities to detect, recognize, and segment
these objects, there has not yet been a computational solution for this
segmentation problem (Brady and Kersten 2003). Our subsequent studies have
demonstrated that, after observers learn to recognize digital embryos,
correctly detected learned embryos produce differential BOLD activation in the
lateral occipital area (LOA) and the dorsal focus (DF) relative to undetected
learned embryos.
Yali Amit (University of Chicago)
, Yann LeCun (New York University) http://yann.lecun.com
Yali Amit (University of Chicago)
, Jitendra Malik (University of California) http://www.cs.berkeley.edu/~malik/, Yair Weiss (Hebrew University)
, Song Chun Zhu (University of California)
Generative Models. Advocates: Zhu, Amit. Critics: Malik, Weiss.
Daniel Boley (University of Minnesota Twin Cities)
, Dongwei Cao (University of Minnesota Twin Cities)
Motion Recognition with a Convolution Kernel
We address the problem of human motion recognition in this paper. The goal
of human motion recognition is to recognize the type of motion recorded in a
video clip, which consists of a set of temporally ordered frames. By
defining a Mercer kernel between two video clips directly, we propose in
this paper a recognition strategy that can incorporate both the information
in each individual frame and the temporal ordering between frames. Combining
the proposed kernel with the support vector machine, one of the
most effective classification paradigms, the resulting recognition strategy
exhibits excellent performance on real data sets.
Joint work of Dongwei Cao, Osama T Masoud, Daniel Boley, and Nikolaos Papanikolopoulos
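The abstract does not spell out the kernel's construction, so the following is only a sketch of the general idea: a Mercer kernel between two clips built as a sum of frame-wise RBF kernels after resampling both clips to a common length, plugged into a support vector machine via a precomputed Gram matrix. The resampling step, the toy data, and names such as clip_kernel are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch (not the paper's exact kernel): a Mercer kernel between
# two video clips, each a temporally ordered array of per-frame feature vectors.
import numpy as np
from sklearn.svm import SVC

def resample(clip, T=20):
    """Index-resample a clip of shape (n_frames, d) to a fixed length T."""
    idx = np.linspace(0, len(clip) - 1, T).round().astype(int)
    return clip[idx]

def clip_kernel(a, b, gamma=0.5, T=20):
    """Sum of frame-wise RBF kernels after aligning both clips to length T.
    A sum of Mercer kernels at corresponding time steps is again Mercer."""
    a, b = resample(a, T), resample(b, T)
    d2 = np.sum((a - b) ** 2, axis=1)          # squared distance per time step
    return float(np.exp(-gamma * d2).sum())

def gram(clips_a, clips_b, **kw):
    return np.array([[clip_kernel(x, y, **kw) for y in clips_b] for x in clips_a])

# toy data: two motion classes with different temporal dynamics
rng = np.random.default_rng(0)
def make_clip(label):
    t = np.linspace(0, 1, rng.integers(15, 40))
    base = np.sin(2 * np.pi * (1 + label) * t)       # class-dependent signal
    return np.stack([base, t], axis=1) + 0.05 * rng.standard_normal((len(t), 2))

X = [make_clip(l) for l in (0, 1) * 20]
y = np.array([0, 1] * 20)
K = gram(X, X)
clf = SVC(kernel="precomputed").fit(K, y)
print("training accuracy:", clf.score(K, y))
```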
Joachim M. Buhmann (Eidgenössische TH Hönggerberg)
Learning Compositional Categorization Models
Joint work with Björn Ommer.
This contribution proposes a compositional approach to visual
object categorization of scenes. Compositions are learned from the
Caltech 101 database and form intermediate abstractions of images that
are semantically situated between low-level representations and high-level
categorization. Salient regions, which are described by localized feature
histograms, are detected as image parts. Subsequently, compositions
are formed as bags of parts with a locality constraint. Image categorization
is finally achieved by applying coupled probabilistic kernel classifiers
to the bag-of-compositions representation of a scene. In contrast to the
discriminative training of the categorizer, intermediate compositions are
learned in a generative manner, yielding relevant part agglomerations,
i.e. groupings that appear frequently in the dataset while simultaneously
supporting the discrimination between sets of categories.
Consequently, compositionality simplifies the learning of a complex
categorization model for complete scenes by splitting it up into simpler,
sharable compositions. The architecture is evaluated on the highly
challenging Caltech 101 database, which exhibits large intra-category
variations. Our compositional approach yields a significant enhancement over
a baseline model with the same feature representation but without
compositions. It shows competitive retrieval rates in the range of
52.2±2.6% on the Caltech 101 database.
Joachim M. Buhmann (Eidgenössische TH Hönggerberg)
, Pedro F. Felzenszwalb (University of Chicago)
, Jitendra Malik (University of California) http://www.cs.berkeley.edu/~malik/
We announce results on
Paley-Wiener theorems and discrepancy estimates for point clouds with
applications to scattering and vision.
This is joint work with Devaney (Northeastern), Luke (Delaware) and
Maymeskul (Georgia Southern).
(Supported, in part, by grants EP/C000285
and NSF-DMS-0439734. S. B. Damelin thanks the Institute for Mathematics
and its Applications for its hospitality.)
James Damon (University of North Carolina)
Local Views of Illuminated Surfaces
Visual cues for the shapes of objects and their positions
involve the interaction of:
geometric features of the objects, the shade/shadow regions
on the objects (and specularity), and the (apparent) contours
resulting from viewer direction. Visual recognition often also
involves the use of small movements in the viewer direction to decide
among ambiguities in the local cues.
We give the classification of the local configurations
for the case of a fixed single light source and perfectly diffuse
objects (without specular effects) having generic local geometric
features (edges, creases, corners and marking curves).
The classification includes both the "stable views", in
which the configurations do not change under small viewer movement,
and the generic transitions in local configurations which a
viewer expects to see under larger viewer movement.
The classification of stable views reduces to an
"alphabet" of local curve configurations, where certain ones have
multiple interpretations.
The classification is obtained by applying a rigorous
mathematical analysis using methods from singularity theory.
These results describe joint work of the presenter with
Peter Giblin and Gareth Haslinger.
Oliver Dorn (Universidad Carlos III de Madrid)
, Rossmary Villegas (Universidad Carlos III de Madrid)
Simultaneous reconstruction and segmentation of images from two-phase
fluid flow data using a level set technique
In many applications the task arises of reconstructing a segmented image
from indirectly measured data. Examples are medical applications,
e.g. creating tomographic images from electromagnetic or X-ray data,
or geophysical applications, for example finding landmines or
characterizing a petroleum reservoir. We will focus here on the latter
application, even though our method is quite general. Typically,
in these applications a pixel-based image is first created using
standard reconstruction techniques. Then, segmentation methods are
applied to this image. The drawback of this two-step approach is that
most segmentation tools change the reconstructed images without taking
into account the original data. It would be desirable in practical
applications to construct a segmented image directly from the data. We
present here a novel reconstruction scheme which is able to achieve
this goal. It combines a level set technique for segmenting the images
during the reconstruction from the data with a pixel-based correction
scheme in each of the segmented regions. We show that this novel
technique is able to provide segmented images which minimize a least
squares data misfit cost functional.
This is a collaborative project of Universidad Carlos III de Madrid
and Repsol-YPF, Spain.
University partners: Rossmary Villegas, Oliver Dorn, Manuel
Kindelan, Miguel Moscoso (UC3M). Industrial partners:
Elena Izaguirre, Francisco Mustieles (Repsol-YPF, Spanish oil company).
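As a rough illustration of segmenting directly from the data, here is a minimal level-set sketch under strong simplifying assumptions: a generic linear forward operator, two regions with known contrast values, and no pixel-based correction step. It performs gradient descent on the least-squares data misfit through a smoothed Heaviside parametrization; it is not the authors' reservoir-characterization algorithm.

```python
# A minimal level-set reconstruction sketch driven by a least-squares misfit.
import numpy as np

n = 32
rng = np.random.default_rng(1)
A = rng.standard_normal((200, n * n)) / n          # hypothetical forward operator

# ground-truth segmented model: a square inclusion (value c1) in background c2
c1, c2 = 2.0, 1.0
truth = np.full((n, n), c2); truth[10:22, 10:22] = c1
d = A @ truth.ravel()                               # noise-free synthetic data

def heaviside(phi, eps=1.0):                        # smoothed Heaviside
    return 0.5 * (1 + np.tanh(phi / eps))

def dirac(phi, eps=1.0):                            # its derivative
    return 0.5 / eps * (1 - np.tanh(phi / eps) ** 2)

# initial level set: a centred circle
yy, xx = np.mgrid[0:n, 0:n]
phi = 8.0 - np.hypot(xx - n / 2, yy - n / 2)

step = 0.5
for it in range(300):
    H = heaviside(phi)
    m = c1 * H + c2 * (1 - H)                       # piecewise-constant model
    r = A @ m.ravel() - d                           # data residual
    # gradient of 0.5*||A m - d||^2 w.r.t. phi (chain rule through H)
    g = (A.T @ r).reshape(n, n) * (c1 - c2) * dirac(phi)
    phi -= step * g

H = heaviside(phi)
m = c1 * H + c2 * (1 - H)
print("final misfit:", float(0.5 * np.sum((A @ m.ravel() - d) ** 2)))
```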
Alyosha Efros (Carnegie-Mellon University) http://www.cs.cmu.edu/~efros/, William T. Freeman (Massachusetts Institute of Technology)
Using Multiple Segmentations to Discover Objects and their
Extent in Image Collections
Given a large dataset of images, we seek to automatically discover
visually similar object classes, together with their spatial extent
(segmentation). The hope is that we will be able to automatically
recover commonly occurring objects, such as cars, trees, buildings,
etc. Our approach is to first obtain multiple segmentations of each
image, and to make the assumption that each object instance is
correctly segmented by at least one segmentation. The problem is then
reduced to finding clusters of correctly segmented objects within this
large "segment soup," i.e. one of grouping in the space of candidate
image segments.
The main insight of the paper is that segments corresponding to
objects will be exactly the ones represented by coherent clusters,
whereas segments overlapping object boundaries will need to be
explained by a mixture of several clusters. To paraphrase Leo Tolstoy:
all good segments are alike, each bad segment is bad in its own way.
Joint work with Bryan Russell, Josef Sivic and Andrew Zisserman.
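To make the "segment soup" idea concrete, here is a toy sketch under stated assumptions: a deliberately crude stand-in segmenter (intensity thresholds at several granularities), normalized histograms as segment descriptors, and k-means as the grouping step. The paper's actual segmenter and grouping model are not reproduced.

```python
# Pool segments from multiple segmentations of each image, describe each
# segment by a normalized intensity histogram, and cluster the soup.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def toy_image():
    """A dark object on a bright background, with noise."""
    img = np.full((32, 32), 0.8)
    r, c = rng.integers(4, 16, size=2)
    img[r:r+12, c:c+12] = 0.2
    return img + 0.05 * rng.standard_normal((32, 32))

def multiple_segmentations(img, ks=(2, 3, 4)):
    """Stand-in segmenter: threshold the image into k intensity bands."""
    segs = []
    for k in ks:
        edges = np.quantile(img, np.linspace(0, 1, k + 1)[1:-1])
        segs.append(np.digitize(img, edges))
    return segs

def segment_descriptors(img, labels, bins=8):
    descs = []
    for l in np.unique(labels):
        h, _ = np.histogram(img[labels == l], bins=bins, range=(0, 1))
        descs.append(h / max(h.sum(), 1))
    return descs

soup = []
for _ in range(20):                       # the image collection
    img = toy_image()
    for labels in multiple_segmentations(img):
        soup.extend(segment_descriptors(img, labels))

soup = np.array(soup)
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit(soup)
# Coherent (low-variance) clusters are the candidate "good" segments / objects.
for c in range(4):
    members = soup[clusters.labels_ == c]
    print(f"cluster {c}: {len(members)} segments, within-cluster var {members.var():.4f}")
```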
Sanja Fidler (University of Ljubljana)
Hierarchical Statistical Learning of Generic Parts of
Object Structure
In collaboration with Aleš Leonardis and Gregor Berginc.
With the growing interest in object categorization, various methods
have emerged that perform well in this challenging task, yet
are inherently limited to only a moderate number of object classes.
In pursuit of a more general categorization system, our framework
proposes a way to overcome the computational complexity
arising from the enormous number of different object categories
by exploiting the statistical properties of the highly structured
visual world. Our approach proposes a hierarchical acquisition
of generic parts of object structure, varying from simple to more
complex ones, which stem from the favorable statistics of
natural images. The parts recovered in the individual
layers of the hierarchy can be used in a top-down manner,
resulting in a robust statistical engine that could be
efficiently used within many of the current
categorization systems. The proposed approach has been applied
to large image datasets, yielding important statistical insights
into the generic parts of object structure.
We investigate the learning of the appearance of an object from a
single image of it. Instead of using a large number of pictures of
the object to recognize, we use a labeled reference database of
pictures of other objects to learn invariance to noise and
variations in pose and illumination. This acquired knowledge is then
used to predict whether two pictures of new objects, which do not appear
in the training pictures, actually display the same object.
We propose a generic scheme called chopping to address this
task. It relies on hundreds of random binary splits of the training
set chosen to keep together the images of any given object. Those
splits are extended to the complete image space with a simple
learning algorithm. Given two images, the responses of the split
predictors are combined with a Bayesian rule into a posterior
probability of similarity.
Experiments with the COIL-100 database and with a database of 150
degraded LaTeX symbols compare our method to classical learning
with several examples of the positive class and to direct learning
of the similarity.
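A loose sketch of the chopping flavor, with assumptions made explicit: toy object prototypes stand in for images, logistic regression extends each random split, and the final score is simply the fraction of splits on which the two pictures land on the same side, rather than the paper's exact Bayesian combination rule.

```python
# Random binary splits of the training objects, one predictor per split,
# and a pairwise similarity score from the agreement of split predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_objects, per_obj, dim, n_splits = 30, 10, 20, 100

prototypes = rng.standard_normal((n_objects, dim))
def sample(obj):                       # a noisy "picture" of object obj
    return prototypes[obj] + 0.3 * rng.standard_normal(dim)

X = np.array([sample(o) for o in range(n_objects) for _ in range(per_obj)])
obj_id = np.repeat(np.arange(n_objects), per_obj)

# train one predictor per random split; each split keeps all images of any
# given object on the same side
predictors = []
for _ in range(n_splits):
    side_of_object = (rng.permutation(n_objects) < n_objects // 2).astype(int)
    y = side_of_object[obj_id]
    predictors.append(LogisticRegression(max_iter=500).fit(X, y))

def similarity(a, b):
    """Fraction of splits whose predictors put a and b on the same side."""
    agree = [p.predict(a[None]) == p.predict(b[None]) for p in predictors]
    return float(np.mean(agree))

# evaluate on new objects never seen during training
new = rng.standard_normal((2, dim))
same = similarity(new[0] + 0.3 * rng.standard_normal(dim),
                  new[0] + 0.3 * rng.standard_normal(dim))
diff = similarity(new[0] + 0.3 * rng.standard_normal(dim),
                  new[1] + 0.3 * rng.standard_normal(dim))
print(f"same-object score {same:.2f}  vs  different-object score {diff:.2f}")
```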
Stuart Geman (Brown University)
, Pietro Perona (California Institute of Technology)
, Tomaso Poggio (Massachusetts Institute of Technology)
, Alain Trouve (École Normale Supérieure de Cachan)
Hierarchies of Parts. Advocates: Poggio, Geman. Critics: Perona, Trouve.
Bayesian Registration for Anatomical Landmark Detection
We propose an algorithmic solution for simultaneous detection of
landmarks in brain MRI. A landmark is a point of the image that
corresponds to a well-defined point in the anatomy and characterizes the
local geometry of the brain.
The location of the landmarks is identified with a unique deformation of
the underlying 3D space. We consider a set of non-rigid deformations
where the landmarks act as control points, extending the deformation to
the whole domain by spline interpolation.
We build a probabilistic model for the intensities of the MR image given
the landmark locations. We use a training set of hand-landmarked images
to estimate the parameters of the model. The resulting atlas is sharp in
the vicinity of the landmarks, where the deformation is given, and more
diffuse at greater distances from the control points.
In a new image, the landmark locations are estimated using a gradient
ascent algorithm on the likelihood function. It produces a partial
registration, which is more accurate in the vicinity of the landmarks.
We applied the algorithm to the localization of three landmarks belonging
to the hippocampus.
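As a small illustration of extending a landmark-defined deformation to the whole domain, here is a thin-plate spline sketch in 2D (the talk concerns 3D MRI, and its particular spline family and probabilistic model are not shown); the control-point coordinates are made up for the example.

```python
# Thin-plate spline interpolation of a deformation from landmark control points.
import numpy as np

def tps_kernel(r):
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(r > 0, r**2 * np.log(r), 0.0)

def fit_tps(src, dst):
    """Solve for TPS coefficients mapping source landmarks onto target ones."""
    n = len(src)
    K = tps_kernel(np.linalg.norm(src[:, None] - src[None, :], axis=-1))
    P = np.hstack([np.ones((n, 1)), src])
    A = np.zeros((n + 3, n + 3))
    A[:n, :n], A[:n, n:], A[n:, :n] = K, P, P.T
    b = np.zeros((n + 3, 2)); b[:n] = dst
    return np.linalg.solve(A, b)          # n kernel weights + 3 affine terms

def warp(points, src, coef):
    K = tps_kernel(np.linalg.norm(points[:, None] - src[None, :], axis=-1))
    P = np.hstack([np.ones((len(points), 1)), points])
    return K @ coef[:len(src)] + P @ coef[len(src):]

# control points: detected landmark locations vs. their atlas positions
src = np.array([[10., 10.], [10., 50.], [50., 10.], [50., 50.], [30., 30.]])
dst = src + np.array([[2., 1.], [-1., 2.], [0., -2.], [1., 1.], [3., 0.]])

coef = fit_tps(src, dst)
print(np.allclose(warp(src, src, coef), dst))   # interpolates the landmarks exactly
print(warp(np.array([[30., 40.]]), src, coef))  # deformation at a non-landmark point
```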
David Jacobs (University of Maryland)
Deformation-Invariant Image Matching and Plant Species Discovery
In vision, it is important to match regions or objects that are not
related by simple rigid or linear transformations. For example, looking
at a 3D surface from different viewpoints causes complex deformations in
its appearance. When the parts of objects articulate, their shape
changes non-linearly. When we compare different instances of objects
from the same class, their shape may vary in complex ways. To capture
deformations, we use a framework in which we embed images as 2D surfaces
in 3D. We then show that as we vary the parameters of this embedding,
geodesic distances on the surface become deformation-invariant. This
allows us to build deformation-invariant descriptors using geodesic
distance. For binary shapes, we develop related descriptors, using the
inner distance, which is invariant to articulations and captures the part
structure of objects. We evaluate these descriptors on a number of data
sets. In particular, the inner distance forms the basis of a shape
comparison method that we use to identify the species of plants from the
shape of their leaves. This is being used in a project, with the
Smithsonian Institution and Columbia University, to develop an Electronic
Field Guide that botanists can use to identify new species of plants.
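A minimal sketch of the embedding idea, under assumptions: the image is embedded as the surface (alpha*x, alpha*y, I(x, y)) in 3D, and geodesic distances are approximated by graph shortest paths over an 8-connected pixel graph. The descriptor construction and the limit argument from the talk are not reproduced; alpha and the toy image are illustrative.

```python
# Geodesic distances on an image embedded as a surface in 3D.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_distances(img, source, alpha=0.1):
    h, w = img.shape
    idx = lambda r, c: r * w + c
    G = lil_matrix((h * w, h * w))
    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):   # half of 8-neighbourhood
                r2, c2 = r + dr, c + dc
                if 0 <= r2 < h and 0 <= c2 < w:
                    d = np.sqrt((alpha * dr) ** 2 + (alpha * dc) ** 2
                                + (img[r, c] - img[r2, c2]) ** 2)
                    G[idx(r, c), idx(r2, c2)] = d
    dist = dijkstra(G.tocsr(), directed=False, indices=idx(*source))
    return dist.reshape(h, w)

# toy image: a bright ridge; geodesics must "climb over" the intensity change
img = np.zeros((20, 20)); img[:, 10] = 1.0
D = geodesic_distances(img, source=(10, 2), alpha=0.1)
print("distance to same side:", D[10, 8].round(2),
      " across the ridge:", D[10, 12].round(2))
```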
Svetlana Lazebnik (University of Illinois at Urbana-Champaign)
Beyond Bags of Features: Spatial Pyramid Matching for Recognizing
Natural Scene Categories
We present a method for recognizing scene categories based on
approximate global geometric correspondence. This method works
by partitioning the image into increasingly fine sub-regions and
computing histograms of local features found inside each sub-region.
The resulting ``spatial pyramid'' is a simple and computationally
efficient extension of an orderless bag-of-features image
representation, and it shows significantly improved performance
on challenging scene categorization tasks. Specifically, our
proposed method exceeds most previously published methods on the
Caltech-101 database and achieves high accuracy on a large
database of fifteen natural scene categories. The spatial pyramid
framework also offers insights into the success of several recently
proposed image descriptions, including Torralba's "gist" and
Lowe's SIFT descriptors.
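The spatial pyramid kernel itself is compact enough to sketch. The code below assumes features have already been extracted and quantized into visual words, represents each image as (x, y, word) triples with coordinates in [0, 1), and computes the weighted multi-level histogram intersection; the toy data and per-image normalization are illustrative choices.

```python
# Spatial pyramid histogram and histogram-intersection kernel.
import numpy as np

def pyramid_histogram(points, vocab_size, levels=2):
    """Concatenate per-cell word histograms for grids of 1x1, 2x2, ..., 2^L x 2^L,
    weighted 1/2^L at level 0 and 1/2^(L-l+1) at level l >= 1."""
    feats = []
    for l in range(levels + 1):
        cells = 2 ** l
        hist = np.zeros((cells, cells, vocab_size))
        for x, y, w in points:
            hist[int(y * cells), int(x * cells), w] += 1
        weight = 1.0 / 2 ** levels if l == 0 else 1.0 / 2 ** (levels - l + 1)
        feats.append(weight * hist.ravel())
    v = np.concatenate(feats)
    return v / max(v.sum(), 1e-12)        # normalize so images are comparable

def spm_kernel(h1, h2):
    """Histogram intersection on the weighted pyramid vectors."""
    return np.minimum(h1, h2).sum()

rng = np.random.default_rng(0)
def toy_image(shift):                      # same words, different spatial layout
    xs = rng.uniform(0 + shift, 0.5 + shift, 50)
    ys = rng.uniform(0.2, 0.8, 50)
    ws = rng.integers(0, 20, 50)
    return list(zip(xs, ys, ws))

a, b, c = toy_image(0.0), toy_image(0.05), toy_image(0.45)
ha, hb, hc = (pyramid_histogram(p, vocab_size=20) for p in (a, b, c))
print("similar layout:", round(spm_kernel(ha, hb), 3),
      " different layout:", round(spm_kernel(ha, hc), 3))
```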
Learning the Features: Supervised versus Unsupervised
Hierarchical object recognition models employ a number of different
strategies to learn the feature detectors at the various levels:
fragment selection by mutual information in Ullman's model, fixed Gabor
wavelets and fragment selection in Poggio's model, supervised gradient
descent in LeCun's convolutional nets, and layer-by-layer unsupervised
learning followed by supervised gradient descent in Hinton's
stacked restricted Boltzmann machine model.
We will present a new unsupervised algorithm for feature learning, and
compare classification performance obtained by training a
convolutional net in the conventional (supervised) way, with the same
convolutional net where the features have been initialized using the
new unsupervised method.
Juan Carlos Niebles (University of Illinois at Urbana-Champaign)
Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words
(Project page)
We present a novel unsupervised learning method for human action
categories. A video sequence is represented as a collection of
spatial-temporal words by extracting space-time interest points. The
algorithm learns the probability distributions of the spatial-temporal
words and intermediate topics corresponding to human action categories
automatically using a probabilistic Latent Semantic Analysis (pLSA)
model. The learned model is then used for human action categorization
and localization in a novel video. We test our algorithm on two
datasets: the KTH human action dataset and a recent dataset of figure
skating actions. Our results are on par with or slightly better than the best
reported results. In addition, our algorithm can recognize and localize
multiple actions in long and complex video sequences containing multiple
motions.
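A bare-bones sketch of the pLSA step, assuming interest-point detection and codebook construction have already produced a video-by-word count matrix; the EM updates below follow the standard pLSA formulation, and the toy corpus with two topics merely stands in for action categories.

```python
# pLSA via EM on a toy "video x spatial-temporal word" count matrix.
import numpy as np

rng = np.random.default_rng(0)

def plsa(counts, n_topics, iters=100):
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities P(z | d, w), shape (docs, words, topics)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / np.maximum(joint.sum(-1, keepdims=True), 1e-12)
        # M-step
        weighted = counts[:, :, None] * resp
        p_w_z = weighted.sum(0).T
        p_w_z /= np.maximum(p_w_z.sum(1, keepdims=True), 1e-12)
        p_z_d = weighted.sum(1)
        p_z_d /= np.maximum(p_z_d.sum(1, keepdims=True), 1e-12)
    return p_w_z, p_z_d

# toy corpus: two action "categories" favour disjoint halves of the vocabulary
docs_per_class = 15
topic_profiles = np.vstack([np.r_[np.ones(25), 0.05 * np.ones(25)],
                            np.r_[0.05 * np.ones(25), np.ones(25)]])
topic_profiles /= topic_profiles.sum(1, keepdims=True)
counts = np.vstack([rng.multinomial(200, topic_profiles[c])
                    for c in (0, 1) for _ in range(docs_per_class)])

p_w_z, p_z_d = plsa(counts, n_topics=2)
print("topic assigned to each video:", p_z_d.argmax(1))
```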
Tomaso Poggio (Massachusetts Institute of Technology)
Models as tools in neuroscience: Object recognition in cortex at CBCL
The classical model of visual processing in cortex is a hierarchy of
increasingly sophisticated representations, extending the Hubel and
Wiesel model of simple to complex cells in a natural way. Neuroscience
work in the Poggio Lab focuses on a computational model of object
recognition which is based on these principles and consistent with a
number of recent physiological and psychophysical experiments. The
goal of our work is to explain cognitive phenomena in terms of simple
and well-understood computational processes in a physiologically
plausible model. Examples of ongoing projects using our computational
model to motivate experiments and guide the analysis of experimental
data include
1) read-out and classification of IT neural data to
investigate the neural basis of object categorization (in
collaboration with the DiCarlo and Miller labs),
2) simulation of biophysically plausible mechanisms for the key operations
underlying object recognition in the model (in collaboration with
Christof Koch at Caltech and David Ferster at Northwestern),
3) analysis and prediction of neural data in intermediate visual areas,
and
4) comparison of model performance to humans as well as
state-of-the-art machine vision systems (in collaboration with the
Oliva lab).
Joint work with Ulf Knoblich, Minjoon Kouh, Gabriel Kreiman, Ethan Meyers, and Thomas Serre.
Manabu Tanifuji (The Institute of Physical and Chemical Research (RIKEN))
Object image representation in area TE of macaque monkeys
Information about object images is transmitted from the primary visual
cortex to the inferotemporal (IT) cortex through multiple pre-striate
areas in macaque monkeys. To understand neural mechanisms for object
recognition, we investigate neuronal representation of object images in
area TE of macaque monkeys.
To achieve this goal, we combined single cellular recording techniques
and optical imaging techniques that enable us to visualize neural
activation at columnar levels. It is essential to use imaging techniques
because previously it has been shown that neurons in area TE respond to
geometrically less complex features rather than to more complex real
objects.
Major conclusions are: (1) an object is represented by a combination of
cortical columns, each of which represents a visual feature, (2) a
specific combination for an object is made up of both active and inactive
feature columns, (3) the feature columns do not necessarily represent
spatially localized visual features, and (4) some visual features
represented by columns are more related to global features, such as
spatial arrangements of parts.
For two decades, techniques based on Partial Differential Equations (PDEs) have
been used in monochrome and color image processing for image segmentation,
restoration, smoothing and multiscale image representation. Among these
techniques, parabolic PDEs have received a lot of attention for image smoothing
and image restoration purposes. Image smoothing by parabolic PDEs can be seen
as a continuous transformation of the original image into a space of
progressively smoother images identified by the "scale" or level of image
smoothing. The semantically meaningful objects in an image can be of any size,
that is, they can be located at different image scales, in the continuum
scale-space generated by the PDE. The adequate selection of an image scale
smooths out undesirable variability that at lower scales constitutes a source
of error in segmentation and classification algorithms. This paper proposes a
framework for generating a scale space representation for a hyperspectral image
using PDE methods. We illustrate some of our ideas by hyperspectral image
smoothing using nonlinear diffusion. The extension of scalar nonlinear
diffusion to hyperspectral imagery and a discussion of how the spectral and
spatial domains are transformed in the scale space representation are
presented.
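To illustrate the extension of scalar nonlinear diffusion to a multi-band cube, here is a small Perona-Malik-style sketch in which a single diffusivity, computed from the joint gradient magnitude across all bands, drives the smoothing of every band. The scheme, its parameters, and the toy scene are assumptions for illustration; scale selection and the paper's specific formulation are not reproduced.

```python
# Nonlinear diffusion of a hyperspectral cube with a band-shared diffusivity.
import numpy as np

def diffuse_hyperspectral(cube, iters=50, kappa=0.6, dt=0.2):
    """cube: (rows, cols, bands). Explicit Perona-Malik-type scheme with a
    diffusivity shared across bands (periodic boundaries via np.roll)."""
    u = cube.astype(float).copy()
    g = lambda d: np.exp(-((d ** 2).sum(-1) / kappa ** 2))[..., None]
    for _ in range(iters):
        dN = np.roll(u, -1, 0) - u
        dS = np.roll(u,  1, 0) - u
        dE = np.roll(u, -1, 1) - u
        dW = np.roll(u,  1, 1) - u
        u = u + dt * (g(dN) * dN + g(dS) * dS + g(dE) * dE + g(dW) * dW)
    return u

# toy scene: two materials with distinct spectra, plus band-wise noise
rng = np.random.default_rng(0)
rows, cols, bands = 40, 40, 16
clean = np.tile(np.linspace(0.2, 0.8, bands), (rows, cols, 1))
clean[:, 20:, :] = np.linspace(0.8, 0.2, bands)
noisy = clean + 0.05 * rng.standard_normal(clean.shape)

smooth = diffuse_hyperspectral(noisy)
rmse = lambda x: np.sqrt(((x - clean) ** 2).mean())
print(f"RMSE before: {rmse(noisy):.4f}   after: {rmse(smooth):.4f}")
```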
Christian Wallraven (Max-Planck-Institut für Biologische Kybernetik)
Object recognition: Integrating psychophysics, computer vision and machine learning
Despite several decades of research in the field of computer vision, there still exists no recognition system which is able to match the
visual performance of humans. The apparent ease with which visual tasks such as recognition and categorization are solved by humans is
testimony of a highly optimized visual system which not only exhibits excellent robustness and generalization capabilities but is in
addition highly flexible in learning and organizing new data. Using an integrative approach to the problem of object recognition we have
developed a framework that combines cognitive psychophysics, computer vision as well as machine learning. This framework is able to model
results from psychophysics and, in addition, delivers excellent recognition performance in computational recognition experiments.
Furthermore, the framework also interfaces well with advanced classification schemes from machine learning thus further broadening the
scope of application.
The framework is derived from and motivated by results from cognitive research - in particular on the dynamic aspect of recognition
processes, the contribution of which has up to now been largely neglected in theoretical modeling of recognition. Several recent
experiments in perceptual research have found the temporal dimension to play a large role in object recognition - both by being able to
mediate learning of object representations and by providing an integral part of the representation itself. In addition, we have conducted
experiments that shed light on how objects might be represented in the brain using local pictorial features and their spatial relations
thus forming a sparse and at the same time structured object representation. Based on these results, the computational implementation of
the framework provides spatiotemporal processing of visual input by means of a structured, appearance-based object representation.
We have shown that several experiments can successfully be modeled using a computational implementation of the framework, which
demonstrates the perceptual plausibility of spatio-temporal, local feature processing. In addition, based on the computational modeling
results, a number of performance predictions can be made that can be tested in further psychophysical experiments thus closing the loop
between experimental work and computational modeling.
In addition, we have demonstrated the benefits of using spatio-temporal information in several recognition experiments on both artificial
and real-world data in several computational recognition experiments (including experiments run on a robotic platform). In this context,
we have developed a novel method for combining spatio-temporal object representations based on local features and state-of-the-art
statistical learning methods (Support Vector Machines). Several recognition experiments have shown that by combining efficient
representations from computational vision with robust classification schemes from machine learning, excellent recognition performance can
be achieved. In this paper, we present an overview of the psychophysical and computational experiments and discuss the benefits and limitations of the
proposed framework.
Song Chun Zhu (University of California)
Mapping the Visual Vocabulary—A 'Genome' Project of Vision?
In a private vision quest meeting held in Maine many years ago, an artist,
Jennifer Mumford (wife of David Mumford), divided the audience into pairs.
People who faced the screen were asked to describe the contents of a
picture to those who looked away from the screen. The latter then
compared the pictures visualized in their minds, based on what they heard,
against the real picture on the screen. It was no surprise that a
thousand English words were not enough for a picture. We simply do not
have the right words to describe natural images yet.
If humans do not have such a vocabulary in the first place, how do we
program computers to understand a large set of images?
Many people would argue, hey, shouldn't computers learn to understand
images by themselves in an unsupervised way, especially in a
discriminative task for image classification?
But, look, no parents in the world would send their kids to schools that
have no teachers, no generative syllabus to explain how the world works,
but only have endless quizzes and final exams, unless they cannot afford
any quality education for their kids.
In 2005, a number of vision researchers established a non-profit
organization, the Lotus Hill Institute in Ezhou, China, and held a one-week
workshop on image ground truth, sponsored by the NSFs of both the USA and
China. One primary objective of the Lotus Hill effort is to map the
visual vocabulary of a huge set of natural images and to develop a
common generative model for images. This work could be compared in spirit
to the Genome project in biology, though on a much smaller scale due to
the lack of funding. The institute currently has 37 employees, including
17 artists for image annotation. This poster will report some
examples of the parsed images. We call for broad collaborations and
contributions from the CVPR community.
Acknowledgement: David Mumford, Alan Yuille, Harry Shum, Fei-Fei Li,
David Martin, Charless Fowlkes, Yanxi Liu, ..., and many others in China
Xiaoming Fan, Zhenyu Yao, Liang Lin, Tianfu Wu, Feng Min, ... We welcome
more people to participate. The website of Lotus Hill Institute is at
www.lotushill.org.