Abstracts and Talk Materials
Visual Learning and Recognition
May 22 - 26, 2006

Yali Amit (University of Chicago)
Yann LeCun (New York University)

The Limits of Learning
Panelists: Amit, LeCun

No abstract.

Yali Amit (University of Chicago)
Jitendra Malik (University of California, Berkeley)
Yair Weiss (Hebrew University)
Song Chun Zhu (University of California, Los Angeles)

Generative Models
Advocates: Zhu, Amit
Critics: Malik, Weiss

No abstract.

Daniel Boley (University of Minnesota, Twin Cities)
Dongwei Cao (University of Minnesota, Twin Cities)

Motion Recognition with a Convolution Kernel

We address the problem of human motion recognition. The goal of human motion recognition is to recognize the type of motion recorded in a video clip, which consists of a set of temporally ordered frames. By defining a Mercer kernel directly between two video clips, we propose a recognition strategy that incorporates both the information in each individual frame and the temporal ordering between frames. Combining the proposed kernel with the support vector machine, one of the most effective classification paradigms, the resulting recognition strategy exhibits excellent performance on real data sets.

Joint work of Dongwei Cao, Osama T Masoud, Daniel Boley, and Nikolaos Papanikolopoulos
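The abstract does not spell out the kernel itself; as an illustrative sketch (the RBF frame kernel and the Gaussian temporal weighting below are our own assumptions, not necessarily the authors' construction), one valid Mercer kernel between clips sums frame-to-frame similarities weighted by how close the frames' normalized temporal positions are, so that both per-frame appearance and temporal ordering contribute:

```python
import numpy as np

def frame_kernel(x, y, gamma=1.0):
    """RBF kernel between two per-frame feature vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def clip_kernel(X, Y, gamma=1.0, tau=10.0):
    """Convolution-style kernel between two clips (lists of frame features).

    Frame-level similarities are weighted by a Gaussian in the gap between
    the frames' normalized temporal positions. Sums of products of Mercer
    kernels are Mercer, so this is a valid SVM kernel.
    """
    m, n = len(X), len(Y)
    k = 0.0
    for i in range(m):
        for j in range(n):
            t = np.exp(-tau * (i / m - j / n) ** 2)  # temporal alignment weight
            k += t * frame_kernel(X[i], Y[j], gamma)
    return k / (m * n)

# toy usage: two short "clips" of 4-dimensional frame features
rng = np.random.default_rng(0)
a = [rng.standard_normal(4) for _ in range(5)]
b = [rng.standard_normal(4) for _ in range(7)]
print(clip_kernel(a, b))
```

A clip compared with itself scores higher than two unrelated clips, which is the property the SVM exploits.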

Joachim M. Buhmann (Eidgenössische TH Hönggerberg)
Pedro F. Felzenszwalb (University of Chicago)
Jitendra Malik (University of California, Berkeley)

Flexible Templates
Advocates: Malik, Felzenszwalb
Critic: Buhmann

No abstract.

Joachim M. Buhmann (Eidgenössische TH Hönggerberg)

Learning Compositional Categorization Models

Joint work with Björn Ommer.

This contribution proposes a compositional approach to visual object categorization of scenes. Compositions are learned from the Caltech 101 database and form intermediate abstractions of images that are semantically situated between low-level representations and high-level categorization. Salient regions, which are described by localized feature histograms, are detected as image parts. Subsequently, compositions are formed as bags of parts with a locality constraint. Image categorization is finally achieved by applying coupled probabilistic kernel classifiers to the bag-of-compositions representation of a scene. In contrast to the discriminative training of the categorizer, intermediate compositions are learned in a generative manner, yielding relevant part agglomerations, i.e. groupings that appear frequently in the dataset while simultaneously supporting the discrimination between sets of categories. Consequently, compositionality simplifies the learning of a complex categorization model for complete scenes by splitting it up into simpler, sharable compositions. The architecture is evaluated on the highly challenging Caltech 101 database, which exhibits large intra-category variations. Our compositional approach yields a significant enhancement over a baseline model with the same feature representation but without compositions, showing competitive retrieval rates in the range of 52.2±2.6% on the Caltech 101 database.

James Damon (University of North Carolina, Chapel Hill)

Local Views of Illuminated Surfaces

Visual clues for the shapes of objects and their positions involve the interaction of: geometric features of the objects, the shade/shadow regions on the objects (and specularity), and the (apparent) contours resulting from viewer direction. Visual recognition often also involves the use of small movements in the viewer direction to decide among ambiguities in the local clues. We give the classification of the local configurations for the case of a fixed single light source and perfectly diffuse objects (without specular effects) having generic local geometric features (edges, creases, corners, and marking curves). The classification includes both the "stable views", in which the configurations do not change under small viewer movement, and the generic transitions in local configurations which a viewer expects to see under larger viewer movement. The classification of stable views reduces to an "alphabet" of local curve configurations, where certain ones have multiple interpretations. The classification is obtained by applying a rigorous mathematical analysis using methods from singularity theory.

These results describe joint work of the presenter with Peter Giblin and Gareth Haslinger.

Oliver Dorn (University of Manchester)
Rossmary Villegas (Heriot-Watt University)

Simultaneous reconstruction and segmentation of images from two-phase fluid flow data using a level set technique

In many applications the task arises of reconstructing a segmented image from indirectly measured data. Examples are medical applications, e.g. creating tomographic images from electromagnetic or X-ray data, or geophysical applications such as finding landmines or characterizing a petroleum reservoir. We focus here on the latter application, even though our method is quite general. Typically, in these applications a pixel-based image is first created using standard reconstruction techniques, and segmentation methods are then applied to this image. The drawback of this two-step approach is that most segmentation tools change the reconstructed images without taking into account the original data. It would be desirable in practical applications to construct a segmented image directly from the data. We present a novel reconstruction scheme which achieves this goal. It combines a level set technique for segmenting the images during the reconstruction from the data with a pixel-based correction scheme in each of the segmented regions. We show that this novel technique is able to provide segmented images which minimize a least-squares data misfit cost functional.

This is a collaborative project of Universidad Carlos III de Madrid and Repsol-YPF, Spain. University partners: Rossmary Villegas, Oliver Dorn, Manuel Kindelan, Miguel Moscoso (UC3M). Industrial partners: Elena Izaguirre, Francisco Mustieles (Repsol-YPF, Spanish oil company).

Alyosha Efros (Carnegie-Mellon University)
William T. Freeman (Massachusetts Institute of Technology)

Using Multiple Segmentations to Discover Objects and their Extent in Image Collections

Given a large dataset of images, we seek to automatically discover visually similar object classes, together with their spatial extent (segmentation). The hope is that we will be able to automatically recover commonly occurring objects, such as cars, trees, buildings, etc. Our approach is to first obtain multiple segmentations of each image and to make the assumption that each object instance is correctly segmented by at least one segmentation. The problem is then reduced to finding clusters of correctly segmented objects within this large "segment soup," i.e., a grouping problem in the space of candidate image segments.

The main insight of the paper is that segments corresponding to objects will be exactly the ones represented by coherent clusters, whereas segments overlapping object boundaries will need to be explained by a mixture of several clusters. To paraphrase Leo Tolstoy: all good segments are alike, each bad segment is bad in its own way.

Joint work with Bryan Russell, Josef Sivic and Andrew Zisserman
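The grouping step can be illustrated schematically: cluster descriptors of all candidate segments, then score each segment by how tightly it fits its cluster, so that segments straddling object boundaries stand out. The k-means clustering and the toy 2-d "descriptors" below are stand-ins for illustration, not the talk's actual method:

```python
import numpy as np

def kmeans(X, init, iters=50):
    """Plain k-means with explicit initial centroids."""
    C = init.astype(float).copy()
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        z = d.argmin(1)                       # hard assignments
        for j in range(len(C)):
            if np.any(z == j):
                C[j] = X[z == j].mean(0)
    return C, z

def segment_scores(X, C, z):
    """Coherence score: negative distance to the assigned centroid.
    Segments that straddle object boundaries sit far from every
    cluster and therefore score poorly."""
    return -np.sqrt(((X - C[z]) ** 2).sum(-1))

# toy "segment soup": two tight clusters of good segments plus one
# segment that straddles both objects
rng = np.random.default_rng(1)
good0 = rng.normal(0.0, 0.02, (5, 2))
good1 = rng.normal(1.0, 0.02, (5, 2))
bad = np.array([[0.5, 0.5]])
X = np.vstack([good0, good1, bad])
C, z = kmeans(X, X[[0, 5]])
print(segment_scores(X, C, z).argmin())  # -> 10 (the straddling segment)
```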

Sanja Fidler (University of Ljubljana)

Hierarchical Statistical Learning of Generic Parts of Object Structure

In collaboration with Leonardis, Ales, and Berginc, Gregor.

With the growing interest in object categorization various methods have emerged that perform well in this challenging task, yet are inherently limited to only a moderate number of object classes. In pursuit of a more general categorization system our framework proposes a way to overcome the computational complexity encompassing the enormous number of different object categories by exploiting the statistical properties of the highly structured visual world. Our approach proposes a hierarchical acquisition of generic parts of object structure, varying from simple to more complex ones, which stem from the favorable statistics of natural images. The parts recovered in the individual layers of the hierarchy can be used in a top-down manner resulting in a robust statistical engine that could be efficiently used within many of the current categorization systems. The proposed approach has been applied to large image datasets yielding important statistical insights into the generic parts of object structure.

Francois Fleuret (École Polytechnique Fédérale de Lausanne (EPFL))

Pattern Recognition from One Example by Chopping

We investigate the learning of the appearance of an object from a single image of it. Instead of using a large number of pictures of the object to recognize, we use a labeled reference database of pictures of other objects to learn invariance to noise and variations in pose and illumination. This acquired knowledge is then used to predict whether two pictures of new objects, which do not appear in the training pictures, actually display the same object.

We propose a generic scheme called chopping to address this task. It relies on hundreds of random binary splits of the training set chosen to keep together the images of any given object. Those splits are extended to the complete image space with a simple learning algorithm. Given two images, the responses of the split predictors are combined with a Bayesian rule into a posterior probability of similarity.
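A minimal sketch of the chopping idea follows; nearest-mean split predictors stand in for the learned predictors, and a simple agreement fraction stands in for the Bayesian combination (all parameter choices below are illustrative):

```python
import numpy as np

def train_chopping(X, labels, n_splits=200, seed=0):
    """For each random binary split of the training objects, extend the
    split to all of feature space with a trivial predictor: nearest of
    the two side means (a stand-in for the learned split predictors)."""
    rng = np.random.default_rng(seed)
    objs = np.unique(labels)
    splits = []
    for _ in range(n_splits):
        side = dict(zip(objs, rng.integers(0, 2, len(objs))))
        y = np.array([side[l] for l in labels])
        if y.min() == y.max():
            continue  # all objects on one side: skip degenerate split
        splits.append((X[y == 0].mean(0), X[y == 1].mean(0)))
    return splits

def same_object_score(a, b, splits):
    """Fraction of split predictors putting a and b on the same side;
    a crude stand-in for combining responses into a posterior."""
    agree = sum(
        (np.sum((a - m0) ** 2) > np.sum((a - m1) ** 2))
        == (np.sum((b - m0) ** 2) > np.sum((b - m1) ** 2))
        for m0, m1 in splits)
    return agree / len(splits)

# training set: 8 known objects, 20 views each, in a 10-d feature space
rng = np.random.default_rng(1)
centers = rng.standard_normal((8, 10))
X = np.vstack([c + 0.05 * rng.standard_normal((20, 10)) for c in centers])
labels = np.repeat(np.arange(8), 20)
splits = train_chopping(X, labels)

# two views of one NEW object vs. views of two different new objects
new1, new2 = rng.standard_normal((2, 10))
s_same = same_object_score(new1 + 0.05 * rng.standard_normal(10),
                           new1 + 0.05 * rng.standard_normal(10), splits)
s_diff = same_object_score(new1, new2, splits)
print(s_same, s_diff)
```

Two views of the same unseen object fall on the same side of almost every split, while views of different objects agree far less often.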

Experiments with the COIL-100 database and with a database of 150 degraded LaTeX symbols compare our method to classical learning with several examples of the positive class and to direct learning of the similarity.

Stuart Geman (Brown University)
Pietro Perona (California Institute of Technology)
Tomaso Poggio (Massachusetts Institute of Technology)
Alain Trouve (École Normale Supérieure de Cachan)

Hierarchies of Parts
Advocates: Poggio, Geman
Critics: Perona, Trouve

No abstract.

Daniel Huttenlocher (Cornell University)
Svetlana Lazebnik (University of Illinois at Urbana-Champaign)

Invariant Local Descriptors
Advocate: Lazebnik
Critic: Huttenlocher

No abstract.

Camille Izard (Johns Hopkins University)
Bruno Jedynak (Johns Hopkins University)

Bayesian Registration for Anatomical Landmark Detection

We propose an algorithmic solution for simultaneous detection of landmarks in brain MRI. A landmark is a point of the image that corresponds to a well-defined point in the anatomy and characterizes the local geometry of the brain. The location of the landmarks is identified with a unique deformation of the underlying 3D space. We consider a set of non-rigid deformations where the landmarks act as control points, extending the deformation to the whole domain by spline interpolation.

We build a probabilistic model for the intensities of the MR image given the landmark locations. We use a training set of hand-landmarked images to estimate the parameters of the model. The resulting atlas is sharp in the vicinity of the landmarks, where the deformation is given, and more diffuse at a greater distance from the control points. In a new image, the landmark locations are estimated using a gradient ascent algorithm on the likelihood function. This produces a partial registration, which is more accurate in the vicinity of the landmarks. We applied the algorithm to the localization of three landmarks belonging to the hippocampus.

David Jacobs (University of Maryland)

Deformation-Invariant Image Matching and Plant Species Discovery

In vision, it is important to match regions or objects that are not related by simple rigid or linear transformations. For example, looking at a 3D surface from different viewpoints causes complex deformations in its appearance. When the parts of objects articulate, their shape changes non-linearly. When we compare different instances of objects from the same class, their shape may vary in complex ways. To capture deformations, we use a framework in which we embed images as 2D surfaces in 3D. We then show that as we vary the parameters of this embedding, geodesic distances on the surface become deformation-invariant. This allows us to build deformation-invariant descriptors using geodesic distance. For binary shapes, we develop related descriptors using the inner distance, which is invariant to articulations and captures the part structure of objects. We evaluate these descriptors on a number of data sets. In particular, the inner distance forms the basis of a shape comparison method that we use to identify the species of plants from the shape of their leaves. This is being used in a project, with the Smithsonian Institution and Columbia University, to develop an Electronic Field Guide that botanists can use to identify new species of plants.
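The inner distance between two points of a binary shape can be sketched as a shortest path constrained to stay inside the shape; a minimal grid-based version (illustrative, not the authors' implementation) uses breadth-first search on the shape's pixels:

```python
from collections import deque

def inner_distance(shape, src, dst):
    """Shortest 4-connected path length between two pixels that stays
    inside the binary shape -- the 'inner distance'. Returns None if
    dst cannot be reached without leaving the shape."""
    rows, cols = len(shape), len(shape[0])
    dist = {src: 0}
    q = deque([src])
    while q:
        r, c = q.popleft()
        if (r, c) == dst:
            return dist[(r, c)]
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
               and shape[nr][nc] and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                q.append((nr, nc))
    return None

# an L-shaped region: the path between the two arm tips must go around
# the corner, so the inner distance exceeds the Euclidean distance
L = [[1, 0, 0],
     [1, 0, 0],
     [1, 1, 1]]
print(inner_distance(L, (0, 0), (2, 2)))  # -> 4
```

Bending the arm of the L changes the Euclidean distance between the tips but not the path length through the shape, which is why the descriptor is articulation-invariant.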

Svetlana Lazebnik (University of Illinois at Urbana-Champaign)

Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories

We present a method for recognizing scene categories based on approximate global geometric correspondence. The method works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds most previously published methods on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptors, including Torralba's "gist" and Lowe's SIFT descriptors.
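The pyramid match can be sketched directly from this description; the following minimal version assumes features have already been quantized into visual words with normalized image coordinates (the input format is our assumption):

```python
import numpy as np

def spatial_pyramid_kernel(f1, f2, vocab_size, levels=2):
    """Spatial pyramid match kernel between two images, each given as a
    list of (x, y, word) with x, y in [0, 1)."""
    def hists(feats, l):
        g = 2 ** l  # grid is g x g at level l
        h = np.zeros((g, g, vocab_size))
        for x, y, w in feats:
            h[min(int(x * g), g - 1), min(int(y * g), g - 1), w] += 1
        return h

    # histogram intersections at each resolution (coarse to fine)
    I = [np.minimum(hists(f1, l), hists(f2, l)).sum()
         for l in range(levels + 1)]
    # matches at the finest level count fully; matches first found at a
    # coarser level l are discounted by 1 / 2^(levels - l)
    k = I[levels]
    for l in range(levels):
        k += (I[l] - I[l + 1]) / 2 ** (levels - l)
    return k

# identical feature sets match perfectly at every level
f = [(0.1, 0.2, 0), (0.6, 0.7, 1), (0.8, 0.3, 1)]
print(spatial_pyramid_kernel(f, f, vocab_size=2))  # -> 3.0
```

With a single level the kernel reduces to plain bag-of-features histogram intersection; the finer grids are what add the approximate geometric correspondence.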

Yann LeCun (New York University)

Learning the Features: Supervised versus Unsupervised

Hierarchical object recognition models employ a number of different strategies to learn the feature detectors at the various levels: fragment selection by mutual information in Ullman's model, fixed Gabor wavelets and fragment selection in Poggio's model, supervised gradient descent in LeCun's convolutional nets, and layer-by-layer unsupervised learning followed by supervised gradient descent in Hinton's stacked restricted Boltzmann machine model.

We will present a new unsupervised algorithm for feature learning, and compare classification performance obtained by training a convolutional net in the conventional (supervised) way, with the same convolutional net where the features have been initialized using the new unsupervised method.

Juan Carlos Niebles (University of Illinois at Urbana-Champaign)

Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words

We present a novel unsupervised learning method for human action categories. A video sequence is represented as a collection of spatial-temporal words by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and the intermediate topics corresponding to human action categories using a probabilistic Latent Semantic Analysis (pLSA) model. The learned model is then used for human action categorization and localization in a novel video. We test our algorithm on two datasets: the KTH human action dataset and a recent dataset of figure skating actions. Our results are on par with or slightly better than the best reported results. In addition, our algorithm can recognize and localize multiple actions in long and complex video sequences containing multiple motions.

Tomaso Poggio (Massachusetts Institute of Technology)

Models as tools in neuroscience: Object recognition in cortex at CBCL

The classical model of visual processing in cortex is a hierarchy of increasingly sophisticated representations, extending the Hubel and Wiesel model of simple to complex cells in a natural way. Neuroscience work in the Poggio Lab focuses on a computational model of object recognition which is based on these principles and consistent with a number of recent physiological and psychophysical experiments. The goal of our work is to explain cognitive phenomena in terms of simple and well-understood computational processes in a physiologically plausible model. Examples of ongoing projects using our computational model to motivate experiments and guide the analysis of experimental data include 1) read-out and classification of IT neural data to investigate the neural basis of object categorization (in collaboration with the DiCarlo and Miller labs), 2) simulation of biophysically plausible mechanisms for the key operations underlying object recognition in the model (in collaboration with Christof Koch at Caltech and David Ferster at Northwestern), 3) analysis and prediction of neural data in intermediate visual areas, and 4) comparison of model performance to humans as well as state-of-the-art machine vision systems (in collaboration with the Oliva lab).

Joint work with Ulf Knoblich, Minjoon Kouh, Gabriel Kreiman, Ethan Meyers, and Thomas Serre.

Manabu Tanifuji (The Institute of Physical and Chemical Research (RIKEN))

Object image representation in area TE of macaque monkeys

Information about object images is transmitted from the primary visual cortex to the inferotemporal (IT) cortex through multiple pre-striate areas in macaque monkeys. To understand neural mechanisms for object recognition, we investigate neuronal representation of object images in area TE of macaque monkeys.

To achieve this goal, we combined single-cell recording techniques with optical imaging techniques that enable us to visualize neural activation at the columnar level. The use of imaging techniques is essential because it has previously been shown that neurons in area TE respond to geometrically less complex features rather than to more complex real objects.

Major conclusions are: (1) an object is represented by a combination of cortical columns, each of which represents a visual feature, (2) a specific combination for an object is made up of both active and inactive feature columns, (3) the feature columns do not necessarily represent spatially localized visual features, and (4) some visual features represented by columns are more related to global features, such as spatial arrangements of parts.

Christian Wallraven (Max-Planck-Institut für Biologische Kybernetik)

Object recognition: Integrating psychophysics, computer vision and machine learning

Despite several decades of research in the field of computer vision, there still exists no recognition system which is able to match the visual performance of humans. The apparent ease with which visual tasks such as recognition and categorization are solved by humans is testimony to a highly optimized visual system which not only exhibits excellent robustness and generalization capabilities but is also highly flexible in learning and organizing new data. Using an integrative approach to the problem of object recognition, we have developed a framework that combines cognitive psychophysics, computer vision, and machine learning. This framework is able to model results from psychophysics and, in addition, delivers excellent recognition performance in computational recognition experiments. Furthermore, the framework interfaces well with advanced classification schemes from machine learning, thus further broadening its scope of application.

The framework is derived from and motivated by results from cognitive research, in particular on the dynamic aspect of recognition processes, whose contribution has up to now been largely neglected in theoretical modeling of recognition. Several recent experiments in perceptual research have found the temporal dimension to play a large role in object recognition, both by mediating the learning of object representations and by providing an integral part of the representation itself. In addition, we have conducted experiments that shed light on how objects might be represented in the brain using local pictorial features and their spatial relations, thus forming a sparse and at the same time structured object representation. Based on these results, the computational implementation of the framework provides spatio-temporal processing of visual input by means of a structured, appearance-based object representation. We have shown that several experiments can successfully be modeled using this implementation, which demonstrates the perceptual plausibility of spatio-temporal, local feature processing. Based on the computational modeling results, a number of performance predictions can be made that can be tested in further psychophysical experiments, thus closing the loop between experimental work and computational modeling. We have also demonstrated the benefits of using spatio-temporal information in several computational recognition experiments on both artificial and real-world data, including experiments run on a robotic platform. In this context, we have developed a novel method for combining spatio-temporal object representations based on local features with state-of-the-art statistical learning methods (support vector machines). These experiments show that by combining efficient representations from computational vision with robust classification schemes from machine learning, excellent recognition performance can be achieved.

In this paper, we present an overview of the psychophysical and computational experiments and discuss the benefits and limitations of the proposed framework.

Song Chun Zhu (University of California, Los Angeles)

Mapping the Visual Vocabulary—A 'Genome' Project of Vision?

In a private vision quest meeting held in Maine many years ago, the artist Jennifer Mumford, wife of David Mumford, divided the audience into pairs. People who faced the screen were asked to describe the contents of a picture to those who looked away from the screen. The latter then compared the pictures visualized in their minds, based on what they heard, against the real picture on the screen. It was no surprise that a thousand English words were not enough for a picture. Simply put, we do not yet have the right words to describe natural images.

If humans do not have the right vocabulary in the first place, how do we program computers to understand a large set of images?

Many people would argue: hey, shouldn't computers learn to understand images by themselves, in an unsupervised way, especially for a discriminative task like image classification?

But, look, no parents in the world would send their kids to a school that has no teachers and no generative syllabus to explain how the world works, but only endless quizzes and final exams, unless they cannot afford any quality education for their kids.

In 2005, a number of vision researchers established a non-profit organization, the Lotus Hill Institute in Ezhou, China, and held a one-week workshop on "image ground truth", sponsored by the NSFs of both the USA and China. One primary objective of the Lotus Hill effort is to map the visual vocabulary in a huge set of natural images and to develop a common generative model for images. This work could be compared in spirit to the Genome project in biology, though on a much smaller scale due to the lack of funding. The institute currently has 37 employees, including 17 artists for image annotation. This poster will report some examples of the parsed images. We call for broad collaborations and contributions from the CVPR community.

Acknowledgement: David Mumford, Alan Yuille, Harry Shum, Fei-Fei Li, David Martin, Charless Fowlkes, Yanxi Liu, ..., and many others in China: Xiaoming Fan, Zhenyu Yao, Liang Lin, Tianfu Wu, Feng Min, ... We welcome more people to participate. The website of the Lotus Hill Institute is at www.lotushill.org.
