International Media Technology Workshop on Abstract Perception - Proceedings

International Media Research Foundation

Japan Advanced Institute for Science and Technology

January 12-14, 1994

Edited by David Rosenthal

_______________________________________________________________________________________________________

Introduction

Recent research has resulted in a number of computational systems that emulate some aspect of human perception. Most of these can be broadly classified as machine vision or machine audition systems. The purpose of the Abstract Perception Workshop was to explore a straightforward idea: that these various systems have some interesting common aspects, and that understanding these commonalities can lead to improved understanding of how perception works in humans, and how we can endow computers with similar abilities.

To further our understanding of these issues, we gathered an international group of researchers in various fields related to computer perception for three days of lectures and intensive discussion. The event consisted of two main parts. The first day was taken up with a symposium at the JAIST-Ishikawa Hi-Tech Center, at which four American researchers, two each from audition and vision, presented overviews of some of their recent research and ideas. The symposium was open to the public, and was attended by an audience of about 200, mostly drawn from the Ishikawa-area research community. The second part consisted of a closed workshop held at a nearby ryokan. Over the two days following the symposium, about twenty researchers in vision and audition essentially compared research notes and explored issues of commonality among their fields. The workshop consisted (formally) of three planned sessions plus an extraordinary session led by Professor Nawab of Boston University.

This document consists of summary accounts of the symposium and the four sessions. It also includes some notes from individual interviews that were conducted with most of the participants, followed by some of my own conclusions about overall directions and future work.

[Note: The diagram on the cover is a representation of the sound of a clarinet being interrupted by a can dropping on the floor, produced by Dan Ellis' dspB sound understanding system.]

Acknowledgments

I'd like to sincerely and gratefully acknowledge the contributions of the following people:

IMRF Board of Directors Chairman Minoru Yoshino, Secretary General Seichi Koizumi, and Dr. Edward David, for their generous support and encouragement of this project,

Prof. Susumu Horiguchi and the JAIST staff for chairing and running the symposium, for helping to organize the workshop, and for able and indefatigable support throughout the planning and execution of this event,

Mr. Kazuo Ohno, Mr. Keiji Yumiyama, Ms. Yoshiko Sasaki, Ms. Yoko Koizumi, for well-thought-out administration and numerous behind-the-scenes activities that were crucial to the success of the workshop,

Mr. Kunio Kashino, for the contribution of his ideas and knowledge of the Japanese research world, and for help during the workshop,

Mr. Dan Ellis and Mr. Alan Ruttenberg, for their constant help and encouragement,

Professor Takashi Matsuyama and Professor Seiji Inokuchi, for sage advice during the planning stages and invaluable support,

Symposium presenters Professor Edward Adelson, Professor Aaron Bobick, Professor Hamid Nawab, Dr. Thomas Knight, for their interesting and well-planned presentations that made the symposium a success,

The workshop participants, who made time in their busy schedules to travel to and participate in the workshop, for their stimulating and insightful contributions.

Symposium Summaries

Representation in Dynamic Scenes

Prof. Aaron Bobick, MIT Media Lab

Most vision research to date has used static images as input. In this talk Prof. Bobick considered some of the representational issues that come up when we try to understand dynamic images - images which change over time.

We'll consider three application domains: autonomous navigation (where the world is static and the observer is moving), ballet choreography (fixed observer, moving world), and automatic interpretation of (American) football plays (where both observer and world are in motion).

Part I: Autonomous Navigation:

The goal in this project was to develop a system which could use vision to build a structural description of the world which was robust enough to use for navigating a land vehicle. The basic system is as follows: The autonomous land vehicle (ALV) is provided with a rough 3D model of the terrain to be navigated. The model contains gross terrain features such as large trees or rocks; it's derived from a stereoscopic aerial photograph. Mounted on the vehicle is a range finder which provides range data to the navigation system every half second. The system uses the range finder data to fill in the rough model with enough detail about smaller features (rocks, bushes, etc.) to navigate.

The basic idea is to "integrate" the range data over time to avoid gross errors. An aggressive segmenting system is used, i.e., whenever a range image indicates that an object might be present, the system conditionally hypothesizes that it's there; if it is in fact present, its existence will be confirmed in subsequent images, otherwise the hypothesis can be discarded.

This means that a key element of this system is "keeping track" of everything - knowing, for example, whether a blob detected in some image is one that we've detected in a previous image, or a new one.
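The hypothesize-and-confirm bookkeeping described above can be sketched as follows. This is only an illustration, not the ALV system's code: the `Hypothesis` fields, the matching radius, and the hit and miss thresholds are all assumptions made for the example, and detections are reduced to 2D points.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A tentatively detected object (all fields are illustrative)."""
    x: float
    y: float
    hits: int = 1      # range images that supported this hypothesis
    misses: int = 0    # consecutive range images that did not


def update(hyps, detections, match_radius=1.0, max_misses=2):
    """Fold the detections from one range image into the hypothesis list."""
    unmatched = list(detections)
    survivors = []
    for h in hyps:
        # Is this blob one we have already seen? Look for a nearby detection.
        match = None
        for d in unmatched:
            if (d[0] - h.x) ** 2 + (d[1] - h.y) ** 2 <= match_radius ** 2:
                match = d
                break
        if match is not None:
            unmatched.remove(match)
            h.x, h.y = match
            h.hits += 1
            h.misses = 0
            survivors.append(h)
        else:
            h.misses += 1
            if h.misses <= max_misses:   # otherwise discard the hypothesis
                survivors.append(h)
    # Aggressive segmentation: every unexplained detection is hypothesized.
    survivors.extend(Hypothesis(x, y) for x, y in unmatched)
    return survivors


def confirmed(hyps, min_hits=3):
    """Hypotheses supported by enough range images to be trusted."""
    return [h for h in hyps if h.hits >= min_hits]


# Four hypothetical range images: a real object near the origin and a
# spurious blob at (5, 5) that appears only once.
frames = [[(0.0, 0.0), (5.0, 5.0)], [(0.2, 0.0)], [(0.3, 0.1)], [(0.4, 0.1)]]
hyps = []
for frame in frames:
    hyps = update(hyps, frame)
```

After the four images, only the object near the origin survives, and it has enough supporting images to count as confirmed; the spurious blob has been discarded.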

To handle this kind of situation we need to use object descriptions which can evolve, as shown in this figure. When an object is initially detected by the range finder, it will only be a few pixels in size; we can't say much about its attributes. As subsequent information is obtained, it can be more richly represented - first, we may get some information about its 3D position, then we may get information about its size, then its texture and shape. Still later, we start to be able to build a standard computer-graphics model of the object, where we can talk about its parts and structure.

A key notion in this system is that of stability - that is, as more sensor input about an object is obtained, it should evolve in a coherent way toward a single part model. Knowledge-based processing can enhance stability: when an object description fails to evolve in a stable manner, the system may be able to explain the problem in terms of a sensor characteristic, for example.

The ALV navigation problem involves multiple tasks, such as correcting the initially provided model, localizing the vehicle, avoiding obstacles, predicting what objects will look like from other points of view, and recognizing objects that have already been detected. Each of these tasks requires a different level of representation; to recognize an object from a different point of view, we need a detailed part model, while to localize the vehicle, a simpler representation may suffice.

Conclusions for Part I:

Part II: Dynamic Scene Annotation

Now let's look at some current work in automatically generating annotations of video sequences.

The first domain that we'll consider is the annotation of classical ballet. One reason that classical ballet lends itself to such an approach is that there are a number of strong constraints on what can happen in a scene. First of all, we know that the interesting things in the scene will be the motion of human beings, and therefore we can use the constraints of the human body: we know, for example, that limbs are attached to bodies in certain ways, have certain degrees of freedom, and so on. Another important fact about classical ballet is that there is a predefined language for describing it. Someone who is trained in ballet can unambiguously describe any classical ballet scene in terms of a known notation for choreography.

Suppose we want to represent a given motion of the body - say a motion of the leg. There are a number of alternative schemes for doing this. A problem is that each of these representations has a singular point - that is, a point in representation space, in whose region the representation breaks down. Different representation schemes have their singular points in different regions. This means that we can avoid the problems of singular points by using several representations. When we are near the singular point in one representation, we switch to another which is non-singular in that region.
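As a small illustration of switching representations near a singular point, consider encoding the direction of a limb as two angles. The talk did not say which representations were used for ballet; the two angle "charts" below are an assumption chosen only because their singular points are easy to see. Angles measured about the z axis break down when the limb points along z (the azimuth becomes undefined), while angles measured about the x axis break down when it points along x.

```python
import math


def encode(v):
    """Encode a unit direction (x, y, z) as (chart, azimuth, elevation).

    Chart "z" is singular near +/-z and chart "x" is singular near +/-x,
    so we use whichever chart is farther from its own singular point.
    """
    x, y, z = v
    if abs(z) <= abs(x):
        return ("z", math.atan2(y, x), math.asin(z))
    return ("x", math.atan2(z, y), math.asin(x))


def decode(code):
    """Recover the unit direction from either chart."""
    chart, azimuth, elevation = code
    c = math.cos(elevation)
    if chart == "z":
        return (c * math.cos(azimuth), c * math.sin(azimuth), math.sin(elevation))
    return (math.sin(elevation), c * math.cos(azimuth), c * math.sin(azimuth))
```

A leg pointing straight along z is encoded in chart "x", where it is an ordinary point, and a leg pointing along x is encoded in chart "z".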

The other domain that we'll consider is the automatic annotation of video sequences of (American) football plays. Football, like ballet, is a tightly constrained dynamic environment; the number of things that can happen is, in some sense, bounded and reasonable. Furthermore, there is actually a commercial application for a technology that can automatically figure out football plays from video; millions of dollars of video annotating and editing equipment is sold in the U.S. In the football videos available to us, both the camera and the players are moving. But we can remove the motion of the camera by stabilizing the lines on the field. We can choose some line - say, the twenty-yard line - and reprocess the video so that that line remains fixed. Another thing that we can do is transform the scene into a coordinate system where the yard lines are vertical - the "blimp's-eye" view of the stadium. The transformed video sequence is much more amenable to analysis; such tasks as finding the line of scrimmage, locating the quarterback, and so on, become much easier.
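One standard way to obtain such a "blimp's-eye" view is a plane-to-plane projective transform (homography) fitted to four known field points. The talk did not describe the method actually used, so the following is a sketch under that assumption; the pixel and field coordinates are made up for the example.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def fit_homography(src, dst):
    """Fit the 8 homography parameters from four point correspondences."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    return solve(rows, rhs)


def apply_homography(h, point):
    """Map an image point into field coordinates."""
    x, y = point
    w = h[6] * x + h[7] * y + 1
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# Hypothetical data: four field-line intersections as seen in the image
# (pixels) and their positions on the field (yards).
image_pts = [(100, 300), (500, 300), (400, 100), (200, 100)]
field_pts = [(0, 0), (40, 0), (40, 30), (0, 30)]
H = fit_homography(image_pts, field_pts)
```

Because a homography maps straight lines to straight lines, every image point on the slanted line through the first and last image points lands on the vertical field line u = 0, which is the property that makes the transformed sequence easier to analyze.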

Conclusions:

_______________________________________________________________________________________

Layered Representations in Vision

Prof. Edward Adelson, MIT Media Lab

Prof. Adelson talked about two related topics: vision and image coding. He began his talk with a short discussion of pyramid coding: representing visual information as "pyramids" of images with different amounts of detail. He pointed out that while the image coding community was not initially very enthusiastic about this scheme, it eventually became the basis for widely used commercial image coding schemes, such as that used by Kodak to represent images on CD-ROMs.

Prof. Adelson then presented the outline of vision research shown in the figure above. Low-level vision and coding, he said, is fairly well understood, and research from this area has been applied, for example, to representation schemes and compression schemes such as JPEG and MPEG. Prof. Adelson's recent interest and work is in mid-level vision.

An important mid-level vision concept is that of layers. Prof. Adelson illustrated the idea of layers with a number of visual illusions and effects.

The Kanizsa illusion shown above demonstrates one way that the vision system perceives objects that aren't really "there;" we have a strong sense that there is a white square in this figure, though there is no direct evidence for it (such as an edge). This is an example of how our vision system tends to interpret the visual world in terms of a number of overlapping layers.

Another is shown in the figure above. Here we interpret what we see as two overlapping transparent squares. Standard vision processing techniques, however, would separate the figure into three smaller figures, by grouping together pixels with similar gray levels, as shown in the next figure:

What we would like is for a vision processing system to segment the scene into two squares, as follows:

Prof. Adelson used a number of other visual illusions to illustrate the importance of layers, and the strong tendency of the human visual system to interpret scenes in terms of layers.

Prof. Adelson also pointed out that layered representations are important when one object is moving in front of another. He illustrated this point with a short video from the MPEG test suite, known as the "flower garden" sequence. In this scene, the camera is moving past a tree, a flower garden, and a house. The tree, garden, and house are all different distances from the camera, and therefore move at different velocities across the scene. Prof. Adelson explained techniques for segmenting video sequences into layers, each of which is moving uniformly with respect to the rest of the scene.

Prof. Adelson (together with John Wang, also of MIT) was able to segment the MPEG flower garden sequence into layers corresponding to the tree, garden and house. The scene can then be reconstructed by moving static images of the layers in front of the viewer in an appropriate manner. This essentially amounts to a significant compression of the video sequence, since all we need to store are the still images together with some information about how they should be moved relative to each other. The compression is not lossless, but the reconstructed scene is nevertheless quite convincing. The layered representation also allows for interesting effects such as removing the tree from the scene.
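The reconstruction step can be sketched in one dimension. This is a toy under our own assumptions (integer velocities, pure translation, opaque pixels), not the method of Adelson and Wang: each layer is a still strip of pixels plus a velocity, and a frame is produced by shifting the layers and compositing them back to front.

```python
def render(layers, t, width):
    """Composite the layers, back to front, for frame number t.

    Each layer is (pixels, velocity): pixels is a list in which None marks
    a transparent pixel, and velocity is a shift in whole pixels per frame.
    """
    frame = [None] * width
    for pixels, velocity in layers:        # back-most layer first
        offset = velocity * t
        for x in range(width):
            src = x - offset
            if 0 <= src < len(pixels) and pixels[src] is not None:
                frame[x] = pixels[src]     # nearer layers overwrite farther ones
    return frame


# Hypothetical scene: a distant background that does not move across the
# frame, and a nearby tree that moves two pixels per frame.
background = (list("abcdefgh"), 0)
tree = (["T", "T"] + [None] * 6, 2)

frame1 = "".join(render([background, tree], t=1, width=8))
no_tree = "".join(render([background], t=1, width=8))
```

Only the two still strips and the two velocities need to be stored, however many frames are rendered. Leaving the tree layer out of the list gives the "tree removed" effect mentioned above.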

Layered representations are also useful for converting frame-rates. It's often necessary to convert video or motion-picture sequences from one frame-rate to another. Traditional techniques for doing this - essentially, interpolation techniques - don't produce satisfactory results (they produce blurred or double images). Using a layered representation, frame-rate conversion can be accomplished in a more natural and effective way.

In sum: in both biological and computer vision, the problem is to generate representations which are both useful and efficient. We'd like to move beyond low-level, signal-processing-based methods, to mid-level and someday high-level methods. We can look forward to a future where these more advanced methods will play a role in coding and representing images.

_____________________________

IPUS: An Architecture for Integrated Signal Processing and Signal Interpretation

Prof. Hamid Nawab, Boston University

Prof. Nawab talked about the IPUS ("Integrated Processing and Understanding of Signals") architecture, designed and built by a group headed by himself and Prof. Victor Lesser of the Distributed Artificial Intelligence Laboratory at the University of Massachusetts, Amherst.

The IPUS system was built to process and understand natural sound, with particular focus on non-speech sounds: sounds such as doorbells, vacuum cleaners, hair-dryers, and so on that might naturally occur in some environment. Prof. Nawab contends that the basic concepts behind IPUS are applicable to other, perhaps non-auditory, domains as well.

Traditional machine perception architectures can be diagrammed as follows:

The key point here is the one-way flow of information between the "front-end" module and the "interpretation knowledge sources" module. This means that once the initial signal processing is performed, the front-end is no longer concerned with that particular input. Whatever it produces must be adequate for the subsequent higher level processes to do their job.

There are problems, however, with using a fixed front-end strategy in a real world environment. Signal-to-noise ratios vary, several objects may occur at once, signal signatures may interfere with each other, and object behavior is difficult to characterize and predict. As a result, a fixed front end (or signal processing algorithm, SPA) is not in fact adequate for real-world tasks.

Prof. Nawab illustrated some instances of how unitary SPAs break down in real-world situations. He also noted that one of the reasons for the failure of unitary SPAs is the Uncertainty Principle - in a linear signal processing system, resolution in frequency must be traded off against resolution in time. Different signal understanding tasks, however, have different time and frequency resolution requirements.

Prof. Nawab outlined a number of approaches to the problem that various investigators have used:

Prof. Nawab contends that due to the model variety problem, using multiple SPAs to try to cover every contingency is impractical - the real world consists of too many such contingencies.

The IPUS approach is control of the interaction between interpretation KSs and SPAs.

In this model, the space of interpretations and the space of SPAs are searched simultaneously. The system attempts to converge on both an SPA which produces data appropriate to the circumstances, and a consistent interpretation of the data. A key component of this strategy for dynamically finding appropriate SPAs is discrepancy detection. Prof. Nawab distinguishes three classes of discrepancies:

Prof. Nawab gave examples of these three kinds of discrepancies.

Once a discrepancy has been detected, it must be diagnosed. A diagnosis is a sequence of operators where each operator represents a signal processing effect. One operator, for example, might embody the knowledge that a certain SPA will not distinguish frequencies that differ by less than a certain amount. Such an operator corresponds to a hypothesis that certain frequencies were present in the input which are not present in the output.

A goal of differential diagnosis is to remove ambiguity. For some inputs, the sound understanding levels will produce several competing explanations. Differential diagnosis attempts to adjust the SPAs so as to converge on a single explanation. Prof. Nawab illustrated the disambiguation procedure with the following example:

In this example, we are assuming that the signal processing output (on the right) was produced either by Source A or Source B, but we don't know which. Source A has two relatively close frequency bands in the 1200 Hz range, while Source B has a band at about 1200 Hz and another, lower energy band at about 2200 Hz. Either source could have produced the output on the right - if Source A produced it, the problem is that the algorithm being used is failing to resolve the two frequency bands; if Source B produced it, the problem is that the lower energy band is below a power threshold, and consequently being eliminated.

In this case, the differential diagnosis strategy proceeds as follows:

  1. Find missing and ambiguous evidence: In this case, deduce that there may be a missing band in the 2200 Hz region, and that there is only one band, where two might be expected, in the 1200 Hz region.
  2. Determine the time frame for reprocessing - that is, determine the time boundaries of the data that the reprocessing strategy will consider.
  3. Suggest new parameter settings to resolve conflicts: In this case, lower the energy threshold in the 2200 Hz region to try to locate the missing stream, and reset the FFT size to increase the frequency resolution in the 1200 Hz region.
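Step 3 can be made concrete with a small calculation. The sketch below is not IPUS code (IPUS is written in LISP); the function names, the sample rate, the band separation, and the 3 dB margin are all assumptions for illustration. It uses the fact that the bin spacing of an N-point FFT is the sample rate divided by N, so resolving two bands a given distance apart needs a bin spacing no larger than that distance.

```python
import math


def fft_size_to_resolve(sample_rate, delta_f):
    """Smallest power-of-two FFT size whose bin spacing is <= delta_f."""
    n = math.ceil(sample_rate / delta_f)
    return 1 << (n - 1).bit_length()


def suggest_reprocessing(sample_rate, fft_size, threshold_db,
                         band_separation_hz, weak_band_level_db):
    """Propose new front-end parameters for the two competing explanations."""
    plan = {}
    # Source A hypothesis: two close bands merged by a too-coarse FFT.
    if sample_rate / fft_size > band_separation_hz:
        plan["fft_size"] = fft_size_to_resolve(sample_rate, band_separation_hz)
    # Source B hypothesis: a weak band removed by the energy threshold.
    if weak_band_level_db < threshold_db:
        plan["threshold_db"] = weak_band_level_db - 3.0   # assumed margin
    return plan


# Hypothetical settings: 8 kHz sampling, 64-point FFT (125 Hz bins),
# bands expected 40 Hz apart, threshold at -30 dB, weak band near -36 dB.
plan = suggest_reprocessing(8000, 64, -30.0, 40.0, -36.0)
```

With these numbers the suggested FFT size is 256, which narrows the bins to 31.25 Hz but lengthens the analysis window from 8 ms to 32 ms. That is the time-frequency trade-off noted earlier, and it is one reason the reprocessing is confined to a chosen time frame (step 2).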

Prof. Nawab concluded with a few overall facts about the program (it consists of 1500 Kb of LISP code; the source library has 35 real-world sources such as hair dryers, footsteps, fire alarms, etc.). He also stated that while the principal application of the IPUS architecture to date has been in the audio domain, it has also been applied to the radar domain, and he believes that it could be appropriate to other domains as well.

_____________________________

Lessons in Perception from Mammals

Tom Knight, MIT AI Lab

There are a number of tasks that humans (and other mammals) are able to do, that we'd like to understand better:

  1. Speech understanding, particularly speech understanding under natural conditions - the cocktail party effect, speech in the presence of noise.
  2. Object identification - identify sounds such as telephones.
  3. Localization - figure out where a sound is coming from.
  4. Navigation - use sound to reconstruct the environment so that we can find our way around in it.
  5. Sounds as perceptual objects - auditory scene analysis.

While many techniques emphasize regenerability, it should be pointed out that what we really want to do is discard most of the information, keeping only what's important. This is a non-obvious task; for example, the fact that one can discard some part of a speech signal and still have it understood by a human listener does not necessarily mean that the discarded part wasn't important.

Most current audio processing efforts - and in particular, most speech understanding efforts - make exclusive use of linear processing schemes: FFTs or constant-Q filter banks. Note that linear schemes will always have to contend with the uncertainty principle: resolution in the time domain must be traded off against resolution in the frequency domain.

What, on the other hand, do we know about mammalian auditory physiology?

Though it seems obvious, we should remember that we have two ears. (Most current speech understanding research is monaural.)

The mechanical part of our hearing system can be divided into three stages: The outer ear, which acts as a passive filter and is useful for determining source location, the middle ear, which may be thought of as an impedance transformer together with an automatic gain control (the stapedius muscle), and the cochlea.

The cochlea may be thought of as a cylinder divided lengthwise into two fluid-filled chambers, separated by the basilar membrane. The basilar membrane is narrow and thick at one end, and wide and thin at the other end, meaning that certain sections of it are tuned to certain frequencies. It's therefore possible to think of the structure as a mechanical filterbank.

The tuning curves for this filterbank can be experimentally measured. The filters are band-pass with extremely sharp cutoff at the high end, and very flat cutoff at the low end.

The low end cutoff of each tuning curve is not fixed, due to a phenomenon called two-tone inhibition: if two tones are within the pass band of one of the filters, the filter will only respond to the higher frequency tone. The place theory of hearing says that the perceived frequency of a sound depends on which nerves attached to the basilar membrane are most active.

If we look at the firings of the nerves which are attached to the basilar membrane, we find that the periods of the nerve firings tend to match the peak amplitude times of the incoming sound - in other words, the frequency information, in addition to being encoded in which nerve is firing (the place theory), is also encoded in how often the nerve is firing - the period theory. What follows are some speculations as to how this system can finesse some of the limitations imposed by the Uncertainty Principle.
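The period-theory idea can be illustrated with an idealized calculation: if "firings" occur at the peaks of the waveform, the frequency can be read off from the intervals between them. Real nerve firings are stochastic and a single fiber does not fire on every cycle, so the following is only a sketch; the sample rate and tone are chosen for convenience.

```python
import math


def peak_times(samples):
    """Indices of local maxima, standing in for idealized firing times."""
    return [n for n in range(1, len(samples) - 1)
            if samples[n - 1] < samples[n] >= samples[n + 1]]


def frequency_from_peaks(peaks, sample_rate):
    """Estimate frequency from the mean interval between successive peaks."""
    mean_interval = (peaks[-1] - peaks[0]) / (len(peaks) - 1)
    return sample_rate / mean_interval


sample_rate = 44100
tone_hz = 441.0        # exactly 100 samples per period at this sample rate
samples = [math.sin(2 * math.pi * tone_hz * n / sample_rate)
           for n in range(1000)]
peaks = peak_times(samples)
estimate = frequency_from_peaks(peaks, sample_rate)
```

Note that this estimate needs only a few periods of the signal and involves no analysis window, which is why timing information is a candidate for side-stepping the usual time-frequency trade-off.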

Suppose we have two signals, A and B, in the pass band of one of the cochlear filters, where the frequency of A is lower than that of B. B, in other words, is closer to the sharp high-frequency cutoff of the filter. The filter will respond more quickly to A than to B.

Now consider the case of two overlapping filters, X and Y, where X's cutoff is lower than that of Y. We apply signal A, which is in the passband of both filters. A is closer to the band edge of filter X than to that of filter Y; therefore Y will respond more quickly to it than X will. Note that A's frequency will be encoded on the fibers attached to both filters.

Now consider the case where we have two overlapping filters, X and Y, (with X's cutoff lower than that of Y, as before), and signals A and B, where A is in the passband of X and Y, and B is in the passband of Y, but beyond the cutoff of X. Because of two-tone inhibition, Y will not respond to signal A. In other words, Y responds only to B, and X responds only to A. In both cases, the frequency is encoded in the nerve firing rates. Also, in both cases, the response will be relatively slow, since each signal is close to the cutoff frequency of the filter that's reporting it.

In conclusion: we can, at least in a certain sense, have both high frequency resolution and high temporal resolution, though the temporal resolution is lower in the case of two signals that are close in frequency.

Workshop Session Summaries

Introduction

This is a series of summaries of each session of the Abstract Perception Workshop, 1/13/94 - 1/14/94.

Session 1: Commonalities (Moderator: S. Horiguchi)

What do vision and audition have in common? What do different stages of visual (aural) processing have in common? Why might these processes be common (for example, do they share a common genetic root, are they similar in form because they solve a similar problem)?

S. Ahmad, Interval Research

Abstract Neural Perception

Dr. Ahmad began by observing that the auditory and visual cortices have a similar anatomical structure, that in particular they both use topographic maps. He then described some work published in 1988 by Sur, Garraghty, and Roe, in which the optic nerves of newborn ferrets were rerouted to their auditory cortices. In this experiment, a "working" connection is formed, and as the ferret matures, structures are formed in the auditory cortex that are normally found in the visual cortex: on-center detectors, motion detectors, and orientation detectors. Ahmad summed up this part of his talk by suggesting that our question, "how are vision and audition the same" should be rephrased, "how are they different?"

Dr. Ahmad then proposed that given that the two areas appear to be very similar, we might look for common principles by which they operate. In this connection, Dr. Ahmad suggested that we look at certain connectionist research: in particular, the Hebb rule/maximal information preservation, and Kohonen maps. Dr. Ahmad observed that when a visual stimulus is input to a Kohonen map, structures are developed which resemble those actually present in the visual cortex.
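For readers unfamiliar with Kohonen maps, the update rule can be sketched on a toy problem. This is not the visual-stimulus experiment Dr. Ahmad described: the inputs here are single numbers between 0 and 1, and the map size, learning-rate schedule, and neighborhood schedule are assumptions. For each input, the unit whose weight is closest "wins", and the winner and its neighbors on the map move their weights toward the input.

```python
import math
import random


def train_som(inputs, n_units=10, epochs=20, seed=0):
    """Train a one-dimensional Kohonen map on scalar inputs."""
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(n_units)]
    total = epochs * len(inputs)
    step = 0
    for _ in range(epochs):
        for x in inputs:
            frac = step / total
            rate = 0.5 * (1 - frac)                        # decaying learning rate
            radius = max(0.5, (n_units / 2) * (1 - frac))  # shrinking neighborhood
            # Winner: the unit whose weight best matches the input.
            winner = min(range(n_units), key=lambda i: abs(weights[i] - x))
            for i in range(n_units):
                # Units near the winner on the map are pulled along with it.
                h = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                weights[i] += rate * h * (x - weights[i])
            step += 1
    return weights


rng = random.Random(1)
inputs = [rng.random() for _ in range(200)]
weights = train_som(inputs)
```

Because neighbors are updated together, nearby units typically end up responding to nearby input values, which is the topographic property that makes these maps a candidate model for cortical structure.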

Dr. Ahmad then mentioned some results concerning various low-level visual learning experiments, such as learning to discriminate orientations and learning to read at an angle. Several of the vision researchers present were familiar with these experiments, and there was some discussion of the issue.

Dr. Ahmad then went on to mention the "binding problem," that is, the problem of integrating information across several topographic maps. He stated that attention is thought to play a role in this connection. There appear to be particular areas of the brain that play a role in attention - the pulvinar and the superior colliculus. This mechanism may be the same for both audition and vision.

_______________________

M. Akagi, JAIST

Modeling of Auditory Segregation

Prof. Akagi began by observing that while sound-spectrogram-style representations are the norm in auditory modeling research, the transform or filterbank representations used involve averaging operations, so some of the detail present in the original data is not evident in the spectrogram representation. Prof. Akagi therefore proposes a sound-source separation technique which looks at deviations in the zero-crossing points of the original input sound. If we assume that the input is two mixed sine waves (of unknown frequency, phase, and magnitude), we can estimate their frequencies by numerically analyzing the input in the time domain. Prof. Akagi proceeded to demonstrate how this is done.
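The details of Prof. Akagi's zero-crossing method are not recorded here, so the sketch below substitutes a different, well-known time-domain technique (Prony-style linear prediction) simply to show that the frequencies of two mixed sine waves can be recovered without a spectrogram. It relies on the fact that a noise-free sum of two sinusoids satisfies x[n] + x[n-4] = p (x[n-1] + x[n-3]) - q x[n-2], where p = 2 (c1 + c2), q = 2 + 4 c1 c2, and c1, c2 are the cosines of the two angular frequencies per sample. The test signal is hypothetical.

```python
import math


def estimate_two_frequencies(x, sample_rate):
    """Estimate the two frequencies (Hz) in a sum of two sine waves."""
    saa = sab = sbb = say = sby = 0.0
    for n in range(4, len(x)):
        a = x[n - 1] + x[n - 3]
        b = -x[n - 2]
        y = x[n] + x[n - 4]
        saa += a * a
        sab += a * b
        sbb += b * b
        say += a * y
        sby += b * y
    # Least-squares solution of y = p * a + q * b (2x2 normal equations).
    det = saa * sbb - sab * sab
    p = (say * sbb - sab * sby) / det
    q = (saa * sby - sab * say) / det
    # c1 and c2 are the roots of c^2 - (p / 2) c + (q - 2) / 4 = 0.
    half_sum = p / 4.0
    root = math.sqrt(max(half_sum ** 2 - (q - 2.0) / 4.0, 0.0))
    freqs = []
    for c in (half_sum + root, half_sum - root):
        c = max(-1.0, min(1.0, c))          # guard against rounding error
        freqs.append(math.acos(c) * sample_rate / (2 * math.pi))
    return sorted(freqs)


sample_rate = 8000
x = [1.0 * math.sin(2 * math.pi * 440 * n / sample_rate + 0.3)
     + 0.6 * math.sin(2 * math.pi * 1000 * n / sample_rate + 1.1)
     for n in range(200)]
low, high = estimate_two_frequencies(x, sample_rate)
```

The amplitudes and phases never need to be known, and only 200 samples (25 ms) are used. With noisy input, a plain least-squares fit of this kind degrades quickly, which is one reason to combine time-domain and spectrogram techniques.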

Prof. Akagi concluded by suggesting that better methods for auditory segregation might be possible if we can integrate time-domain techniques with the more usual spectrogram techniques. As far as commonalities go, Prof. Akagi concluded that vision and audition are not that similar at the lower levels of processing, where, he believes, sound source separation takes place.

[Prof. Matsuyama commented that vision and audition might be more similar when the time axis is taken into account in vision (as is often not the case), and that in this case, a technique similar to Prof. Akagi's might be useful in vision processing.]

_________________________

H. Ando, ATR Human Information Processing Laboratories

Interpretation of Images and Sounds

Dr. Ando began by observing that perception can be considered at three levels, following (Marr, 1982): computational theory, representations and algorithms, and hardware implementation.

Dr. Ando then proceeded to discuss relations between vision and audition from each of these three points of view.

Computational Theory:

The environment - natural objects and events, written and oral language, art, music, etc. - sends input to both the visual system and the auditory system. The visual input consists of light intensities and color naturally represented in 2D space and time, and the sound input consists of sound pressure waves, naturally represented as amplitude and time. Dr. Ando notes that these primary representations of images and sounds are basically different; they have a different dimensionality, for example. The goals of the visual and auditory systems, on the other hand, are similar; in both cases we are trying to interpret the information so as to take actions appropriate to the environment.

Representations and Algorithms:

The level of representations and algorithms may be further divided into three sublevels: low-level, mid-level, and high-level processing.

Low-level processing:

In the initial stages of vision processing, there is a local smoothing of spatio-temporal derivatives; at this stage the system also detects some discontinuities, fills in some missing data, and segments the scene into transparent layers. Dr. Ando conjectures that local smoothing of temporal derivatives may also play a part in low-level audio processing, and that the correlate of the filling-in and transparency operations would be sound source separation.

Dr. Ando then demonstrated the notions of filling-in and discontinuity detection in vision by briefly explaining some work in the interpretation of visual motion (Geman and Geman, 1984); the work concerns motion detection processes thought to be part of low-level vision.

Mid-level processing:

In vision, this level is concerned with the reconstruction of the 3D structure and motion of objects, using such cues as stereo, motion, shading and texture and such physical constraints as smoothness and rigidity. We also reconstruct the surface properties of the objects in the scene, and separate reflectance from illumination. In audition, at this stage, we reconstruct the location and motion of sound sources, and attempt to reconstruct their material properties - the sounds of breaking, cracking, crunching, tapping, and so on, for example, are clues as to whether the material in question is wood, steel, glass, rubber, etc. Dr. Ando was unsure as to the role of physical constraints in audition. [This question generally provoked a good deal of discussion during this session.]

Dr. Ando then illustrated mid-level vision processing with some of his own work on recovering surface structure from motion.

High-level processing:

In vision, one high-level processing task is the recognition and categorization of 3D objects. This involves extracting information from an object's image which is invariant across different viewpoints, and both supervised and unsupervised learning. Another high-level vision task is active vision - visually guided actions such as reaching, grasping, manipulating, navigating, and so on.

In audition, the corresponding task is the identification of sound sources. This, too, involves the extraction of invariant features, and various kinds of learning. We can also talk about active audition, that is, using auditory information to guide action: friend/foe recognition, navigation, etc.

Dr. Ando then outlined an approach to visual 3D object recognition and invariant feature extraction.

Hardware Implementation:

In this connection Dr. Ando mentioned work by (Sur, Pallas and Roe, 1990), which is a follow-up to the work on ferrets mentioned by Dr. Ahmad.

___________________________

D. Ellis, MIT Media Lab Perceptual Computing Group

Computer Models of Perception: Aspects Common to Hearing and Vision

Mr. Ellis began his presentation with a discussion of the organization of data in hearing. The overall data path leads from a continuous acoustic mixture to inferences about the Real World (RW). Auditory processing starts with an initial time-frequency analysis (and possibly other initial processing, cf., Knight, Akagi), which is then organized into components corresponding to sound sources. Heuristic principles used in this organizational process include common onset, common modulation, common periodicity, common spatial origin, and match to a specific known pattern.

Mr. Ellis presented an overall model of psychoacoustic grouping; the scheme first re-represents the sound, using a constant-Q filterbank that models the cochlea, as a set of tracks, which are contiguous high-energy regions in time-frequency space. The tracks are then hierarchically grouped in such a way that the top-level groupings correspond to RW sound sources.

Mr. Ellis then proceeded to "abstract" this model, that is, to construct a generic perceptual model of which the psychoacoustic grouping scheme may be considered a (partial) instantiation.

Generic perceptual processing

Mr. Ellis suggests that the underlying reason behind the similarities in vision and audition is that both senses are aimed at maintaining an internal model of the world.

In the ensuing discussion, one of the questions raised was how closely machine realizations of perception match human perceptual processing. This led to the related question, "how many different ways are there of accomplishing perceptual tasks?" - so for example, if there's only one, then any successful perceptual model must be essentially the same, and successful machine models will necessarily work in the same way as human perception.

___________________________

Y. Hiraga, University of Library and Information Science

Preliminary Review of Previous Talks

Prof. Hiraga took the opportunity to reconsider the discussion that had taken place up until the time of his talk.

Prof. Hiraga first suggested that we had not yet found the right level of discussion to talk about similarities and differences in perceptual modes. He suggested that the level of abstraction considered by Ahmad - the level of neurons - was too low. Rosenthal's level, on the other hand, was too abstract.

Vision and audition, Prof. Hiraga suggested, might be inappropriate examples for comparison of perceptual modes. In vision, what we sense is a passive property of the perceived object - the way that it reflects light. In audition, on the other hand, the stimulus that gives rise to the sense comes from the vibration of the source.

Prof. Hiraga suggested that at the lower levels of perception, the modes were quite different. Commonalties might arise at higher levels, where invariant features are extracted. Many investigators (e.g. Gibson) have pointed this out.

There may be some connections at intermediate levels, as evidenced by certain analogies (e.g., a "bright" sound). Prof. Hiraga did not think that these kinds of connections would prove very interesting, however.

__________________________

S. Shimotsuji, Toshiba Information and Communication System Laboratories

No Domain Independence in Practical Systems

Mr. Shimotsuji contended that "real" machine perception systems - those which have practical applications - make such heavy use of the context of the application that there cannot be any domain independence.

Mr. Shimotsuji illustrated these points with two systems developed at Toshiba. The first system is one which automatically reads cable-routing diagrams. Electricity providers in large cities maintain extensive diagrams of underground cables, and they would like for these diagrams to be machine-readable. The Toshiba system leverages constraints on the format of these diagrams to determine the meaning of their various features. The second application is one which interprets video images taken from a camera in a supermarket. This system also makes use of information particular to the environment in question: which parts of the scene can be expected to be stationary, which parts move, and so on.

Practically successful systems are those in which the system essentially classifies the input in terms of known patterns. The basic problem facing the builders of such systems is to extend them so that they cover enough cases to meet the demands of the user.

__________________________

Session 2: Knowledge (Moderator: E. Adelson)

How does knowledge aid perception? How do we integrate signal processing and symbolic processing modes? How do we integrate top-down and bottom-up processing?

T. Inui, Kyoto University

How Does Knowledge Aid Perception?

Prof. Inui stated that the basic problem in image processing is the reconstruction of the 3D world from 2D images. He pointed out that this problem is generally ill-posed - i.e., there is not actually enough information in 2D images to reconstruct the 3D objects which gave rise to them.

The fundamental approach taken by Prof. Inui to this problem is to minimize an energy function of the form:

Here it is assumed that we have a 3D model of a visual scene which should match the scene actually presented to the vision system; the degree to which it fails to match (if, for example, the 3D model would predict a bright spot in an area where no bright spot is actually detected) is represented in the "data fitting" term. We also make some assumptions about the nature of the 3D objects involved, for example, that they are relatively smooth. The degree to which the model breaches this assumption is represented in the "smoothness constraint" term. λ represents the relative weight given to the two considerations.
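
The equation itself did not survive in this copy of the proceedings. A standard regularization form that is consistent with the description above would be the following. The symbols E, S (the 3D scene model) and I (the observed image), the squared-norm form of each term, and the placement of the weight λ on the smoothness term are all assumptions made for illustration; R denotes a forward optics function that predicts an image from the model.

```latex
E(S) \;=\; \underbrace{\lVert I - R(S) \rVert^{2}}_{\text{data fitting}}
      \;+\; \lambda \, \underbrace{\lVert \nabla S \rVert^{2}}_{\text{smoothness constraint}}
```

Under this reading, a large λ favors smooth reconstructions even when they fit the image less well, and a small λ favors fidelity to the image data.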

Prof. Inui showed a complex map that he and his colleagues have developed of the visual neural pathways of the brain. He noted that all of the connections between various processing centers represented in the map are reciprocal - that is, the data flows in both directions.

Prof. Inui's working framework is that the complete inverse optics function, R^-1, doesn't exist, and that the brain therefore calculates an approximate inverse optics function, R#. From this, the brain develops an internal model of the visual world, from which it can compute a forward optics function, R, whose result can be checked against the original image. The fact that there are two computational directions involved in this model (R# and R) is consistent with the bi-directionality observed in the neural pathways.

Prof. Inui has implemented a model of vision along these lines, which exhibits robust performance on shape-from-shading tasks.

Prof. Inui concluded this part of his talk with the following observations:

  1. In visual computation, physical law is one of the constraints on the inverse mapping problem.
  2. There are feedforward and feedback connections among visual areas.
  3. The working assumption is that the feedforward and feedback connections correspond to approximate inverse optics and optics maps.
  4. A Bayesian estimation of the structure is realized through feedforward and feedback connections, and intrinsic connections in the visual area.
  5. Filling-in and line continuation processes are fundamental in the visual system.

In the second part of his talk, Prof. Inui illustrated the role of knowledge in vision with descriptions of an experiment by (Gilchrist, 1977), and a number of optical illusions.

[In the ensuing discussion, Prof. Matsuyama asked how λ is calculated in the energy minimization equation. Prof. Inui replied that he didn't know, but that the brain apparently computes it dynamically, based on an estimate of the amount of noise present.]

__________________________

K. Kashino, Tanaka Laboratory, University of Tokyo

Sound Source Separation and Knowledge

Mr. Kashino began his talk with the question, "why do we need knowledge?" Some answers that he gave are:

  1. Perception problems are underconstrained, in the sense that the properties of the world that we wish to reconstruct are not regenerable from the input data alone. Use of knowledge is one way in which we fill in the gaps.
  2. Use of knowledge permits us to perform robust processing in the presence of noise.
  3. Since there is no ideal, fixed strategy for front-end processing, sensing parameters must be dynamically adjusted on the basis of knowledge of the environment.

Mr. Kashino then outlined some approaches to knowledge representation used in machine perception systems, e.g.:

  1. Declarative representations, which include symbolic schemes such as graphs, frames, if-then rules, and so on, as well as stored patterns against which input data may be matched.
  2. Procedural or planning schemes.
  3. Distributed representations of knowledge, such as neural networks or multiple-agent models.

Thirdly, Mr. Kashino posed the question, "How can we control knowledge-based processing?" He outlined three approaches:

  1. Top-Down/Bottom-up, as in blackboard systems.
  2. Methods for dealing with uncertainty, such as fuzzy logic or Dempster-Shafer belief accumulation.
  3. Handling non-monotonicity by using hypothetical reasoning.

Mr. Kashino then illustrated his points by describing the sound source separation system that he is working on. He contends that perceived sound is organized hierarchically, as in the following diagram:

At the lowest (leaf) levels of such a hierarchy, sound may be described in terms of parameters (e.g., this much energy in this time-frequency region). The problem of separating the sound according to source may then be seen as one of obtaining parameters for the individual sounds from the set of parameters for the composite sound. Mr. Kashino proposes a model of sound source separation represented in the following diagram:

A good deal of the knowledge necessary for sound source separation is in the form of "cues" that several components should be grouped together. Some useful cues are:

Mr. Kashino then illustrated his scheme for clustering sound components, which uses the Dempster-Shafer rule. His clustering methods are consonant with psychoacoustical data obtained from experiments that Mr. Kashino ran. Finally, Mr. Kashino discussed the results of some benchmark tests of his system, and outlined some future directions.
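
As a rough illustration of how the Dempster-Shafer rule can accumulate evidence from several grouping cues, the following minimal sketch combines two mass functions over the question of whether two sound components belong to the same source. It is not Mr. Kashino's implementation: the function name `combine`, the cue names, and the mass values are all hypothetical.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1],
    and the masses of each function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            # Agreeing evidence supports the intersection of the two sets.
            combined[common] = combined.get(common, 0.0) + mass_a * mass_b
        else:
            # Contradictory evidence is set aside and normalized away.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    return {h: m / (1.0 - conflict) for h, m in combined.items()}


# Hypothetical frame of discernment for a pair of sound components.
SAME = frozenset({"same source"})
DIFF = frozenset({"different sources"})
EITHER = SAME | DIFF  # mass placed here expresses ignorance

# Hypothetical evidence from two cues (values chosen only for illustration).
onset_cue = {SAME: 0.6, EITHER: 0.4}
harmonic_cue = {SAME: 0.7, DIFF: 0.1, EITHER: 0.2}

belief = combine(onset_cue, harmonic_cue)
```

With these illustrative numbers the combined mass on "same source" is 0.82 / 0.94, or about 0.87, which is higher than either cue supplied on its own.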

____________________________

Frank Klassner, University of Massachusetts, Distributed Problem Solving Laboratory

Where Are We Putting The Knowledge?

Mr. Klassner posed the following question about knowledge: where, in various styles of machine perception systems, is knowledge embodied? Possible answers include:

  1. The "Hearsay II" approach - or the idea of an interpretation knowledge source (KS). A KS is a process which can generate a new hypothesis that explains the input data. For example, one knowledge source might hypothesize that the input data contains a certain word, if other knowledge sources have already hypothesized that some of the constituent syllables of the word are present. The Hearsay II system attempted to compensate for inadequacies (noisiness, missing data, etc.) of the input data through the use of KSs.
  2. We can make use of domain knowledge to make the (non-dynamic) initial processing of the input (e.g., the parameters of a filterbank or a smoothing function) as appropriate as possible. These initial processing methods are referred to as SPAs (for signal processing algorithm). Mr. Klassner places work by Rodney Brooks in this second category.
  3. We can concentrate on controlling the application of the KSs - that is, we try to be as intelligent as possible as to which KS to apply in a given situation. An example of this approach is the RESUN system of Dr. Norman Carver.
  4. We can concentrate on controlling the application of SPAs.
  5. We can concentrate on controlling the interaction between SPAs and KSs.

The IPUS (Integrated Processing and Understanding of Signals) system on which Mr. Klassner has worked follows this strategy. KSs - that is, programmatic embodiments of knowledge of the domain, such as a program that generates words from strings of syllables - are used to control the initial processing of the input.

_______________________

T. Matsuyama, Okayama University

Cooperative Spatial Reasoning for Image Understanding

Prof. Matsuyama began his presentation by outlining the types of information used for understanding images. He divided these into two groups: first, attributes of objects, which includes brightness, color, texture, shape, and location, and second, relations between objects: spatial relations, part-of relations, specialization/generalization relations, similarity relations, temporal information, and causal relations.

Relations between objects constrain the interpretation of a scene. The process of detecting features and then determining their relationship is one which is used at all levels of vision processing.

Implicit in this scheme, however, is the idea that image features will be accurately detected. What kinds of schemes can support the possibility of a feature's not being properly detected? One such scheme is a multi-agent, or cooperative reasoning scheme. Prof. Matsuyama noted that the same kind of problem arises in audio processing, and the same kind of solution has been shown to be useful.

Prof. Matsuyama addressed the question "How do we integrate bottom-up and top-down reasoning?" He illustrated two kinds of processes that are part of the SIGMA vision system. In the first, the "bottom-up" case, a feature s is detected which is normally in a certain spatial relation to a feature f(s). Furthermore, a feature t with the appropriate characteristics is detected in the position where f(s) is expected. We then establish the relation REL(s,t).

In the second, "top-down" case, the feature s is detected, but no corresponding t is detected. Furthermore, another detected feature, u, also expects t to be in the same place. We then hypothesize the existence of t. We may, in turn, lower some detection threshold in the region where t is expected.

Prof. Matsuyama illustrated this principle in the context of an aerial photograph interpretation problem, where a house and a road played the roles of s and u, respectively, and a difficult-to-detect driveway played the role of t.
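
The two cases can be sketched in a few lines of code. This is only a toy rendering of the idea, not the SIGMA system: the function `reason`, the grid positions, and the rule that two independent expectations are required before a feature is hypothesized are assumptions made for the example.

```python
def reason(detected, expectations):
    """Split expectations into confirmed relations and top-down hypotheses.

    detected:     dict mapping a position to the kind of feature found there
    expectations: list of (source, expected_kind, expected_position) tuples,
                  one for each feature that a detected source predicts
    """
    relations = []  # bottom-up: the expected feature was actually detected
    support = {}    # (kind, position) -> sources whose expectation is unmet
    for source, kind, position in expectations:
        if detected.get(position) == kind:
            relations.append((source, kind, position))
        else:
            support.setdefault((kind, position), []).append(source)
    # Top-down: hypothesize a missing feature only when at least two
    # detected features independently expect it in the same place.
    hypotheses = {t: s for t, s in support.items() if len(s) >= 2}
    return relations, hypotheses


# Hypothetical aerial scene: a house and a road are detected, and both
# expect a driveway at the same (undetected) position.
detected = {(2, 3): "house", (2, 0): "road"}
expectations = [
    ("house", "driveway", (2, 1)),
    ("road", "driveway", (2, 1)),
    ("house", "road", (2, 0)),
]
relations, hypotheses = reason(detected, expectations)
```

Here the house-road relation is established bottom-up, while the driveway appears only as a hypothesis. A fuller system could then rerun its detector with a lower threshold at each hypothesized position.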

Finally Prof. Matsuyama presented some recent research on using a cooperative-agent scheme to segment a scene into regions. He illustrated two cases, in which there were two different cooperation schemes; the one where agents were better informed about each other's state segmented the scene in a more satisfactory manner.

_________________________

Y. Sakaguchi, Department of Mathematical Engineering and Information Physics, University of Tokyo

Sensory Integration, Active Perception and Knowledge

Mr. Sakaguchi began his presentation by stating that his research goal is the computational modeling of the human brain, rather than the engineering of the system with the best-possible performance. This may lead, he believes, to a system with "human failings" as well as human abilities.

Mr. Sakaguchi's interest is in sensory integration and active perception. In active perception, an array of sensors of various types is used in a goal-driven manner by a mechanism which is trying to establish particular facts about the outside world. This mechanism corresponds, in an obvious way, to the mechanism of attention.

Mr. Sakaguchi's model constructs an internal image of the presented object. The internal image has an entropy, or ambiguity value associated with it. The model decides which sensor to use according to the "mutual information" criterion, that is, according to the criterion of attempting to reduce the entropy of the internal image of the presented object.
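
One way to read the "mutual information" criterion is that the model picks the sensor whose reading is expected to reduce the entropy of the internal image the most. The sketch below illustrates that reading for a discrete set of candidate objects. It is an assumption-laden toy, not Mr. Sakaguchi's model: the function names, the sensor names, and the likelihood tables are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def posterior(prior, likelihood, reading):
    """Bayes update, where likelihood[obj][reading] = P(reading | obj)."""
    joint = [prior[i] * likelihood[i][reading] for i in range(len(prior))]
    total = sum(joint)
    return [j / total for j in joint]


def expected_information_gain(prior, likelihood):
    """Mutual information between the object identity and a sensor reading."""
    gain = entropy(prior)
    for reading in range(len(likelihood[0])):
        p_reading = sum(prior[i] * likelihood[i][reading]
                        for i in range(len(prior)))
        if p_reading > 0:
            gain -= p_reading * entropy(posterior(prior, likelihood, reading))
    return gain


def choose_sensor(prior, sensors):
    """Pick the sensor expected to reduce the entropy of the belief most."""
    return max(sensors,
               key=lambda name: expected_information_gain(prior, sensors[name]))


# Two equally likely candidate objects and two hypothetical sensors.
prior = [0.5, 0.5]
sensors = {
    "touch": [[0.9, 0.1], [0.1, 0.9]],   # readings depend on the object
    "vision": [[0.5, 0.5], [0.5, 0.5]],  # readings carry no information
}
best = choose_sensor(prior, sensors)
```

In this example the "touch" sensor is chosen, since the "vision" readings are the same for both objects and so cannot reduce the ambiguity of the internal image.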

Mr. Sakaguchi illustrated these principles with two systems that he has worked on. One system is a haptic recognition system, which attempts to distinguish among various kinds of materials using touch: a sensor attached to a robot arm rubs the material; various sensors from the robot are input into a model of the kind shown above. The other is an active vision system; it attempts to recognize figures by moving an optic sensor to an appropriate position.

In conclusion, Mr. Sakaguchi surveyed the role of knowledge in sensory integration and active perception.

  1. Though it seems obvious, it should be pointed out that such a system must embody knowledge about the relations between the sensory signals and the object's attributes, and how sensory signals can be expected to relate to each other in various situations.
  2. The system must use knowledge to determine which sensor is informative in a given situation.
  3. Knowledge can "complement," or fill in the missing parts of, available sensory information in constructing the internal image.

Following Mr. Sakaguchi's presentation, there was some discussion of the role of learning, and its relation to how knowledge is used in perception.

_________________________

Extraordinary Session: "What Can Audition People Learn From Vision People," led by H. Nawab

In view of the fact that there is more history to the vision problem than to the audition problem, Prof. Nawab felt it would be interesting to pose the question, "what can audition people learn from vision people?" In particular, what mistakes have vision people made that audition people might be able to learn from? What, on the other hand, is generally agreed to have worked well in vision?

Prof. Bobick answered with an account of intrinsic images. An intrinsic image is a set of registered "images" describing scene surfaces with respect to depth, surface orientation, reflectance, and incident illumination. It was believed at one time that intrinsic images could be calculated by independent modules, such as "shape from shading" or "structure from motion." But vision researchers don't generally regard intrinsic images as a good overall strategy anymore; these modules, it's now generally thought, cannot be independent.

Bobick also pointed out that a factor which played a role in this development was a movement in the early 1980s to divorce AI from vision.

Tom Knight observed that lack of sufficient computational power was an important factor in vision development. Small memories and slow computers mean that only a few examples can be tried, and it's not practical in many cases to investigate how well two algorithms might work in tandem. Audition, he pointed out, involves order-of-magnitude lower information bandwidth than vision.

Prof. Matsuyama pointed out that since vision research seems to have been more or less stalled in the last 15 years, a new approach might be in order. Until now, most vision research has taken place on static images. Perhaps dynamic vision - that is, considering how scenes change in time - is such a new approach. Prof. Matsuyama pointed out that very little is known about dynamic vision; it's not known, for example, how much static vision research is applicable. Since audition research, because of the nature of sound, has never had any choice but to be dynamic, perhaps this is an area where vision people can learn something from audition people.

Prof. Adelson made a somewhat similar point. In vision, he said, it has generally seemed meaningful to talk about point-wise properties in an image. One intrinsic image idea, for example, is to say something like "this pixel corresponds to the surface of this object, which has this color, this distance." In audition, on the other hand, one doesn't point to a voltage at a given time and say "that's a clarinet;" a clarinet is only reflected in a complex of voltages. But perhaps this difference is really illusory; recent work which considers transparency or motion indicates that in vision, as well, it is a mistake to talk about pointwise properties. "Vision people have been able to fool themselves," he said, "while audition people have never had this illusion."

Rosenthal noted that while Bobick had criticized some vision research, particularly "intrinsic images," as being overspecialized and insufficiently integrative, audition research appeared to have a similar problem - most research is directed to one very specialized area, to wit, speech understanding.

This led to a discussion of the ARPA Speech Understanding Project, particularly, a well-known incident in which a speech understanding system interpreted a cough as the phrase "pawn to king four." Dr. Knight noted that the ARPA project corresponded to a phase during which very little attention was paid to lower level, signal processing issues, in the belief that all of these problems could be solved with knowledge-based processing at a higher level.

Dr. Knight also contended (addressing the earlier discussion of Rosenthal and Bobick) that independent modules for processing special kinds of sounds (such as speech) might still be viable, though the issue of coordination between them can't be completely ignored.

Ruttenberg raised the issue that we can understand black and white cartoons even though there are no issues of color, transparency, etc.

Adelson noted that the most-often-used basic representation in audition research is a time/frequency (sonogram) representation. He asked if there were others. Nawab replied that as far as he knew, time/frequency representations seemed to be the most common. He also noted that the idea of a time/frequency representation is not mathematically very well-based, and that it's a mistake to confuse frequency and the result of the Fourier transform.

Ellis contended that the reason a time-frequency representation is appropriate is that that's what the cochlea does. Adelson asked why it is that the cochlea does this; this was followed by a discussion of representations for hearing.

Bobick raised some points about the biological motivation of hearing and vision, and its relation to knowledge; there was some general discussion of this issue.

___________________________

Session 3: Appropriate Domain (Moderator: D. Rosenthal)

What is the appropriate domain for "abstract perception?" What kinds of problems or methods are common to several perceptual domains, and which are specific to a particular one? What can we hope to gain by finding out?

T. Abe, JAIST

Automatic Identification of Human Faces Using 3D Data

Prof. Abe presented a method of computer recognition of human faces using 3D data. Most current face recognition systems, he pointed out, work from 2D data; 3D data, however, has the advantage of being camera-position independent, and provides more usable features than 2D data.

The basic method consists of the following steps:

  1. The surface of a face is extracted from 3D data.
  2. The surface is approximated by a B-spline surface using least-squares approximation.
  3. Surface features are re-represented as a small number of points (vertices).
  4. The vertices are used as feature vectors for identification.

The 3D data is originally extracted from the face using color-encoded structured light. Face data is transformed to a coordinate system so that the origin is at the tip of the nose, the face lies in the x-y plane, and the nose points up along the z-axis. Once the vertices are extracted, the Euclidean distance between two vectors of vertices is used to determine whether they match or not.
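
The final matching step can be sketched as follows. The sketch assumes that each face is stored as a list of (x, y, z) vertices in a fixed order, already expressed in the nose-centered coordinate system, and that a match is accepted when the distance falls below a threshold. The function names, the gallery layout, and the threshold are hypothetical and are not taken from Prof. Abe's system.

```python
import math


def distance(a, b):
    """Euclidean distance between two faces given as lists of 3D vertices."""
    if len(a) != len(b):
        raise ValueError("faces must have the same number of vertices")
    return math.sqrt(sum((p - q) ** 2
                         for va, vb in zip(a, b)
                         for p, q in zip(va, vb)))


def identify(probe, gallery, threshold):
    """Return the name of the closest stored face, or None if none is close."""
    best_name, best_distance = None, math.inf
    for name, vertices in gallery.items():
        d = distance(probe, vertices)
        if d < best_distance:
            best_name, best_distance = name, d
    return best_name if best_distance <= threshold else None


# Toy data: two stored faces with two vertices each (the first vertex is
# the nose tip at the origin), and one probe face.
gallery = {
    "person A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "person B": [(0.0, 0.0, 0.0), (0.0, 3.0, 4.0)],
}
probe = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]
match = identify(probe, gallery, threshold=0.5)
```

Because the vertices are expressed relative to the face itself, the comparison does not depend on where the camera was, which is the advantage of 3D data noted above.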

Prof. Abe then explained in detail the process of creating the B-spline surface from the original data, and of determining the vertices.

Finally, Prof. Abe reported the results of testing the system on a dataset of 165 faces, derived from 33 persons, with 5 data sets per person. In this trial, the identification accuracy of the system was 99.4%. Prof. Abe mentioned that a system using 2D data achieved 80.5% accuracy.

_____________________________

I. Fujinaga, McGill University

Lessons from A Learning System for Optical Recognition of Musical Scores

Mr. Fujinaga felt that the role of learning in perceptual systems should be emphasized; that sophisticated models of perceptual systems had to include the ability to learn. Living organisms, especially those that can move around, all seem to incorporate such an ability.

Mr. Fujinaga has been working on a system to optically recognize music scores. Since the number of ways that musical symbols are drawn is very large, varies from publisher to publisher, and is constantly growing, creating a database of such symbols by hand is impractical, and the system must be endowed with the ability to learn.

Mr. Fujinaga used a nearest-neighbor method for classifying symbols. This method is advantageous in that accuracy improves over time; however, the system becomes slower, in contrast with biological systems, whose time-performance usually improves as they learn.

The reason that the system's time performance degrades as it learns more is that the system stores every symbol that it sees, and then compares every new symbol with every stored symbol. It seems reasonable to assume, however, that matching each new symbol against large numbers of previously-seen symbols is probably unnecessary, and that a subset of such symbols would be sufficient. If such a subset were found, the system's time-performance could be improved. Mr. Fujinaga applied learning techniques to this problem as well. Furthermore, Mr. Fujinaga applied learning techniques to the problem of adjusting the weights given to various features (moments) of the stored symbols. Finally, Mr. Fujinaga maintained statistics on which of several learning algorithms produced the best results, so that the system could "learn how to learn."
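
The trade-off described above is easy to see in a minimal nearest-neighbor sketch. The class below is an illustration only, not Mr. Fujinaga's system: the class name, the two-element feature vectors, the symbol labels, and the fixed weights are hypothetical (in the system described, the weights were themselves adjusted by learning).

```python
import math


class NearestNeighborClassifier:
    """Stores every labeled symbol and classifies by weighted distance."""

    def __init__(self, weights):
        self.weights = weights  # one weight per feature
        self.memory = []        # every (features, label) pair seen so far

    def learn(self, features, label):
        # Accuracy can improve with each stored example, but so does the
        # cost of classification, which scans the whole memory.
        self.memory.append((tuple(features), label))

    def classify(self, features):
        best_label, best_distance = None, math.inf
        for stored, label in self.memory:
            d = sum(w * (f - s) ** 2
                    for w, f, s in zip(self.weights, features, stored))
            if d < best_distance:
                best_label, best_distance = label, d
        return best_label


classifier = NearestNeighborClassifier(weights=(1.0, 1.0))
classifier.learn((0.1, 0.9), "quarter note")
classifier.learn((0.8, 0.2), "half rest")
label = classifier.classify((0.2, 0.8))
```

Each call to `classify` performs one comparison per stored symbol, which is why classification slows as the memory grows and why keeping only a sufficient subset of symbols would improve time-performance.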

Following Mr. Fujinaga's presentation, there was discussion about which aspects of the system are applicable to other perception problems, in particular, to audio processing problems.

_____________________________

Y. Kakita, Kanazawa Institute of Technology

Appropriate Domain

Prof. Kakita began by introducing himself and his research interests, which are in speech science; Prof. Kakita is interested in models of how speech is articulated by the tongue and lips. His other interests include acoustic diagnosis of vocal diseases, and development of artificial larynxes.

Prof. Kakita introduced certain perceptual effects relevant to his work, which indicate organization of perceptual information by high-level processes (in this case, those responsible for understanding speech). The first is illustrated by the following diagram:

The sounds represented in A and B in the above figure are the component formants of a vowel. When they are heard separately, they are perceived as distinct tones; when they are heard together, as in C, we hear only one tone, corresponding to the vowel.

Another effect relates the perception of /L/ and /R/ sounds, a distinction many Japanese have difficulty with.

The distinction between "la" and "ra" is contained in the third formant, as the diagram above shows. When Japanese subjects hear the third formant in isolation, they have no trouble distinguishing the two sounds; when the three formants are heard together, however, Japanese listeners perceive the two as sounding the same, indicating that higher-level, learned processing is causing the distinction to be ignored.

Prof. Kakita referred to the theory of the Swiss linguist Ferdinand de Saussure, who proposed that "auditory images" corresponding to various speech sounds are formed in the brain; the specific speech sounds in question are arbitrary and therefore vary from language to language.

Prof. Kakita also mentioned the hypothesis proposed by the Haskins Laboratories (Connecticut, USA) speech research group, which says that we have "an image of the speech organs, and of their motions," somewhere in the brain. Prof. Kakita and his colleagues have expanded this theory, and found theoretical evidence pointing to a geometric mapping of tongue shape in the cognitive levels of the brain. Prof. Kakita suggests that such a mapping may lie in the intersection of auditory and visual processing, and he is therefore interested in facts relating to this intersection.

In the final part of his presentation, Prof. Kakita presented a series of expressions, illusions, and proverbs from both Eastern and Western sources that bear on the question of how perception works.

The first was the Japanese expression "soramimi," or "illusory hearing." An example of this is imagining that one hears a baby crying when one hears the sound of water running.

Another example was that a triangle of dots will tend to be seen as a face, rather than a triangle. The third was a Japanese expression which translates as "What I saw as a ghost in the dark was nothing but the 'kare-obana' weed." The fourth was a Japanese proverb, whose translation is "If you pay no attention, [lit., if your mind is not here] you can't comprehend what you see, and you can't understand what you hear." Finally, he mentioned a quote from "Le Petit Prince," by Antoine de Saint-Exupéry: "On ne voit bien qu'avec le coeur. L'essentiel est invisible pour les yeux." ("One sees well only with the mind (lit., heart); the essential is invisible to the eyes.") The common thread of these examples was that they point to the role of higher-level processing in perception.

Prof. Kakita concluded by stating that if we want to build an artificial system that behaves like humans, it should be able to handle the cases mentioned in his examples. Prof. Kakita's general framework for emulating perception is:

  1. From signal to pattern.
  2. From pattern to meaning.
  3. From meaning to object.

__________________________

H. Okuno, NTT Basic Research Labs

Auditory Scene Analysis With Multi-Agent System

Dr. Okuno observed that while several vision researchers had mentioned the problem of "inverse optics," there didn't seem to be a corresponding problem of "inverse acoustics;" the correlate of inverse optics in acoustical processing, Dr. Okuno suggested, was auditory scene analysis. In neither case can we hope for a complete solution, and for this reason active perception - perception guided by the goal of discovering specific information - is important.

Dr. Okuno mentioned that both Prof. Matsuyama's vision group and Dr. Okuno's audition group were pursuing multiagent approaches to perception, and that they had both participated in the MACC (Multiple Agents and Cooperative Computing) conference of 1993. Dr. Okuno suggested that multiagent systems are appropriate where the desire is to produce goal-oriented behavior, situation-based behavior, adaptive behavior, openness, or robustness.

The domain to which Dr. Okuno has been applying multi-agent systems is sound source separation. In this application, sound is modeled as a set of streams, each corresponding to a separate source, and each characterized by consistency in some dimension. For each stream, a programmatic agent is generated, which tracks the stream as it develops in time. The operation of Dr. Okuno's system is represented in the following figure:

In its initial state, the sound is input into a tracer generator. Whenever the tracer generator detects a new stream (or a stream not already being tracked by any other tracer), the tracer generator produces a new tracer.

Dr. Okuno demonstrated his system on audio data consisting of voice plus sine tones at various pitches, and a mixture of two voices (a man's and a woman's). In the initial version of the system that Dr. Okuno presented, when the system was attempting to separate the voice and the sine tone, the tracer that was tracing the voice added part of the sine tone at the end. This problem arose because the system had produced redundant tracers; it had no way of knowing that two different agents were tracing the same tone. Another problem with the initial system was ghost tracers - tracers that trace tones not really present in the input.

In a more advanced version of the system, each tracer has a monitor. The monitor terminates the tracer if it is inappropriate (if, for example, it appears to be a ghost tracer) or if it is redundant. The system with monitors avoids the problems of the initial system, with corresponding improvement in the output.
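
A monitor of this kind might be sketched as below. This is a deliberately simplified toy rather than Dr. Okuno's design: it assumes each tracer is reduced to the single frequency it is tracking, that a ghost tracer is one whose frequency carries too little energy in the current frame, and that a redundant tracer is one lying too close to an older surviving tracer. The function name and both thresholds are hypothetical.

```python
def monitor(tracers, spectrum, min_energy, min_separation):
    """Return the tracers that survive one round of monitoring.

    tracers:  list of tracked frequencies in Hz, oldest tracer first
    spectrum: dict mapping a frequency in Hz to its energy in this frame
    """
    survivors = []
    for frequency in tracers:
        # Ghost: the tracer follows a tone that is not really in the input.
        is_ghost = spectrum.get(frequency, 0.0) < min_energy
        # Redundant: an older surviving tracer already follows this tone.
        is_redundant = any(abs(frequency - kept) < min_separation
                           for kept in survivors)
        if not (is_ghost or is_redundant):
            survivors.append(frequency)
    return survivors


# Three tracers: two on nearly the same tone and one on a tone that has
# no energy in the current frame.
tracers = [440.0, 442.0, 1000.0]
spectrum = {440.0: 1.0, 442.0: 0.9, 1000.0: 0.0}
survivors = monitor(tracers, spectrum, min_energy=0.1, min_separation=5.0)
```

Only the 440 Hz tracer survives here: the 442 Hz tracer is terminated as redundant and the 1000 Hz tracer as a ghost.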

_____________________________

A. Ruttenberg, MIT Media Lab Learning and Common Sense Group

Summary of Issues Raised in Previous Talks

Mr. Ruttenberg began with a few words about his own research interests, which are in optical musical score reading and automatic musical composition. He described his score reading system as consisting of a low-level system for detecting features, on top of which is a reasoning system similar to that of Prof. Matsuyama's SIGMA system, and finally a high-level system for eliminating ambiguity. Mr. Ruttenberg noted that like many of the systems described at the workshop, the score reading system starts out in a "numerical" domain (where features are detected), and ends up in a symbolic domain, where constraint propagation is used. A feature of Ruttenberg's system is a way of explicitly representing ambiguity; this is necessary because the feature detectors are tuned to be very aggressive; they produce many wrong answers, in order to avoid ever failing to produce the right answer.

Based on his experience with this system and his review of the previous talks, Mr. Ruttenberg raised the following six issues:

  1. How can we get past high-level generalizations such as "use knowledge," "use feedback," or "use agents," to more specific abstract perception issues? The problem, contends Mr. Ruttenberg, is that there are very few examples of working computational perception systems. This makes it difficult to analyze what such systems might have in common.
  2. A path from numerical to symbolic reasoning characterizes many of the systems discussed at the workshop, such as IPUS, the optical score reading systems, SIGMA, dspB (Dan Ellis' sound source separation project), and the systems described by Bobick and Adelson.
  3. Where is the boundary between perception and reasoning? Some processes are clearly one or the other, but there seem to be many intermediate processes not immediately classifiable as one or the other. There may be something to be gained by clarifying this matter.
  4. Mr. Ruttenberg contends that there may be an interesting interplay between story-understanding and perception. If one has a certain "story" in mind, say, about someone getting ready to go on a trip, a certain sound may be perceived as the sound of someone buckling a suitcase; in other circumstances, the same sound might be heard as someone washing dishes.
  5. Mr. Ruttenberg mentioned the book "Women, Fire, and Dangerous Things," a study of categories and metaphor by Lakoff. While several people at the workshop pointed out that reasoning can influence perception, this work suggests that the reverse is also true - that is, the way we perceive influences the way we think. Certain abstract reasoning paradigms (logic, for example) can be thought of in terms of "inside/outside."
  6. How do we deal with ambiguity? In some systems, such as Mr. Ruttenberg's optical score reading system and the SIGMA system, ambiguity is represented by multiple hypotheses. In the ALV vision system described by Bobick, on the other hand, ambiguity is handled by using a simpler representation until there is enough information to add more detail.
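
The "multiple hypotheses" strategy in issue 6 can be illustrated with a toy example. This is a sketch under stated assumptions, not Mr. Ruttenberg's system: the label set `BEATS`, the function `consistent_readings`, and the rule that durations must fill a 4-beat measure are hypothetical, and brute-force enumeration stands in for real constraint propagation.

```python
from itertools import product

# Hypothetical durations, in beats, for candidate note labels.
BEATS = {"whole": 4.0, "half": 2.0, "quarter": 1.0, "eighth": 0.5}


def consistent_readings(candidates, beats_per_measure=4.0):
    """candidates: one set of possible labels per glyph in a single measure.

    An aggressive detector reports every label it cannot rule out; the
    symbolic stage keeps only label assignments whose durations fill the
    measure exactly.
    """
    readings = []
    for labels in product(*[sorted(c) for c in candidates]):
        if sum(BEATS[label] for label in labels) == beats_per_measure:
            readings.append(labels)
    return readings


# Two glyphs are ambiguous, yet only one reading of the measure survives.
measure = [{"half", "quarter"}, {"quarter"}, {"quarter", "eighth"}, {"quarter"}]
print(consistent_readings(measure))
```

The point of the sketch is the division of labor: the numerical stage is allowed to over-generate, and the ambiguity is carried explicitly until a symbolic constraint removes it.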

Conclusions

In this section I (the editor) try to outline some of the workshop's broader themes. This is not, I should note, the only attempt to summarize the workshop: in particular, the presentations of both Dr. Ando and Mr. Ruttenberg provided broad overviews or interpretations of material presented by other researchers.

Themes

  1. A view expressed by many workshop participants was that the earlier stages of visual and auditory processing are fairly different, but that the goal of the processing is in each case the same - namely, to obtain useful information about the environment. We should therefore look for similarities at the higher processing levels, where the processing goals are presumably of more immediate concern. Before setting too much store by this apparent consensus, we should note that the terms "high-level," "mid-level," and "low-level" were used in very different ways by different researchers. Some, for example, described sound source separation as a low-level process, while others, I'm sure, would call it at least a mid-level process.

    Thus the question explored in the workshop was not so much "Is there commonalty," but "when does it start, and how significant is it." While it may be the case that no one at the workshop contended that vision and audition are entirely dissimilar, some doubted that these similarities would be important in the design of practical perception systems.

  2. Even the most "pro-commonalty" researchers don't suggest that the same brain tissue does both vision and audition. The ferret experiment of Sur et al., discussed by Dr. Ahmad and Dr. Ando, assumes that visual and auditory processing are accomplished in different physical locations. These experimenters suggest, however, that those processing centers have a very similar structure.
  3. Two of the systems presented at the workshop might fairly be called "abstract perception" systems. Both the IPUS system discussed by Prof. Nawab and Mr. Klassner, and the sensor interpretation system described by Mr. Sakaguchi, are systems for interpreting perceptual information which may come from a variety of sensor types.
  4. Vision is often expressed as a problem of producing a 3D model from a 2D image. While this problem is generally ill-posed, the solutions are constrained by physical laws and properties of illuminance and reflectance of materials. There seems to be an obvious mapping from the points on an image to the real-world objects that gave rise to them. In audition, on the other hand, the situation seems less clear. The voltage on a wire coming off a microphone at a particular time does not directly correspond with an object in the real world. There was a good deal of discussion of this apparent difference between vision and audition.

    Professor Adelson, however, suggested that the apparently greater clarity of the vision problem may be illusory. His research suggests that since the eye generally interprets visual images as representing a series of layers, it is a mistake to interpret a pixel as representing the light reflecting off a certain part of a certain surface. Like a microphone voltage, the brightness value of a pixel represents a complex summary of physical events for which there is no straightforward decoding.

  5. The presentations of Prof. Matsuyama and Dr. Okuno seemed to support the contention, made in the workshop invitation, that researchers in audition and vision were working on the same problems. Both researchers applied a multiple-agent system to a perception problem - visual segmentation in the case of Prof. Matsuyama, sound-source separation in the case of Dr. Okuno. Both apparently ran into the same kinds of problems, in essence because their agents didn't know enough about each other, and both solved them by making their agents better informed. The similarity of the presentations may not be entirely coincidental; the two were previously aware of each other's work, but in my view it was striking nevertheless.
  6. A good deal of discussion centered, in various ways, on issues of how high-level goals and knowledge bear on preliminary processing. The thrust of Dr. Knight's symposium presentation was that too little attention has been paid to issues of preliminary processing in mammalian auditory systems. Prof. Nawab's talk concerned how preliminary processing can be controlled by higher-level considerations. Prof. Matsuyama, Mr. Klassner, Mr. Kashino, Prof. Bobick, and others also considered this issue. Prof. Abe showed that a face recognition system's performance can be significantly enhanced if it starts out with 3D, rather than 2D information.
  7. It was pointed out that there was markedly little discussion of learning. The only presenter to concern himself principally with learning issues was, in fact, Mr. Fujinaga, who has built a system that improves its own performance on a musical score-reading task. I suspect that most of the researchers at the conference feel that learning can only be profitably explored once the more basic issues are fairly well understood. Researchers tend to suspect that computers will not be able to learn to do things that we don't know how to do. Mr. Fujinaga's system, however, appears to do exactly that. It learns to recognize glyphs in a musical score better than we can teach it by hand. It thus uses learning in a profitable manner for a problem that is usually regarded as difficult and not well-understood. This raises interesting questions of whether such techniques might be applicable to other areas of perception.
  8. There was also a good deal of discussion of the view that vision is dynamic - in other words, it is a mistake to disregard the time-axis in vision (though most vision research to date has in fact done exactly that). The introduction of the time axis into the vision processing world raises obvious issues of hitherto unremarked parallels with audition since, as one researcher remarked, audition researchers have never been able to fool themselves into thinking they could ignore it.

Perceptual Grouping

I'll conclude with some somewhat subjective observations on Abstract Perception, and on future directions that I believe are worth pursuing.

In my view, many important perception problems can be viewed as problems of hierarchically grouping primitive elements. This is the case even when the ultimate goal of the system appears not to have anything directly to do with grouping.

A speech processing researcher, for example, might claim that the goal of their system is to determine which word is represented by a certain digital audio signal. But it is often the case that once one knows that the signal in question represents one and only one word, and that it doesn't represent any significant amounts of noise or other sounds, then the problem is easy. Isolated word recognition, in other words, is easy. It is continuous speech recognition that is difficult. To transform a continuous speech recognition problem into an isolated word recognition problem, we need to divide the signal into parts, and to form those parts into appropriate groups. A similar argument holds for other audio understanding tasks, such as pitch detection or non-speech sound identification.

In the visual domain, the designer of a visual navigation system might state their problem as the detection and avoidance of obstacles. The obstacle in question, however, must first be segmented from its background. Once it is segmented, it must be organized into meaningful parts. Both of these are grouping problems.

Many vision and audition systems and approaches are explicitly ways of grouping primitive elements:

The justifications for forming groups in various domains and at various levels will, I believe, tend to be different. In vision, a certain grouping may be preferred over others due to considerations involving the reflectance properties of some material, or the physics of how points on a spinning object move relative to each other. In audition, the justification for a grouping may be that sound elements with a common onset are likely to have originated from a common physical source, or it may be influenced by the properties of a mechanical filter in the auditory system. It's usually the case that a number of such factors are in operation simultaneously, and one of the common problems is to find good ways of integrating these factors into a single answer.
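
As a rough illustration of integrating several grouping factors into a single answer, the sketch below combines two auditory cues (common onset and harmonic relation) into one pairwise score and then groups elements whose score passes a threshold. Everything specific here is an assumption made for the example: the element format (dicts with `onset` in seconds and `freq` in Hz), the cue functions, the weights, and the threshold.

```python
def onset_cue(a, b, window=0.05):
    """Graded cue: 1.0 for simultaneous onsets, falling to 0.0 at `window` seconds."""
    return max(0.0, 1.0 - abs(a["onset"] - b["onset"]) / window)


def harmonic_cue(a, b, tol=0.03):
    """1.0 if the higher frequency is close to an integer multiple of the lower."""
    lo, hi = sorted((a["freq"], b["freq"]))
    ratio = hi / lo
    return 1.0 if abs(ratio - round(ratio)) <= tol else 0.0


def group(elements, weights=(0.6, 0.4), threshold=0.5):
    """Link pairs whose weighted cue score passes `threshold`; return index groups."""
    parent = list(range(len(elements)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(elements)):
        for j in range(i + 1, len(elements)):
            score = (weights[0] * onset_cue(elements[i], elements[j])
                     + weights[1] * harmonic_cue(elements[i], elements[j]))
            if score >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(elements)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

With these particular weights, a near-simultaneous onset is enough to link two elements by itself, while a looser onset match needs the support of a harmonic relation. A weighted sum is only one of many possible integration schemes.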

The pervasiveness of these grouping problems is what, in my view, binds perception problems in different domains together, and distinguishes them from other kinds of computations. I believe that there is justification for studying the (abstract) problem of forming groups of primitive elements into hierarchies. The results of such a study will in turn be useful for applications in specific domains.

Future Work

I hope, of course, that this argument partly convinces, or at least intrigues, some of its readers. In my view, however, the next step is not to try to make better arguments of this sort, but rather to try to build a system that implements these ideas, and I am in fact working on such a system. The system (called APE - Abstract Perception Engine) is essentially a means of forming and manipulating perceptual groupings. The input to such a system is a list of primitive elements, together with justifications for grouping them in certain ways; the output is graded lists of hierarchical groupings. The system has been partially implemented and applied to the domains of rhythm parsing and sound source separation.
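
The input and output described for such a system might be pictured along the following lines. This is a guess at the general shape of the interface only, not the APE implementation: the function `build_hierarchy`, the `(a, b, strength)` form of a justification, and the greedy strongest-first merging are all assumptions made for illustration.

```python
def build_hierarchy(elements, justifications):
    """elements: list of distinct hashable names for primitive elements.
    justifications: list of (a, b, strength) tuples arguing that a and b
    belong together, with strength in [0, 1].

    Returns a list of (strength, grouping) pairs, strongest first, where each
    grouping is a nested tuple built from earlier groupings.
    """
    node_of = {e: e for e in elements}   # top-level node currently holding each element
    merges = []
    for a, b, strength in sorted(justifications, key=lambda j: -j[2]):
        na, nb = node_of[a], node_of[b]
        if na == nb:
            continue                     # already in the same group
        merged = (na, nb)
        for e in elements:
            if node_of[e] == na or node_of[e] == nb:
                node_of[e] = merged
        merges.append((strength, merged))
    return merges


# Four hypothetical elements and the evidence for pairing them.
result = build_hierarchy(
    ["a", "b", "c", "d"],
    [("a", "b", 0.9), ("c", "d", 0.8), ("b", "c", 0.4), ("a", "d", 0.3)],
)
for strength, grouping in result:
    print(strength, grouping)
```

Each returned pair is one level of a hierarchy together with the grade of the justification that produced it; a fuller system would presumably keep several competing hierarchies rather than committing greedily to one.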

Appendix: Workshop Interview Summaries

Interviewees were asked to give their general impressions of the workshop, and invited to suggest improvements.

Interviewer: David Rosenthal

1. Prof. Adelson, MIT Media Lab

The conference has generally been very interesting and stimulating.

I would have liked to see more tutorial material in the style of Tom Knight's symposium lecture. Vision people (such as myself) don't know much about the fundamental problems in audition. Mutual education would be beneficial. I'm not so much interested in audition for its own sake, but I'm interested in ways that research in audition might apply to vision.

Holding tutorials for non-experts might be comparable in some ways to teaching undergraduates. Teaching undergraduates is an interesting experience - on the one hand, it's frustrating, because they don't know anything. But on the other hand, it's stimulating, because you have to rethink the fundamentals and basic assumptions of your field.

I find it frustrating when people describe their work in abstract terms, as tends to happen at such gatherings. When people look for commonalties, they tend to get too abstract. I'd like to know about the guts of what they are doing. Starting out with a concrete example, and moving towards an abstraction is the way to go, in my view.

The initial literature had specific examples, which I thought were interesting. I would have liked to see more of that work [dfr's] described there. [N.B. In response to this and similar comments, I made a presentation of my own work the next day - dfr.]

Is there any particular reason why there weren't any women at this conference?

2. Prof. Bobick, MIT Media Lab

Well, I came skeptical, but I leave less skeptical. It seems to me that between vision and audition the answers [to perceptual modeling problems] may not be the same, but the options and strategies are similar. For example, the way that my football-play tracking system switches its tracking strategy based on knowledge is similar to the way IPUS works. But we don't have all that much to learn from each other.

I was glad to be able to find out more about sound. It's difficult to find researchers interested in the general problem of processing sound; most of this work is centered on speech processing.

It seems to me that Shimon Ullman's visual routines work is relevant. [Bobick then explained a few of the basics of Ullman's work, which I haven't included here.]

Audio processing people need to realize the importance of mid-level representations. [What can vision people tell them about that?] Not much! We don't have good mid-level representations either!

3. Dr. Knight, MIT AI Lab

I've found the workshop very enjoyable. There are a lot of interesting people around.

I've been thinking about who is "missing," that is, people that it might be a good idea to have at such a workshop. Eric Grimson and Shimon Ullman come to mind. [N.B. These are both vision professors in the MIT AI Lab.] Ullman's ideas are basically abstract perception ideas.

You might also think about having a "dyed-in-the-wool" speech understanding person around, someone interested in Hidden Markov techniques, for example. "We" - i.e., the cognitive AI people of the kind represented at this workshop - have fundamental disagreements with these people, but it would be a mistake to ignore them, and they might provide a valuable perspective. The kinds of statistical techniques they are pursuing will eventually "top out" - they'll reach a stage where they can't be improved any further. But for now, they perform better than the reasoning techniques pursued by people here. We need to be able to make our case convincingly.

4. Prof. Hiraga, University of Library and Information Science

I've been having a very pleasant and stimulating time. The staff members have done a very good job. But it's difficult to comment on any concrete accomplishments of the workshop. More email discussion might have helped make the discussion more substantive. The effort was made, so perhaps the participants are at fault. I don't really know how to make this happen.

Abstract Perception doesn't relate directly to my research interests. I'm more inclined to look at specific domains. The symposium talks were very interesting, but today's talks concerned mainly personal research. People seemed a bit at a loss. I'm not sure how motivated you are in presenting Abstract Perception.

The study of multimedia systems was under-represented. This could have led to a common basis for discussion.

5. Prof. Matsuyama, Okayama University

It's necessary to have more leadership in guiding the discussion. Nawab's session in particular was diverse and difficult to follow. One can't force people to a conclusion, but they should be more strictly guided. People need to be reminded of the original questions of the workshop. The first session was interesting, but in the second session, we were "too punctual;" we shouldn't have been so strict about the ten-minute limit.

The younger Japanese researchers only talk about their own work, and are afraid to speculate. [What would you suggest to correct this?] I don't know, it's a very difficult problem. Several years ago I ran a workshop where we tried to introduce young researchers to problems of computer vision, and we encountered the same problem. I am a somewhat exceptional Japanese researcher in this respect, since I say what I think; this is not common. Having a clear opinion is considered dangerous.

This workshop is important and there should be more of them. It should be made clear what benefit IMRF derives from such a workshop.

Interviewer: Kunio Kashino

1. Prof. Kakita, Kanazawa Institute of Technology

General Impression?

It was interesting to me because there are few people around me who are doing research on vision, and I have never thought about the relationship of vision and audition. I learned that the number of people that have thought about this is small.

What was good was that I learned a lot of new things and I met people from a broad range of areas that I could talk to. As for improvements, it should be made clearer what the role of the presenter is. It wasn't clear how people should use their ten minutes. In the morning, people didn't stick to the ten-minute limit. The format wasn't clear. Each presentation was interesting but we couldn't pursue it through discussion. Perhaps a format of presentations only in the morning and discussion only in the afternoon would be a good idea. It might have been nice to have another day.

[Kashino - David's policy was to avoid putting a lot of constraints on presenters.]

But this policy was not announced. When I talked to Horiguchi-sensei, he didn't know about this policy. I understand the spirit of such a policy. I think that some of the presenters did not, however, and this was a problem. Some people talked about their own work, while others talked more generally. In today's session, the Japanese researchers were very intent on presenting their own work. A clearer statement of policy beforehand might have helped.

It's also difficult for the Japanese researchers to participate in English discussions. If you want Japanese researchers to talk about things outside of their specialties, they'll need more preparation than the American researchers. Japanese participants had trouble with the term "domain;" we weren't sure if this term had a special meaning in, for example, vision research.

[Kashino - each of vision and audition is a domain]

Before the interview I was talking with Fujinaga-san, and he was saying that each of the five senses should be considered a domain.

I would have liked to pursue questions about the "exchangeable cortex" ferret experiments further. At what level is this exchange made? But there wasn't enough time.

The talk about spatial mapping in auditory research is reminiscent of some of the things that come up in speech modeling, where according to one theory, we produce speech using an internal model of the mouth.

[Kashino - you mean an internal model of the mechanics of the mouth.]

The mechanics as well.

[Kakita-sensei has the idea that in order to produce speech we need to have a "spatio-mechanical" (my term) internal model of the mouth. Such an ability is interesting from the point of view of Abstract Perception, because such a model is strictly neither visual nor auditory, but rather interestingly situated between them.]

2. Mr. Sakaguchi, University of Tokyo

I've found the workshop very interesting. I'm sorry that I haven't been participating more in the English discussions. One thing I learned is that I should learn more English!

As far as improvements go: it seems that vision and audition people tend to stick to their own areas. The discussion may not be going exactly as Rosenthal intended.

[Kashino - what did you understand from the term "abstract perception?"]

I understood this to refer to algorithms and methods that might be common to vision and audition systems.

3. Prof. Inui, Kyoto University

[General Impression?]

The topic of the workshop is broad and abstract. Generally we focus on a more concrete theme. Also, there are language difficulties. Each participant should talk more about the specifics of their research problems. Researchers in vision are not familiar with the details of audition research, and vice versa. Vision has a framework for thinking about intelligent processes, while audition still seems to be involved with the signal processing level. I would like to know more about the audio processing systems discussed at this workshop. Abstract Perception itself is pointed in a good direction, so I expect that discussion will be more focused in the future.

[Issues properly raised?]

I will host an international conference on integration in the human brain this year. We're calling it "information integration," but we haven't settled on a clear definition for this term yet. We'll discuss integration of vision, audition, action, and language recognition. This issue is similar to that of this workshop. Presentation should be based on more concrete data - I know that this is difficult.

[Any ideas from the workshop that you plan to pursue further?]

I learned from the knowledge session that a number of other researchers use the energy-minimization method that I am interested in. I will talk to Prof. Matsuyama about "mutual information quantity."

[Other comments?]

Although the workshop was hosted by a Japanese institute, U. S. researchers played leading roles in the discussion. What does IMRF expect from this workshop? I would suggest inviting more Japanese researchers, and providing the opportunity for them to learn from foreign experts in certain specialties. In the past few years, many advanced foreign researchers have been visiting Japan, despite Japan's current economic problems. The workshop must be characterized more clearly to be significantly successful. I would recommend having brainstorming discussions with a few foreign researchers present.

[Would you participate in another Abstract Perception Workshop if there is one?]

Yes, if there is overlap with my interests.

____________________________

4. Prof. Akagi, JAIST

[General impression?]

I appreciate it. Japanese generally believe that auditory signal processing research is advanced in Japan, but this is not the case. Only statistically-oriented speech recognition is advanced. There are only a few researchers doing speech perception or audiology. I'm looking for potential colleagues in this field, and at this workshop, I met people that I could talk with, who provided me with interesting suggestions.

[How, in your view, could the workshop be improved?]

If we had had another day, we would have had time to make 20-minute presentations. In 10 minutes, I scarcely have time to raise issues of interest to me, or to present my own research results. Perhaps if abstracts were distributed, and sessions were broken into more specific areas, we could participate more in the discussion.

[Issues properly raised?]

I am interested in signal processing, domain specific issues. But I understand the importance of integrating perception into a common approach. Through this process, the system should maintain some domain specificity. I think that there are domain-specific issues that still elude researchers in perceptual computing. These domain-specific questions should be pursued. Low-level, sub-symbolic processes are domain-specific.

[Any ideas from the workshop that you plan to pursue further?]

As I mentioned, I have few research colleagues in my field, and I will bring back reports from the presentations and discussion made here to my lab.

Most Japanese research follows leads established in the U. S. I'm not always in favor of this situation, but here, it may be warranted. I basically believe that common perceptual processes are those responsible for integration. But we don't yet know what kind of information is shared, and how the process is initiated. The low-level processing issues must be made clear before we can properly address the abstract issues.

Interviewer: Dan Ellis

1. Dr. Ahmad, Interval Research

Time should have been more tightly organized during workshop sessions; each participant should have been limited to exactly 10 minutes. But on the other hand, it's good that everyone gets a chance to say something.

The discussion led by Nawab - a free discussion with a moderator, rather than a series of speakers - worked well.

The general organization was very good; organizing an international workshop is not an easy thing to do. Holding the workshop in a ryokan was a very good idea. For the American participants there was another side to the workshop, that of being in Japan, being in the ryokan, the onsen, and so on.

As for the relevance of the workshop to my research interests, it is definitely relevant to my interests; otherwise I wouldn't have come. Abstract perception can aid research in specific perceptual areas. Applying techniques developed in the context of another modality always helps one understand [perceptual modeling problems] better. General principles are useful to everyone. One of the nice things about abstract perception as a field is that it forces one to look at general principles, rather than specific techniques in vision or audition. Actually, we didn't get into that very much; the discussion was more on the level of specific techniques from vision that might be useful in audition, for example. If you were to write a book on the subject, it would be good to try to create such general principles.

When you do manage to form these abstract principles, you often find that they've been studied by mathematicians or information theorists, which makes a whole new body of knowledge available that you can apply to your problem.

As far as applying ideas from the workshop goes: I'll be working on problems in visual attention this year, and I think that work on attention from the auditory domain is relevant. What will I take back? Well, I'll definitely be opening a book on audition!

2. Mr. Fujinaga, McGill University

The conference was very well organized, we were treated very well, and the food was excellent! The whole meeting was a good idea; considering that it's the first, it is going extremely well. The size is good; there is ample opportunity for one-on-one encounters. The idea of having the symposium on the first day is good, since it set the stage for the workshop. The biggest problem is that we don't have enough time.

As far as the panels go: people mainly present their own work, and don't always get to "crossover," that is, application of ideas from one domain to another. There are other modes of perception that aren't getting discussed, such as tactile perception and taste.

Rhythm - Dave's interest - is auditory perception on a larger scale, much as poetry, sentence understanding, or understanding ballet scenes might be considered speech understanding or vision on a larger scale. There's a general issue of "level" - not just level of abstraction, but level of time-scale, for example.

Maybe we need an "Abstract Perception Conference" - that is, a gathering where people actually have to talk about abstract perception. Psychologists do essentially talk about abstract perception - but we also need to have computational models.

The word "domain" apparently doesn't translate into Japanese with the right connotations. In general, there is a language problem, which inhibits Japanese presenters somewhat. Of course, if you go to a European conference, the main language will also be English. I wanted to hear more from the Japanese researchers. On the other hand, maybe it's not entirely a language problem; maybe if it was all in Japanese, the Japanese participants would have been quieter anyway.

Abstract perception is relevant to my interests in that I want to get input and ideas from as many sources as possible. If I was only interested in finding out what I could from vision people, I could just go to a vision conference. But the issue that I'm interested in - perceptual learning, which I maintain is essentially an abstract perception idea - probably wouldn't get discussed there.

In general I think that the one-on-one conversations were very valuable for me. It will probably take me a little while to digest what I learn here.

3. Dr. Okuno, NTT

It's difficult to get Japanese researchers to actively participate in these discussions. From my point of view, my background is in information processing. If the discussion is about specific aspects of visual and auditory perception, it's difficult for me to contribute. When we talk about information-processing aspects of vision and audition, I'm able to contribute more.

When we look at computer implementations of audition and vision, we can examine the issues of similarity more closely. For example, today, Prof. Matsuyama presented a multi-agent approach to vision. My work involves a multi-agent approach to audition. In this case, we can see some commonalty.

I also want to point out the differences between hearing vs. listening, or seeing vs. looking. When we attempt to design real-world systems, we have to consider issues of attention.

As for what I would have done differently: I would have had a session on differences between audition and vision. This might very well lead to a clearer picture of what the commonalties are; if we only look for commonalties, we might end up saying nothing of importance.

Another idea: everyone gives a presentation entitled "Why My Area Is Difficult." This might cause some commonalties to become evident. This might also help to educate conferees about specialties that they don't know about.

As far as applicability to my own research goes: As I mentioned, I think that attention is an important issue - and it is also a domain-independent issue. Attention requires the use of several modalities; we cannot consider it to be solely a vision problem or an audition problem.

As far as taking things back from the workshop: I will have to digest what I've picked up here; I won't be applying it immediately. Workshop ideas will probably indirectly affect my work. Most important is the networking aspect - getting acquainted with various people, opening up new channels of communication. Then even if I don't understand something, I at least have a pointer to a person.

4. Mr. Klassner, University of Massachusetts

In terms of general organization, the informal atmosphere was very effective for promoting exchange of ideas, and the size was just right. There was ample opportunity for productive one-on-one encounters.

I would have liked to see more time for individual discussions. A certain amount of time has to be spent "breaking the ice." You need to know who people are in order to communicate with them effectively.

As far as the panels go: there should be a clear distinction between discussion and presentation. This is not the place to simply present your own work on vision or audition. Participants should be more "radical," willing to present possibly controversial ideas. This sort of thing worked better in the afternoon session.

The work discussed here is highly relevant to my interests. I have gained an understanding that there are in fact common problems between audition and vision, though these may not extend to implementation details.

5. Prof. Nawab, Boston University

In terms of a grade, I'd give it [the workshop] an A! The only problem was that panelists tended to view the sessions as an opportunity to present their own work, and hence overran their time. On the plus side, it was a very good and diverse group of people, with a good range of opinions in both vision and audition.

As far as the over-running problem goes, I'd suggest that each participant be limited to one transparency (for a ten-minute talk) so that the presentation is forced to be simple and clear for the audience.

People need an opportunity to describe their personal work, however. For that purpose I'd suggest having a poster session - that is, each person sets up a booth, where they explain their own work. People could then look into each presentation at their own pace. I've seen this format used successfully at other workshops - a poster, followed by one transparency per presentation. Panel discussions should be very actively moderated to ensure that as many people as possible are involved.

I believe strongly in the importance and value of domain-independent approaches to perception. By developing a system that can be applied to both vision and audition, the design is forced to be clean and clearly understood.

At the moment, solutions in both vision and audition tend to be very messy and confounded between different kinds of processing. Domain-independence is a discipline to focus the mind in distinguishing different parts of the system.

I think that the communication between the vision and audition people was very valuable and potentially beneficial to both. Their subtly different approaches to essentially similar problems can be very instructive and inspiring. I wouldn't specify any explicit ideas that I'll be taking home from the conference, but I would emphasize that wider familiarity with research in various areas of perceptual modeling is extremely important and influential in guiding my own work.

Appendix B: Contact Information for Participants

Touru Abe, Japan Advanced Institute for Science and Technology, beto@jaist.ac.jp

Ted Adelson, MIT Media Lab Perceptual Computing Group, adelson@media.mit.edu

Subutai Ahmad, Interval Research, ahmad@interval.com

Masato Akagi, Japan Advanced Institute for Science and Technology, akagi@jaist.ac.jp

Hiroshi Ando, ATR Human Information Processing Laboratories, ando@hip.atr.co.jp

Aaron Bobick, MIT Media Lab Perceptual Computing Group, bobick@media.mit.edu

Dan Ellis, MIT Media Lab Perceptual Computing Group, dpwe@media.mit.edu

Ichiro Fujinaga, McGill University Faculty of Music, ich@sound.music.mcgill.ca

Yuzuru Hiraga, University of Library and Information Science, hiraga@ulis.ac.jp

Susumu Horiguchi, Japan Advanced Institute for Science and Technology, hori@jaist.ac.jp

Toshio Inui, Kyoto University, inui@kuis.kyoto-u.ac.jp

Yuki Kakita, Kanazawa Institute of Technology, kakita@manage.kanazawa-it.ac.jp

Kunio Kashino, Tanaka Laboratory, University of Tokyo, kashino@mtl.t.u-tokyo.ac.jp

Frank Klassner, University of Massachusetts, Distributed Problem Solving Laboratory, klassner@cs.umass.edu

Tom Knight, MIT AI Lab, tk@ai.mit.edu

Takashi Matsuyama, Okayama University, tm@chino.it.okayama-u.ac.jp

Hamid Nawab, Boston University, hamid@engc.bu.edu

Hiroshi Okuno, NTT Basic Research Labs, okuno@ntt-20.ntt.jp

David Rosenthal, International Media Research Foundation, dfr@media.mit.edu

Alan Ruttenberg, MIT Media Lab Learning and Common Sense Group, alanr@media.mit.edu

Yutaka Sakaguchi, Department of Mathematical Engineering and Information Physics, University of Tokyo, sak@bcl.t.u-tokyo.ac.jp

Shigeyoshi Shimotsuji, Toshiba Information and Communication System Laboratories, gajira@isl.rdc.toshiba.co.jp