_______________________________________________________________________________________________________
Recent research has resulted in a number of computational systems that emulate some aspect of human perception. Most of these can be broadly classified as machine vision or machine audition systems. The purpose of the Abstract Perception Workshop was to explore a straightforward idea: that these various systems have some interesting common aspects, and that understanding these commonalities can lead to improved understanding of how perception works in humans, and how we can endow computers with similar abilities.
To further our understanding of these issues, we gathered an international group of researchers in various fields related to computer perception for three days of lectures and intensive discussion. The event consisted of two main parts: The first day was taken up with a symposium at the JAIST-Ishikawa Hi-Tech Center, at which four American researchers, two each from audition and vision, presented overviews of some of their recent research and ideas. The symposium was open to the public, and was attended by an audience of about 200, mostly drawn from the Ishikawa-area research community. The second part consisted of a closed workshop held at a nearby ryokan. Over the two days following the symposium, about twenty researchers in vision and audition essentially compared research notes and explored issues of commonality among their fields. The workshop consisted (formally) of three planned sessions plus an extraordinary session led by Professor Nawab of Boston University.
This document consists of summary accounts of the symposium and the four sessions. It also includes some notes from individual interviews that were conducted with most of the participants, followed by some of my own conclusions about overall directions and future work.
[Note: The diagram on the cover is a representation of the sound of a
clarinet being interrupted by a can dropping on the floor, produced by
Dan Ellis' dspB sound understanding system.]
Acknowledgments
I'd like to sincerely and gratefully acknowledge the contributions of the following people:
IMRF Board of Directors Chairman Minoru Yoshino, Secretary General Seichi Koizumi, and Dr. Edward David, for their generous support and encouragement of this project,
Prof. Susumu Horiguchi and the JAIST staff for chairing and running the symposium, for helping to organize the workshop, and for able and indefatigable support throughout the planning and execution of this event,
Mr. Kazuo Ohno, Mr. Keiji Yumiyama, Ms. Yoshiko Sasaki, Ms. Yoko Koizumi, for well-thought-out administration and numerous behind-the-scenes activities that were crucial to the success of the workshop,
Mr. Kunio Kashino, for the contribution of his ideas and knowledge of the Japanese research world, and for help during the workshop,
Mr. Dan Ellis and Mr. Alan Ruttenberg, for their constant help and encouragement,
Professor Takashi Matsuyama and Professor Seiji Inokuchi, for sage advice during the planning stages and invaluable support,
Symposium presenters Professor Edward Adelson, Professor Aaron Bobick, Professor Hamid Nawab, Dr. Thomas Knight, for their interesting and well-planned presentations that made the symposium a success,
The workshop participants, who made time in their busy schedules to travel to and participate in the workshop, for their stimulating and insightful contributions.
Representation in Dynamic Scenes
We'll consider three application domains: autonomous navigation (where the world is static and the observer is moving), ballet choreography (fixed observer, moving world), and automatic interpretation of (American) football plays (where both observer and world are in motion).
Part I: Autonomous Navigation
The goal in this project was to develop a system which could use vision to build a structural description of the world which was robust enough to use for navigating a land vehicle. The basic system is as follows: The autonomous land vehicle (ALV) is provided with a rough 3D model of the terrain to be navigated. The model contains gross terrain features such as large trees or rocks; it's derived from a stereoscopic aerial photograph. Mounted on the vehicle is a range finder which provides range data to the navigation system every half second. The system uses the range finder data to fill in the rough model with enough detail about smaller features (rocks, bushes, etc.) to navigate.
The basic idea is to "integrate" the range data over time to avoid gross errors. An aggressive segmenting system is used, i.e., whenever a range image indicates that an object might be present, the system conditionally hypothesizes that it's there; if it is in fact present, its existence will be confirmed in subsequent images, otherwise the hypothesis can be discarded.
This means that a key element of this system is "keeping track" of everything - knowing, for example, whether a blob detected in some image is one that we've detected in a previous image, or a new one.
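The hypothesize-then-confirm bookkeeping described above can be sketched as follows. This is a minimal illustration, not the ALV system's actual method: the class and function names, the nearest-neighbor matching rule, and the thresholds (`match_radius`, `max_misses`, `min_hits`) are all assumptions introduced here.

```python
import math


class Hypothesis:
    """A tentative object: last seen position plus evidence counters."""

    def __init__(self, position):
        self.position = position
        self.hits = 1      # range images in which the object was detected
        self.misses = 0    # consecutive range images without a match


def update(hypotheses, detections, match_radius=1.0, max_misses=2):
    """Fold one range image's detections into the current hypotheses."""
    unmatched = list(detections)
    survivors = []
    for h in hypotheses:
        # Is this a blob we have seen before? Take the closest unmatched
        # detection that lies within match_radius (assumed matching rule).
        best, best_dist = None, match_radius
        for d in unmatched:
            dist = math.dist(h.position, d)
            if dist <= best_dist:
                best, best_dist = d, dist
        if best is not None:
            unmatched.remove(best)
            h.position = best
            h.hits += 1
            h.misses = 0
        else:
            h.misses += 1
        # Discard hypotheses that repeatedly fail to be confirmed.
        if h.misses <= max_misses:
            survivors.append(h)
    # Aggressive segmentation: every unexplained detection becomes a hypothesis.
    survivors.extend(Hypothesis(d) for d in unmatched)
    return survivors


def confirmed(hypotheses, min_hits=3):
    """Hypotheses with enough supporting images to be trusted."""
    return [h for h in hypotheses if h.hits >= min_hits]
```

Calling `update` once per range image keeps a spurious blob alive only for a few frames, while a real object accumulates hits and eventually appears in `confirmed`.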
To handle this kind of situation we need to use object descriptions which can evolve, as shown in this figure. When an object is initially detected by the range finder, it will only be a few pixels in size; we can't say much about its attributes. As subsequent information is obtained, it can be more richly represented - first, we may get some information about its 3D position, then we may get information about its size, then its texture and shape. Still later, we start to be able to build a standard computer-graphics model of the object, where we can talk about its parts and structure.
A key notion in this system is that of stability - that is, as more sensor input about an object is obtained, it should evolve in a coherent way toward a single part model. Knowledge-based processing can enhance stability: when an object description fails to evolve in a stable manner, the system may be able to explain the problem in terms of a sensor characteristic, for example.
The ALV navigation problem involves multiple tasks, such as correcting the initially provided model, localizing the vehicle, avoiding obstacles, predicting what objects will look like from other points of view, and recognizing objects that have already been detected. Each of these tasks requires a different level of representation; to recognize an object from a different point of view, we need a detailed part model, while to localize the vehicle, a simpler representation may suffice.
Conclusions for Part I:
Part II: Dynamic Scene Annotation
Now let's look at some current work in automatically generating annotations of video sequences.
The first domain that we'll consider is the annotation of classical ballet. One reason that classical ballet lends itself to such an approach is that there are a number of strong constraints on what can happen in a scene. First of all, we know that the interesting things in the scene will be the motion of human beings, and therefore we can use the constraints of the human body: we know, for example, that limbs are attached to bodies in certain ways, have certain degrees of freedom, and so on. Another important fact about classical ballet is that there is a predefined language for describing it. Someone who is trained in ballet can unambiguously describe any classical ballet scene in terms of a known notation for choreography.
Suppose we want to represent a given motion of the body - say a motion of the leg. There are a number of alternative schemes for doing this. A problem is that each of these representations has a singular point - that is, a point in representation space, in whose region the representation breaks down. Different representation schemes have their singular points in different regions. This means that we can avoid the problems of singular points by using several representations. When we are near the singular point in one representation, we switch to another which is non-singular in that region.
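As a rough illustration of switching representations near a singular point, the sketch below encodes a limb direction (a unit vector) as two angles. The source does not say which representations were actually used; polar angles measured about two different axes are an assumption made here, as are all function names and the `margin` threshold. Angles about the z axis are singular when the vector points along z, so in that region the code switches to angles about the x axis.

```python
import math


def _clamp(value):
    """Guard acos against rounding slightly outside [-1, 1]."""
    return max(-1.0, min(1.0, value))


def to_angles(v, axis):
    """Unit vector -> (polar, azimuth) measured about the given axis."""
    x, y, z = v
    if axis == "z":
        return math.acos(_clamp(z)), math.atan2(y, x)
    return math.acos(_clamp(x)), math.atan2(z, y)


def from_angles(polar, azimuth, axis):
    """Inverse of to_angles for the same axis."""
    s = math.sin(polar)
    along = math.cos(polar)
    first, second = s * math.cos(azimuth), s * math.sin(azimuth)
    if axis == "z":
        return (first, second, along)
    return (along, first, second)


def choose_axis(v, margin=0.9):
    """Avoid the z-axis chart when v is close to its singular direction."""
    return "x" if abs(v[2]) > margin else "z"


def encode(v):
    """Tag the angles with the chart that was used to produce them."""
    axis = choose_axis(v)
    return (axis,) + to_angles(v, axis)
```

Because each encoded value carries its chart label, a consumer can always decode it with `from_angles`, and neither chart is ever used close to its own singular point.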
The other domain that we'll consider is the automatic annotation of video sequences of (American) football plays. Football, like ballet, is a tightly constrained dynamic environment; the number of things that can happen is, in some sense, bounded and reasonable. Furthermore, there is actually a commercial application for a technology that can automatically figure out football plays from video; millions of dollars of video annotating and editing equipment is sold in the U.S. In the football videos available to us, both the camera and the players are moving. But we can remove the motion of the camera by stabilizing the lines on the field. We can choose some line - say, the twenty-yard line - and reprocess the video so that that line remains fixed. Another thing that we can do is transform the scene into a coordinate system where the yard lines are vertical - the "blimp's-eye" view of the stadium. The transformed video sequence is much more amenable to analysis; such tasks as finding the line of scrimmage, locating the quarterback, and so on, become much easier.
Conclusions:
_______________________________________________________________________________________
Layered Representations in Vision
Prof. Adelson talked about two related topics: vision and image coding. He began his talk with a short discussion of pyramid coding: representing visual information as "pyramids" of images with different amounts of detail. He pointed out that while the image coding community was not initially very enthusiastic about this scheme, it eventually became the basis for widely used commercial image coding schemes, such as that used by Kodak to represent images on CD-ROMs.
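The pyramid idea can be sketched in one dimension. This is a simplified illustration under stated assumptions, not the scheme presented in the talk: it reduces by averaging adjacent pairs and expands by sample repetition rather than using a smoothing filter, it requires the signal length to be divisible by 2 for every level, and all function names are hypothetical.

```python
def reduce_level(signal):
    """Halve the resolution by averaging adjacent pairs."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def expand_level(signal):
    """Double the resolution by repeating each sample."""
    return [s for s in signal for _ in range(2)]


def build_pyramid(signal, levels):
    """Return the coarsest version plus the detail lost at each level."""
    details = []
    current = list(signal)
    for _ in range(levels):
        coarse = reduce_level(current)
        predicted = expand_level(coarse)
        # The detail layer is what the coarse level fails to predict.
        details.append([c - p for c, p in zip(current, predicted)])
        current = coarse
    return current, details


def reconstruct(coarse, details):
    """Rebuild the signal from coarse to fine by adding details back."""
    current = coarse
    for detail in reversed(details):
        current = [p + d for p, d in zip(expand_level(current), detail)]
    return current
```

The detail layers tend to contain many small values, which is what makes this kind of representation attractive for coding; dropping or coarsely quantizing them gives a lower-detail approximation of the original.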
Prof. Adelson then presented the outline of vision research shown in the figure above. Low-level vision and coding, he said, is fairly well understood, and research from this area has been applied, for example, to representation schemes and compression schemes such as JPEG and MPEG. Prof. Adelson's recent interest and work is in mid-level vision.
An important mid-level vision concept is that of layers. Prof. Adelson illustrated the idea of layers with a number of visual illusions and effects.
The Kanizsa illusion shown above demonstrates one way that the vision system perceives objects that aren't really "there"; we have a strong sense that there is a white square in this figure, though there is no direct evidence for it (such as an edge). This is an example of how our vision system tends to interpret the visual world in terms of a number of overlapping layers.
Another is shown in the figure above. Here we interpret what we see as two overlapping transparent squares. Standard vision processing techniques, however, would separate the figure into three smaller figures, by grouping together pixels with similar gray levels, as shown in the next figure:
What we would like is for a vision processing system to segment the scene into two squares, as follows:
Prof. Adelson used a number of other visual illusions to illustrate the importance of layers, and the strong tendency of the human visual system to interpret scenes in terms of layers.
Prof. Adelson also pointed out that layered representations are important when one object is moving in front of another. He illustrated this point with a short video from the MPEG test suite, known as the "flower garden" sequence. In this scene, the camera is moving past a tree, a flower garden, and a house. The tree, garden, and house are all different distances from the camera, and therefore move at different velocities across the scene. Prof. Adelson explained techniques for segmenting video sequences into layers, each of which is moving uniformly with respect to the rest of the scene.
Prof. Adelson (together with John Wang, also of MIT) was able to segment the MPEG flower garden sequence into layers corresponding to the tree, garden and house. The scene can then be reconstructed by moving static images of the layers in front of the viewer in an appropriate manner. This essentially amounts to a significant compression of the video sequence, since all we need to store are the still images together with some information about how they should be moved relative to each other. The compression is not lossless, but the reconstructed scene is nevertheless quite convincing. The layered representation also allows for interesting effects such as removing the tree from the scene.
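The storage idea can be illustrated with a toy one-dimensional renderer. Everything here is an assumption made for illustration: the layer format (a still row of pixels, a starting offset, and a velocity in pixels per frame), the use of `None` for transparent pixels, and the function name. The segmentation of real video into layers, which is the hard part, is not shown.

```python
def render_frame(layers, t, width):
    """Compose frame t by painting moving still layers back to front."""
    frame = [None] * width
    for pixels, offset, velocity in layers:
        shift = offset + velocity * t
        for i, pixel in enumerate(pixels):
            x = i + shift
            # Opaque pixels of nearer layers overwrite what lies behind.
            if pixel is not None and 0 <= x < width:
                frame[x] = pixel
    return frame


# A static "house" background and a one-pixel "tree" moving 2 pixels per frame.
house = (["h"] * 8, 0, 0)
tree = (["T"], 0, 2)
sequence = [render_frame([house, tree], t, 8) for t in range(4)]
```

Only the two still layers and their motion parameters are stored, however many frames are rendered. Leaving a layer out of the list (for example `render_frame([house], t, 8)`) corresponds to removing the tree from the scene.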
Layered representations are also useful for converting frame-rates. It's often necessary to convert video or motion-picture sequences from one frame-rate to another. Traditional techniques for doing this - essentially, interpolation techniques - don't produce satisfactory results (they produce blurred or double images). Using a layered representation, frame-rate conversion can be accomplished in a more natural and effective way.
In sum: in both biological and computer vision, the problem is to generate representations which are both useful and efficient. We'd like to move beyond low-level, signal-processing-based methods, to mid-level and someday high-level methods. We can look forward to a future where these more advanced methods will play a role in coding and representing images.
_____________________________
IPUS: An Architecture for Integrated Signal Processing and Signal Interpretation
Prof. Nawab talked about the IPUS ("Integrated Processing and Understanding of Signals") architecture, designed and built by a group headed by himself and Prof. Victor Lesser of the Distributed Artificial Intelligence Laboratory at the University of Massachusetts, Amherst.
The IPUS system was built to process and understand natural sound, with particular focus on non-speech sounds: sounds such as doorbells, vacuum cleaners, hair-dryers, and so on that might naturally occur in some environment. Prof. Nawab contends that the basic concepts behind IPUS are applicable to other, perhaps non-auditory, domains as well.
Traditional machine perception architectures can be diagrammed as follows:
The key point here is the one-way flow of information between the "front-end" module and the "interpretation knowledge sources" module. This means that once the initial signal processing is performed, the front-end is no longer concerned with that particular input. Whatever it produces must be adequate for the subsequent higher level processes to do their job.
There are problems, however, with using a fixed front-end strategy in a real-world environment. Signal-to-noise ratios vary, several objects may occur at once, signal signatures may interfere with each other, and object behavior is difficult to characterize and predict. As a result, a fixed front end (or signal processing algorithm, SPA) is not in fact adequate for real-world tasks.
Prof. Nawab illustrated some instances of how unitary SPAs break down in real-world situations. He also noted that one of the reasons for the failure of unitary SPAs is the Uncertainty Principle - in a linear signal processing system, resolution in frequency must be traded off against resolution in time. Different signal understanding tasks, however, have different time and frequency resolution requirements.
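The trade-off can be made concrete for a windowed Fourier analysis: a window of N samples at sampling rate fs spans N/fs seconds, and its analysis bins are spaced fs/N Hz apart, so the product of the two is fixed at 1. The sketch below works through that arithmetic. The function names, the 8 kHz sampling rate, and the example tones are assumptions, and `can_resolve` is only a rule of thumb (real resolvability also depends on the window shape).

```python
def resolution(window_len, sample_rate):
    """Return (time resolution in seconds, frequency resolution in Hz)."""
    time_res = window_len / sample_rate
    freq_res = sample_rate / window_len
    return time_res, freq_res


def can_resolve(f1, f2, window_len, sample_rate):
    """Rule of thumb: tones must differ by at least one bin spacing."""
    _, freq_res = resolution(window_len, sample_rate)
    return abs(f1 - f2) >= freq_res


for window_len in (64, 512):
    time_res, freq_res = resolution(window_len, 8000)
    separable = can_resolve(1200, 1250, window_len, 8000)
    print(window_len, time_res, freq_res, separable)
```

At 8 kHz, a 64-sample window localizes events to 8 ms but has 125 Hz bins, so by this rule it cannot separate tones at 1200 Hz and 1250 Hz; a 512-sample window has 15.625 Hz bins and can separate them, but only localizes events to 64 ms. No single fixed choice suits both a task that needs fine timing and one that needs fine frequency detail.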
Prof. Nawab outlined a number of approaches to the problem that various investigators have used:
Prof. Nawab contends that due to the model variety problem, using multiple SPAs to try to cover every contingency is impractical - the real world consists of too many such contingencies.
The IPUS approach is to control the interaction between the interpretation knowledge sources (KSs) and the SPAs.
In this model, the space of interpretations and the space of SPAs are searched simultaneously. The system attempts to converge on both an SPA which produces data appropriate to the circumstances, and a consistent interpretation of the data. A key component of this strategy for dynamically finding appropriate SPAs is discrepancy detection. Prof. Nawab distinguishes three classes of discrepancies:
Prof. Nawab gave examples of these three kinds of discrepancies.
Once a discrepancy has been detected, it must be diagnosed. A diagnosis is a sequence of operators where each operator represents a signal processing effect. One operator, for example, might embody the knowledge that a certain SPA will not distinguish frequencies that differ by less than a certain amount. Such an operator corresponds to a hypothesis that certain frequencies were present in the input which are not present in the output.
A goal of differential diagnosis is to remove ambiguity. For some inputs, the sound understanding levels will produce several competing explanations. Differential diagnosis attempts to adjust the SPAs so as to converge on a single explanation. Prof. Nawab illustrated the disambiguation procedure with the following example:
In this example, we are assuming that the signal processing output (on the right) was produced either by Source A or Source B, but we don't know which. Source A has two relatively close frequency bands in the 1200 Hz range, while Source B has a band at about 1200 Hz and another, lower energy band at about 2200 Hz. Either source could have produced the output on the right - if Source A produced it, the problem is that the algorithm being used is failing to resolve the two frequency bands; if Source B produced it, the problem is that the lower energy band is below a power threshold, and is consequently being eliminated.
In this case, the differential diagnosis strategy proceeds as follows:
Prof. Nawab concluded with a few overall facts about the program (it
consists of 1500 Kb of LISP code; the source library has 35 real-world
sources such as hair dryers, footsteps, fire alarms, etc.). He also
stated that while the principal application of the IPUS architecture
to date has been in the audio domain, it has also been applied to the
radar domain, and he believes that it could be appropriate to other
domains as well.
_____________________________
Lessons in Perception from Mammals
Tom Knight, MIT AI Lab
There are a number of tasks that humans (and other mammals) are able to do, that we'd like to understand better:
While many techniques emphasize regenerability, it should be pointed out that what we really want to do is discard most of the information, keeping only what's important. This is a non-obvious task; for example, the fact that one can discard some part of a speech signal and still have it understood by a human listener does not necessarily mean that the discarded part wasn't important.
Most current audio processing efforts - and in particular, most speech understanding efforts - make exclusive use of linear processing schemes: FFTs or constant-Q filter banks. Note that linear schemes will always have to contend with the uncertainty principle: resolution in the time domain must be traded off against resolution in the frequency domain.
What, on the other hand, do we know about mammalian auditory physiology?
Though it seems obvious, we should remember that we have two ears. (Most current speech understanding research is monaural.)
The mechanical part of our hearing system can be divided into three stages: The outer ear, which acts as a passive filter and is useful for determining source location, the middle ear, which may be thought of as an impedance transformer together with an automatic gain control (the stapedius muscle), and the cochlea.
The cochlea may be thought of as a cylinder divided lengthwise into two fluid-filled chambers, separated by the basilar membrane. The basilar membrane is narrow and thick at one end, and wide and thin at the other end, meaning that certain sections of it are tuned to certain frequencies. It's therefore possible to think of the structure as a mechanical filterbank.
The tuning curves for this filterbank can be experimentally measured. The filters are band-pass with extremely sharp cutoff at the high end, and very flat cutoff at the low end.
The low end cutoff of each tuning-curve is non-fixed, due to a phenomenon called two-tone inhibition: if two tones are within the pass band of one of the filters, the filter will only respond to the higher frequency tone. The place theory of hearing says that the perceived frequency of a sound depends on which nerves attached to the basilar membrane are most active.
If we look at the firings of the nerves which are attached to the basilar membrane, we find that the periods of the nerve firings tend to match the peak amplitude times of the incoming sound - in other words, the frequency information, in addition to being encoded in which nerve is firing (the place theory), is also encoded in how often the nerve is firing - the period theory. What follows are some speculations as to how this system can finesse some of the limitations imposed by the Uncertainty Principle.
Suppose we have two signals, A and B, in the pass band of one of the cochlear filters, where the frequency of A is lower than that of B. B, in other words, is closer to the sharp high-frequency cutoff of the filter. The filter will respond more quickly to A than to B.
Now consider the case of two overlapping filters, X and Y, where X's cutoff is lower than that of Y. We apply signal A, which is in the passband of both filters. A is closer to the band edge of filter X than to that of filter Y; therefore Y will respond more quickly to it than X will. Note that A's frequency will be encoded on the fibers attached to both filters.
Now consider the case where we have two overlapping filters, X and Y, (with X's cutoff lower than that of Y, as before), and signals A and B, where A is in the passband of X and Y, and B is in the passband of Y, but beyond the cutoff of X. Because of two-tone inhibition, Y will not respond to signal A. In other words, Y responds only to B, and X responds only to A. In both cases, the frequency is encoded in the nerve firing rates. Also, in both cases, the response will be relatively slow, since each signal is close to the cutoff frequency of the filter that's reporting it.
In conclusion: we can, at least in a certain sense, have both high frequency resolution and high temporal resolution, though the temporal resolution is lower in the case of two signals that are close in frequency.
Dr. Ahmad then proposed that given that the two areas appear to be very similar, we might look for common principles by which they operate. In this connection, Dr. Ahmad suggested that we look at certain connectionist research: in particular, the Hebb rule/maximal information preservation, and Kohonen maps. Dr. Ahmad observed that when a visual stimulus is input to a Kohonen map, structures are developed which resemble those actually present in the visual cortex.
Dr. Ahmad then mentioned some results concerning various low-level visual learning experiments, such as learning to discriminate orientations and learning to read at an angle. Several of the vision researchers present were familiar with these experiments, and there was some discussion of the issue.
Dr. Ahmad then went on to mention the "binding problem," that is, the problem of integrating information across several topographic maps. He stated that attention is thought to play a role in this connection. There appear to be particular areas of the brain that play a role in attention - the pulvinar and the superior colliculus. This mechanism may be the same for both audition and vision.
_______________________
Prof. Akagi concluded by suggesting that better methods for auditory segregation might be possible if we can integrate time-domain techniques with the more usual spectrogram techniques. As far as commonalities go, Prof. Akagi concluded that vision and audition are not that similar at the lower levels of processing, where, he believes, sound source separation takes place.
[Prof. Matsuyama commented that vision and audition might be more similar when the time axis is taken into account in vision (as is often not the case), and that in this case, a technique similar to Prof. Akagi's might be useful in vision processing.]
_________________________
Dr. Ando then proceeded to discuss relations between vision and audition from each of these three points of view.
Computational Theory:
The environment - natural objects and events, written and oral language, art, music, etc. - sends input to both the visual system and the auditory system. The visual input consists of light intensities and color naturally represented in 2D space and time, and the sound input consists of sound pressure waves, naturally represented as amplitude and time. Dr. Ando notes that these primary representations of images and sounds are basically different; they have a different dimensionality, for example. The goals of the visual and auditory systems, on the other hand, are similar; in both cases we are trying to interpret the information so as to take actions appropriate to the environment.
Representations and Algorithms:
The level of representations and algorithms may be further divided into three sublevels:
Low-level processing:
In the initial stages of vision processing, there is a local smoothing of spatio-temporal derivatives; at this stage the system also detects some discontinuities, fills in some missing data, and segments the scene into transparent layers. Dr. Ando conjectures that local smoothing of temporal derivatives may also play a part in low-level audio processing, and that the correlate of the filling-in and transparency operations would be sound source separation.
Dr. Ando then demonstrated the notions of filling-in and discontinuity detection in vision by briefly explaining some work in the interpretation of visual motion (Geman and Geman, 1984); the work concerns motion detection processes thought to be part of low-level vision.
Mid-level processing:
In vision, this level is concerned with the reconstruction of the 3D structure and motion of objects, using such cues as stereo, motion, shading and texture, and such physical constraints as smoothness and rigidity. We also reconstruct the surface properties of the objects in the scene, and separate reflectance from illumination. In audition, at this stage, we reconstruct the location and motion of sound sources, and attempt to reconstruct their material properties - the sounds of breaking, cracking, crunching, tapping, and so on, for example, are clues as to whether the material in question is wood, steel, glass, rubber, etc. Dr. Ando was unsure as to the role of physical constraints in audition. [This question generally provoked a good deal of discussion during this session.]
Dr. Ando then illustrated mid-level vision processing with some of his own work on recovering surface structure from motion.
High-level processing:
In vision, one high-level processing task is the recognition and categorization of 3D objects. This involves extracting information from an object's image which is invariant across different viewpoints, and both supervised and unsupervised learning. Another high-level vision task is active vision - visually guided actions such as reaching, grasping, manipulating, navigating, and so on.
In audition, the corresponding task is the identification of sound sources. This, too, involves the extraction of invariant features, and various kinds of learning. We can also talk about active audition, that is, using auditory information to guide action: friend/foe recognition, navigation, etc.
Dr. Ando then outlined an approach to visual 3D object recognition and invariant feature extraction.
Hardware Implementation:
In this connection Dr. Ando mentioned work by Sur, Pallas and Roe (1990), which is a follow-up to the work on ferrets mentioned by Dr. Ahmad.
___________________________
D. Ellis, MIT Media Lab Perceptual Computing Group
Mr. Ellis presented an overall model of psychoacoustic grouping. The scheme first uses a constant-Q filterbank that models the cochlea to re-represent the sound as a set of tracks, which are contiguous high-energy regions in time-frequency space. The tracks are then hierarchically grouped in such a way that the top-level groupings correspond to real-world sound sources.
Mr. Ellis then proceeded to "abstract" this model, that is, to construct a generic perceptual model of which the psychoacoustic grouping scheme may be considered a (partial) instantiation.
Generic perceptual processing
Mr. Ellis suggests that the underlying reason behind the similarities in vision and audition is that both senses are aimed at maintaining an internal model of the world.
In the ensuing discussion, one of the questions raised was how closely machine realizations of perception match human perceptual processing. This led to the related question, "how many different ways are there of accomplishing perceptual tasks?" - so for example, if there's only one, then any successful perceptual model must be essentially the same, and successful machine models will necessarily work in the same way as human perception.
___________________________
Prof. Hiraga took the opportunity to reconsider the discussion that had taken place up until the time of his talk.
Prof. Hiraga first suggested that we had not yet found the right level of discussion to talk about similarities and differences in perceptual modes. He suggested that the level of abstraction considered by Ahmad - the level of neurons - was too low. Rosenthal's level, on the other hand, was too abstract.
Vision and audition, Prof. Hiraga suggested, might be inappropriate examples for comparison of perceptual modes. In vision, what we sense is a passive property of the perceived object - the way that it reflects light. In audition, on the other hand, the stimulus that gives rise to the sense comes from the vibration of the source.
Prof. Hiraga suggested that at the lower levels of perception, the modes were quite different. Commonalities might arise at higher levels, where invariant features are extracted. Many investigators (e.g. Gibson) have pointed this out.
There may be some connections at intermediate levels, as evidenced by certain analogies (e.g., a "bright" sound). Prof. Hiraga did not think that these kinds of connections would prove very interesting, however.
__________________________
Mr. Shimotsuji contended that "real" machine perception systems - those which have practical applications - make such heavy use of the context of the application that there cannot be any domain independence.
Mr. Shimotsuji illustrated these points with two systems developed at Toshiba. The first system is one which automatically reads cable-routing diagrams. Electricity providers in large cities maintain extensive diagrams of underground cables, and they would like for these diagrams to be machine-readable. The Toshiba system leverages constraints on the format of these diagrams to determine the meaning of their various features. The second application is one which interprets video images taken from a camera in a supermarket. This system also makes use of information particular to the environment in question: which parts of the scene can be expected to be stationary, which parts move, and so on.
Practically successful systems are those in which the system essentially classifies the input in terms of known patterns. The basic problem facing the builders of such systems is to extend them so that they cover enough cases to meet the demands of the user.
__________________________
Session 2: Knowledge (Moderator: E. Adelson)
How does knowledge aid perception? How do we integrate signal processing and symbolic processing modes? How do we integrate top-down and bottom-up processing?
Prof. Inui stated that the basic problem in image processing is the reconstruction of the 3D world from 2D images. He pointed out that this problem is generally ill-posed - i.e., there is not actually enough information in 2D images to reconstruct the 3D objects which gave rise to them.
The fundamental approach taken by Prof. Inui to this problem is to minimize an energy function of the form:
E = (data fitting term) + λ (smoothness constraint term)
Here it is assumed that we have a 3D model of a visual scene which should match the scene actually presented to the vision system; the degree to which it fails to match (if, for example, the 3D model would predict a bright spot in an area where no bright spot is actually detected) is represented in the "data fitting" term. We also make some assumptions about the nature of the 3D objects involved, for example, that they are relatively smooth. The degree to which the model breaches this assumption is represented in the "smoothness constraint" term. The weight λ represents the relative importance given to the two considerations.
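To make the trade-off between the two terms concrete, here is a minimal sketch on a one-dimensional signal rather than a 3D scene. It is not Prof. Inui's model: the squared-error data term, the squared-difference smoothness term, the gradient-descent solver, and the names `energy`, `minimize_energy`, `lam`, `step`, and `iters` are all assumptions chosen for illustration.

```python
def energy(u, d, lam):
    """E(u) = sum_i (u_i - d_i)^2 + lam * sum_i (u_{i+1} - u_i)^2."""
    data_fit = sum((ui - di) ** 2 for ui, di in zip(u, d))
    smoothness = sum((u[i + 1] - u[i]) ** 2 for i in range(len(u) - 1))
    return data_fit + lam * smoothness


def minimize_energy(d, lam, step=0.05, iters=2000):
    """Plain gradient descent on E, starting from the observed data d.

    The fixed step size is an assumption; it is small enough to be
    stable for lam up to roughly 4 (step must stay below 2 / (2 + 8 * lam)).
    """
    u = list(d)
    n = len(u)
    for _ in range(iters):
        grad = []
        for i in range(n):
            g = 2.0 * (u[i] - d[i])                  # data fitting term
            if i > 0:
                g += 2.0 * lam * (u[i] - u[i - 1])   # smoothness, left neighbor
            if i < n - 1:
                g += 2.0 * lam * (u[i] - u[i + 1])   # smoothness, right neighbor
            grad.append(g)
        u = [ui - step * gi for ui, gi in zip(u, grad)]
    return u
```

With `lam = 0` the result reproduces the data exactly. Larger values flatten isolated spikes, trading fidelity to the data for smoothness.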
Prof. Inui showed a complex map that he and his colleagues have developed of the visual neural pathways of the brain. He noted that all of the connections between various processing centers represented in the map are reciprocal - that is, the data flows in both directions.
Prof. Inui's working framework is that the complete inverse optics function, R⁻¹, doesn't exist, and that the brain therefore calculates an approximate inverse optics function, R#. From this, the brain develops an internal model of the visual world, from which it can compute a forward optics function, R, whose result can be checked against the original image. The fact that there are two computational directions involved in this model (R# and R) is consistent with the bi-directionality observed in the neural pathways.
Prof. Inui has implemented a model of vision along these lines, which exhibits robust performance on shape-from-shading tasks.
Prof. Inui concluded this part of his talk with the following observations:
In the second part of his talk, Prof. Inui illustrated the role of knowledge in vision with descriptions of an experiment by Gilchrist (1977) and a number of optical illusions.
[In the ensuing discussion, Prof. Matsuyama asked how the weight λ is calculated in the energy minimization equation. Prof. Inui replied that he didn't know, but that the brain apparently computes it dynamically, based on an estimate of the amount of noise present.]
__________________________
Mr. Kashino then outlined some approaches to knowledge representation used in machine perception systems, e.g.:
Thirdly, Mr. Kashino posed the question, "How can we control knowledge-based processing?" He outlined three approaches:
Mr. Kashino then illustrated his points by describing the sound source separation system that he is working on. He contends that perceived sound is organized hierarchically, as in the following diagram:
At the lowest (leaf) levels of such a hierarchy, sound may be described in terms of parameters (e.g., this much energy in this time-frequency region). The problem of separating the sound according to source may then be seen as one of obtaining parameters for the individual sounds from the set of parameters for the composite sound. Mr. Kashino proposes a model of sound source separation represented in the following diagram:
A good deal of the knowledge necessary for sound source separation is in the form of "cues" that several components should be grouped together. Some useful cues are:
Mr. Kashino then illustrated his scheme for clustering sound components, which uses the Dempster-Shafer rule. His clustering methods are consonant with psychoacoustical data obtained from experiments that Mr. Kashino ran. Finally, Mr. Kashino discussed the results of some benchmark tests of his system, and outlined some future directions.
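For readers unfamiliar with the Dempster-Shafer rule, the following sketch shows the standard rule of combination for two bodies of evidence. It is not Mr. Kashino's clustering scheme. The function name `combine` and the representation of each mass function as a dictionary from `frozenset` hypotheses to masses are assumptions made for illustration.

```python
def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass, and
    the masses of each function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            common = a & b
            if common:
                # Agreeing evidence supports the intersection.
                combined[common] = combined.get(common, 0.0) + mass_a * mass_b
            else:
                # Contradictory evidence is set aside as conflict.
                conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("the two sources are in total conflict")
    # Renormalize so that the remaining masses sum to 1.
    return {h: mass / (1.0 - conflict) for h, mass in combined.items()}
```

As a hypothetical use, one cue might assign mass 0.6 to "these two components share a source" and leave 0.4 uncommitted, while a second cue assigns 0.5 to "same source", 0.2 to "different sources", and 0.3 uncommitted. Combining them yields a single mass function in which the belief that the components share a source is higher than under either cue alone.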
____________________________
Frank Klassner, University of Massachusetts, Distributed Problem Solving Laboratory
The IPUS (Integrated Processing and Understanding of Signals) system on which Mr. Klassner has worked follows this strategy. Knowledge sources (KSs) - that is, programmatic embodiments of knowledge of the domain, such as a program that generates words from strings of syllables - are used to control the initial processing of the input.
_______________________
Relations between objects constrain the interpretation of a scene. The process of detecting features and then determining their relationship is one which is used at all levels of vision processing.
Implicit in this scheme, however, is the idea that image features will be accurately detected. What kinds of schemes can support the possibility of a feature's not being properly detected? One such scheme is a multi-agent, or cooperative reasoning scheme. Prof. Matsuyama noted that the same kind of problem arises in audio processing, and the same kind of solution has been shown to be useful.
Prof. Matsuyama addressed the question "How do we integrate bottom-up and top-down reasoning?" He illustrated two kinds of processes that are part of the SIGMA vision system. In the first, the "bottom-up" case, a feature s is detected which is normally in a certain spatial relation to a feature f(s). Furthermore, a feature t with the appropriate characteristics is detected in the position where f(s) is expected. We then establish the relation REL(s,t).
In the second, "top-down" case, the feature s is detected, but no corresponding t is detected. Furthermore, another detected feature, u, also expects t to be in the same place. We then hypothesize the existence of t. We may, in turn, lower some detection threshold in the region where t is expected.
Prof. Matsuyama illustrated this principle in the context of an aerial photograph interpretation problem, where a house and a road played the roles of s and u, respectively, and a difficult-to-detect driveway played the role of t.
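The two cases can be sketched in a few lines. This is only an illustration of the reasoning pattern, not the SIGMA system: the `Feature` class, the `interpret` function, the positional tolerance, and the rule that two independent expectations suffice to hypothesize a missing feature (`min_support`) are all assumptions. The sketch also assumes that expected positions are expressed on a shared grid, so that two expectations for the same place compare equal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    kind: str
    pos: tuple  # (x, y) position in the image


def near(p, q, tol=1.0):
    """True if two positions agree to within tol on both axes."""
    return abs(p[0] - q[0]) <= tol and abs(p[1] - q[1]) <= tol


def interpret(detected, expectations, min_support=2):
    """Return (relations, hypotheses).

    expectations is a list of (source, expected_kind, expected_pos)
    triples: the detected feature `source` expects a feature of
    `expected_kind` at `expected_pos`.
    """
    relations = []
    unmet = {}  # (kind, pos) -> sources whose expectation was not met
    for source, kind, pos in expectations:
        match = next((t for t in detected
                      if t.kind == kind and near(t.pos, pos)), None)
        if match is not None:
            # Bottom-up case: the expected feature was detected.
            relations.append((source, match))
        else:
            unmet.setdefault((kind, pos), []).append(source)
    # Top-down case: several features expect the same undetected
    # feature, so its existence is hypothesized.
    hypotheses = [Feature(kind, pos)
                  for (kind, pos), sources in unmet.items()
                  if len(sources) >= min_support]
    return relations, hypotheses
```

In the aerial photograph example, a detected house and a detected road would each contribute an expectation for a driveway at the same position. If no driveway is detected there, `interpret` returns it as a hypothesis, which a fuller system could use to re-run detection with a lower threshold in that region.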
Finally, Prof. Matsuyama presented some recent research on using a cooperative-agent scheme to segment a scene into regions. He illustrated two cases, in which there were two different cooperation schemes; the one where agents were better informed about each other's state segmented the scene in a more satisfactory manner.
_________________________
Mr. Sakaguchi's interest is in sensory integration and active perception. In active perception, an array of sensors of various types is used in a goal-driven manner by a mechanism which is trying to establish particular facts about the outside world. This mechanism corresponds, in an obvious way, to the mechanism of attention.
Mr. Sakaguchi's model constructs an internal image of the presented object. The internal image has an entropy, or ambiguity value, associated with it. The model decides which sensor to use according to the "mutual information" criterion, that is, by choosing the sensor expected to reduce the entropy of the internal image of the presented object the most.
Mr. Sakaguchi illustrated these principles with two systems that he has worked on. One system is a haptic recognition system, which attempts to distinguish among various kinds of materials using touch: a sensor attached to a robot arm rubs the material; various sensors from the robot are input into a model of the kind shown above. The other is an active vision system; it attempts to recognize figures by moving an optic sensor to an appropriate position.
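The sensor-selection criterion can be sketched for a discrete case. This is an illustration under stated assumptions, not Mr. Sakaguchi's model: the internal image is reduced to a probability distribution over candidate objects, each sensor is described by a table `likelihood[obj][obs]` giving the probability of each reading for each object, and the function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def posterior(prior, likelihood, obs):
    """Bayes update of the belief over objects after one sensor reading."""
    joint = [prior[o] * likelihood[o][obs] for o in range(len(prior))]
    total = sum(joint)
    return [j / total for j in joint]


def expected_information_gain(prior, likelihood):
    """Mutual information between the object and the sensor reading:
    current entropy minus the expected entropy after the reading."""
    gain = entropy(prior)
    for obs in range(len(likelihood[0])):
        p_obs = sum(prior[o] * likelihood[o][obs] for o in range(len(prior)))
        if p_obs > 0:
            gain -= p_obs * entropy(posterior(prior, likelihood, obs))
    return gain


def choose_sensor(prior, sensors):
    """Pick the sensor whose reading is expected to reduce entropy most."""
    return max(sensors,
               key=lambda name: expected_information_gain(prior, sensors[name]))
```

A sensor whose readings are equally likely for every object has zero expected gain and is never preferred over one whose readings discriminate between the candidates.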
In conclusion, Mr. Sakaguchi surveyed the role of knowledge in sensory integration and active perception.
Following Mr. Sakaguchi's presentation, there was some discussion of the role of learning, and its relation to how knowledge is used in perception.
_________________________
Extraordinary Session: "What Can Audition People Learn From Vision People," led by H. Nawab
In view of the fact that there is more history to the vision problem than to the audition problem, Prof. Nawab felt it would be interesting to pose the question, "what can audition people learn from vision people?" In particular, what mistakes have vision people made that audition people might be able to learn from? What, on the other hand, is generally agreed to have worked well in vision?
Prof. Bobick answered with an account of intrinsic images. An intrinsic image is a set of registered "images" describing scene surfaces with respect to depth, surface orientation, reflectance, and incident illumination. It was believed at one time that intrinsic images could be calculated by independent modules, such as "shape from shading" or "structure from motion." But vision researchers don't generally regard intrinsic images as a good overall strategy anymore; these modules, it's now generally thought, cannot be independent.
Bobick also pointed out that a factor which played a role in this development was a movement, in the early 1980s, to divorce AI from vision.
Tom Knight observed that lack of sufficient computational power was an important factor in vision development. Small memories and slow computers mean that only a few examples can be tried, and it's not practical in many cases to investigate how well two algorithms might work in tandem. Audition, he pointed out, involves order-of-magnitude lower information bandwidth than vision.
Prof. Matsuyama pointed out that since vision research seems to have been more or less stalled in the last 15 years, a new approach might be in order. Until now, most vision research has taken place on static images. Perhaps dynamic vision - that is, considering how scenes change in time - is such a new approach. Prof. Matsuyama pointed out that very little is known about dynamic vision; it's not known, for example, how much static vision research is applicable. Since audition research, because of the nature of sound, has never had any choice but to be dynamic, perhaps this is an area where vision people can learn something from audition people.
Prof. Adelson made a somewhat similar point. In vision, he said, it has generally seemed meaningful to talk about point-wise properties in an image. One intrinsic image idea, for example, is to say something like "this pixel corresponds to the surface of this object, which has this color, this distance." In audition, on the other hand, one doesn't point to a voltage at a given time and say "that's a clarinet;" a clarinet is only reflected in a complex of voltages. But perhaps this difference is really illusory; recent work which considers transparency or motion indicates that in vision, as well, it is a mistake to talk about point-wise properties. "Vision people have been able to fool themselves," he said, "while audition people have never had this illusion."
Rosenthal noted that while Bobick had criticized some vision research, particularly "intrinsic images," as being overspecialized and insufficiently integrative, audition research appeared to have a similar problem - most research is directed to one very specialized area, to wit, speech understanding.
This led to a discussion of the ARPA Speech Understanding Project, and in particular of a well-known incident in which a speech understanding system interpreted a cough as the phrase "pawn to king four." Dr. Knight noted that the ARPA project corresponded to a phase during which very little attention was paid to lower-level signal processing issues, in the belief that all of these problems could be solved with knowledge-based processing at a higher level.
Dr. Knight also contended (addressing the earlier discussion of Rosenthal and Bobick) that independent modules for processing special kinds of sounds (such as speech) might still be viable, though the issue of coordination between them can't be completely ignored.
Ruttenberg raised the issue that we can understand black and white cartoons even though there are no issues of color, transparency, etc.
Adelson noted that the most-often-used basic representation in audition research is a time/frequency (sonogram) representation. He asked if there were others. Nawab replied that as far as he knew, time/frequency representations seemed to be the most common. He also noted that the idea of a time/frequency representation is not mathematically very well-based, and that it's a mistake to confuse frequency and the result of the Fourier transform.
Ellis contended that the reason a time-frequency representation is appropriate is that that's what the cochlea does. Adelson asked why it is that the cochlea does this; this was followed by a discussion of representations for hearing.
Bobick raised some points about the biological motivation of hearing and vision, and its relation to knowledge; there was some general discussion of this issue.
___________________________
Session 3: Appropriate Domain (Moderator: D. Rosenthal)
What is the appropriate domain for "abstract perception?" What kinds of problems or methods are common to several perceptual domains, and which are specific to a particular one? What can we hope to gain by finding out?
The basic method consists of the following steps:
The 3D data is originally extracted from the face using color-encoded structured light. Face data is transformed to a coordinate system so that the origin is at the tip of the nose, the face lies in the x-y plane, and the nose points up along the z-axis. Once the vertices are extracted, the Euclidean distance between two vectors of vertices is used to determine whether they match or not.
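The final matching step can be sketched as follows. The source states only that the Euclidean distance between two vectors of vertices decides whether they match. The flat list layout of a vertex vector, the acceptance `threshold`, the gallery dictionary, and the function names are assumptions made for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vertex vectors."""
    if len(a) != len(b):
        raise ValueError("vertex vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(probe, gallery, threshold):
    """Return the label of the closest stored vector, or None if even
    the closest one is farther away than the (assumed) threshold."""
    label, vector = min(gallery.items(),
                        key=lambda item: distance(probe, item[1]))
    return label if distance(probe, vector) <= threshold else None
```

Because every face is first transformed to the same nose-centered coordinate system, corresponding entries of two vectors describe comparable locations, which is what makes a plain distance between them meaningful.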
Prof. Abe then explained in detail the process of creating the B-spline surface from the original data, and of determining the vertices.
Finally, Prof. Abe reported the results of testing the system on a dataset of 165 faces, derived from 33 persons, with 5 data sets per person. In this trial, the identification accuracy of the system was 99.4%. Prof. Abe mentioned that a system using 2D data achieved 80.5% accuracy.
_____________________________
Mr. Fujinaga has been working on a system to optically recognize music scores. Since the ways in which musical symbols are drawn are very numerous, vary from publisher to publisher, and are constantly growing in number, creating a database of such symbols by hand is impractical, and the system must be endowed with the ability to learn.
Mr. Fujinaga used a nearest-neighbor method for classifying symbols. This method is advantageous in that accuracy improves over time; however, the system becomes slower, in contrast with biological systems, whose time-performance usually improves as they learn.
The reason that the system's time performance degrades as it learns more is that the system stores every symbol that it sees, and then compares every new symbol with every stored symbol. It seems reasonable to assume, however, that matching each new symbol against large numbers of previously-seen symbols is probably unnecessary, and that a subset of such symbols would be sufficient. If such a subset were found, the system's time-performance could be improved. Mr. Fujinaga applied learning techniques to this problem as well. Furthermore, Mr. Fujinaga applied learning techniques to the problem of adjusting the weights given to various features (moments) of the stored symbols. Finally, Mr. Fujinaga maintained statistics on which of several learning algorithms produced the best results, so that the system could "learn how to learn."
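The trade-off described above is visible in even a minimal nearest-neighbor classifier. The sketch below is not Mr. Fujinaga's system: the class name, the weighted squared-distance measure, and the use of plain feature lists are assumptions. It stores every example it is shown, so each classification compares the new symbol against the entire store.

```python
class NearestNeighborClassifier:
    """Nearest-neighbor classification with per-feature weights."""

    def __init__(self, weights):
        self.weights = list(weights)   # one weight per feature
        self.examples = []             # grows with every symbol learned

    def learn(self, features, label):
        """Store a labeled example; nothing is ever discarded."""
        self.examples.append((list(features), label))

    def _distance(self, a, b):
        """Weighted squared Euclidean distance between feature vectors."""
        return sum(w * (x - y) ** 2 for w, x, y in zip(self.weights, a, b))

    def classify(self, features):
        """Return the label of the closest stored example.

        The cost is proportional to the number of stored examples,
        which is why the classifier slows down as it learns.
        """
        if not self.examples:
            raise ValueError("no examples have been learned yet")
        _, label = min(self.examples,
                       key=lambda ex: self._distance(features, ex[0]))
        return label
```

Setting a feature's weight to zero makes the classifier ignore it, so adjusting the weights changes which stored symbol counts as nearest. Pruning `examples` to a representative subset would address the speed problem.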
Following Mr. Fujinaga's presentation, there was discussion about which aspects of the system are applicable to other perception problems, in particular, to audio processing problems.
_____________________________
Prof. Kakita introduced certain perceptual effects relevant to his work, which indicate organization of perceptual information by high-level processes (in this case, those responsible for understanding speech). The first is illustrated by the following diagram:
The sounds represented in A and B in the above figure are the component formants of a vowel. When they are heard separately, they are perceived as distinct tones; when they are heard together, as in C, we hear only one tone, corresponding to the vowel.
Another effect relates to the perception of /L/ and /R/ sounds, a distinction many Japanese have difficulty with.
The distinction between "la" and "ra" is contained in the third formant, as the diagram above shows. When Japanese subjects hear the third formant in isolation, they have no trouble distinguishing the two sounds; when the three formants are heard together, however, Japanese listeners perceive the two as sounding the same, indicating that higher-level, learned processing is causing the distinction to be ignored.
Prof. Kakita referred to the theory of the Swiss linguist Ferdinand de Saussure, who proposed that "auditory images" corresponding to various speech sounds are formed in the brain; the specific speech sounds in question are arbitrary and therefore vary from language to language.
Prof. Kakita also mentioned the hypothesis proposed by the Haskins Laboratories (Connecticut, USA) speech research group, which says that we have "an image of the speech organs, and of their motions," somewhere in the brain. Prof. Kakita and his colleagues have expanded this theory, and found theoretical evidence pointing to a geometric mapping of tongue shape in the cognitive levels of the brain. Prof. Kakita suggests that such a mapping may lie in the intersection of auditory and visual processing, and he is therefore interested in facts relating to this intersection.
In the final part of his presentation, Prof. Kakita presented a series of expressions, illusions, and proverbs from both Eastern and Western sources that bear on the question of how perception works.
The first was the Japanese expression "soramimi," or "illusory hearing." An example of this is imagining that one hears a baby crying when one hears the sound of water running.
Another example was that a triangle of dots will tend to be seen as a face, rather than a triangle. The third was a Japanese expression which translates as "What I saw as a ghost in the dark was nothing but the 'kare-obana' weed." The fourth was a Japanese proverb, whose translation is "If you pay no attention, [lit., if your mind is not here] you can't comprehend what you see, and you can't understand what you hear." Finally, he mentioned a quote from "Le Petit Prince," by Antoine de Saint-Exupéry: "On ne voit bien qu'avec le coeur. L'essentiel est invisible pour les yeux." ("One sees well only with the mind (lit., heart); the essential is invisible to the eyes.") The common thread of these examples was that they point to the role of higher-level processing in perception.
Prof. Kakita concluded by stating that if we want to build an artificial system that behaves like humans, it should be able to handle the cases mentioned in his examples. Prof. Kakita's general framework for emulating perception is:
__________________________
Dr. Okuno mentioned that both Prof. Matsuyama's vision group and Dr. Okuno's audition group were pursuing multiagent approaches to perception, and that they had both participated in the MACC (Multiple Agents and Cooperative Computing) conference of 1993. Dr. Okuno suggested that multiagent systems are appropriate where the desire is to produce goal-oriented behavior, situation-based behavior, adaptive behavior, openness, or robustness.
The domain to which Dr. Okuno has been applying multi-agent systems is sound source separation. In this application, sound is modeled as a set of streams, each corresponding to a separate source, and each characterized by consistency in some dimension. For each stream, a programmatic agent is generated, which tracks the stream as it develops in time. The operation of Dr. Okuno's system is represented in the following figure:
In its initial state, the sound is input into a tracer generator. Whenever the tracer generator detects a new stream (or a stream not already being tracked by any other tracer), the tracer generator produces a new tracer.
Dr. Okuno demonstrated his system on audio data consisting of voice plus sine tones at various pitches, and a mixture of two voices (a man's and a woman's). In the initial version of the system that Dr. Okuno presented, when the system was attempting to separate the voice and the sine tone, the tracer that was tracing the voice added part of the sine tone at the end. This problem arose because the system had produced redundant tracers; it had no way of knowing that two different agents were tracing the same tone. Another problem with the initial system was ghost tracers - tracers that trace tones not really present in the input.
In a more advanced version of the system, each tracer has a monitor. The monitor terminates the tracer if it is inappropriate (if, for example, it appears to be a ghost tracer) or if it is redundant. The system with monitors avoids the problems of the initial system, with corresponding improvement in the output.
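The interplay of tracer generator, tracers, and monitors can be sketched with a toy model. This is not Dr. Okuno's system. It assumes that each input frame is simply a list of component frequencies, that a tracer is "redundant" when it sits within a tolerance of a longer-lived tracer, and that a tracer is a "ghost" when no component has supported it for more than `max_misses` frames. All names and thresholds are hypothetical.

```python
class Tracer:
    """Agent that follows one stream, here characterized by frequency."""

    def __init__(self, freq):
        self.freq = freq
        self.misses = 0  # consecutive frames with no supporting component

    def update(self, components, tol):
        """Follow the nearest component within tol, or record a miss."""
        close = [c for c in components if abs(c - self.freq) <= tol]
        if close:
            self.freq = min(close, key=lambda c: abs(c - self.freq))
            self.misses = 0
        else:
            self.misses += 1


def monitor(tracers, tol, max_misses):
    """Terminate ghost tracers and tracers that duplicate an older one."""
    kept = []
    for tracer in tracers:
        ghost = tracer.misses > max_misses
        redundant = any(abs(tracer.freq - k.freq) <= tol for k in kept)
        if not ghost and not redundant:
            kept.append(tracer)
    return kept


def trace(frames, tol=5.0, max_misses=1):
    """Run generator, tracers, and monitor over a sequence of frames."""
    tracers = []
    for components in frames:
        for tracer in tracers:
            tracer.update(components, tol)
        tracers = monitor(tracers, tol, max_misses)
        # Tracer generator: spawn a tracer for each untracked component.
        for c in components:
            if not any(abs(c - t.freq) <= tol for t in tracers):
                tracers.append(Tracer(c))
    return tracers
```

Without the call to `monitor`, a tracer whose stream has ended would persist indefinitely, and two tracers that converge on the same component would both keep reporting it. These are the two failure modes described for the initial system.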
_____________________________
A. Ruttenberg, MIT Media Lab Learning and Common Sense Group
Based on his experience with this system and his review of the previous talks, Mr. Ruttenberg raised the following six issues:
Conclusions
In this section I (the editor) try to outline some of the workshop's broader themes. This is not, I should note, the only attempt to summarize the workshop: in particular, the presentations of both Dr. Ando and Mr. Ruttenberg provided broad overviews or interpretations of material presented by other researchers.
Thus the question explored in the workshop was not so much "Is there commonalty?" but "When does it start, and how significant is it?" While it may be the case that no one at the workshop contended that vision and audition are entirely dissimilar, some doubted that these similarities would be important in the design of practical perception systems.
Professor Adelson, however, suggested that the apparently greater clarity of the vision problem may be illusory. His research suggests that since the eye generally interprets visual images as representing a series of layers, it is a mistake to interpret a pixel as representing the light reflecting off a certain part of a certain surface. Like a microphone voltage, the brightness value of a pixel represents a complex summary of physical events for which there is no straightforward decoding.
Perceptual Grouping
I'll conclude with some somewhat subjective observations on Abstract Perception, and on future directions that I believe are worth pursuing.
In my view, many important perception problems can be viewed as problems of hierarchically grouping primitive elements. This is the case even when the ultimate goal of the system appears not to have anything directly to do with grouping.
A speech processing researcher, for example, might claim that the goal of their system is to determine which word is represented by a certain digital audio signal. But it is often the case that once one knows that the signal in question represents one and only one word, and that it doesn't represent any significant amounts of noise or other sounds, then the problem is easy. Isolated word recognition, in other words, is easy. It is continuous speech recognition that is difficult. To transform a continuous speech recognition problem into an isolated word recognition problem, we need to divide the signal into parts, and to form those parts into appropriate groups. A similar argument holds for other audio understanding tasks, such as pitch detection or non-speech sound identification.
In the visual domain, the designer of a visual navigation system might state their problem as the detection and avoidance of obstacles. The obstacle in question, however, must first be segmented from its background. Once it is segmented, it must be organized into meaningful parts. Both of these are grouping problems.
Many vision and audition systems and approaches are explicitly ways of grouping primitive elements:
The justifications for forming groups in various domains and at various levels will, I believe, tend to be different. In vision, a certain grouping may be preferred over others due to considerations involving the reflectance properties of some material, or the physics of how points on a spinning object move relative to each other. In audition, the justification for a grouping may be that sound elements with a common onset are likely to have originated from a common physical source, or it may be influenced by the properties of a mechanical filter in the auditory system. It's usually the case that a number of such factors are in operation simultaneously, and one of the common problems is to find good ways of integrating these factors into a single answer.
The pervasiveness of these grouping problems is what, in my view, binds perception problems in different domains together, and distinguishes them from other kinds of computations. I believe that there is justification for studying the (abstract) problem of forming groups of primitive elements into hierarchies. The results of such a study will in turn be useful for applications in specific domains.
Future Work
I hope, of course, that this argument partly convinces, or at least intrigues, some of its readers. In my view, however, the next step is not to try to make better arguments of this sort, but rather to try to build a system that implements these ideas, and I am in fact working on such a system. The system (called APE - Abstract Perception Engine) is essentially a means of forming and manipulating perceptual groupings. The input to such a system is a list of primitive elements, together with justifications for grouping them in certain ways; the output is graded lists of hierarchical groupings. The system has been partially implemented and applied to the domains of rhythm parsing and sound source separation.
Appendix: Workshop Interview Summaries
Interviewees were asked to give their general impressions of the workshop, and invited to suggest improvements.
Interviewer: David Rosenthal
I would have liked to see more tutorial material in the style of Tom Knight's symposium lecture. Vision people (such as myself) don't know much about the fundamental problems in audition. Mutual education would be beneficial. I'm not so much interested in audition for its own sake, but I'm interested in ways that research in audition might apply to vision.
Holding tutorials for non-experts might be comparable in some ways to teaching undergraduates. Teaching undergraduates is an interesting experience - on the one hand, it's frustrating, because they don't know anything. But on the other hand, it's stimulating, because you have to rethink the fundamentals and basic assumptions of your field.
I find it frustrating when people describe their work in abstract terms, as tends to happen at such gatherings. When people look for commonalties, they tend to get too abstract. I'd like to know about the guts of what they are doing. Starting out with a concrete example, and moving towards an abstraction is the way to go, in my view.
The initial literature had specific examples, which I thought were interesting. I would have liked to see more of that work [dfr's] described there. [N.B. In response to this and similar comments, I made a presentation of my own work the next day - dfr.]
Is there any particular reason why there weren't any women at this conference?
I was glad to be able to find out more about sound. It's difficult to find researchers interested in the general problem of processing sound; most of this work is centered on speech processing.
It seems to me that Shimon Ullman's visual routines work is relevant. [Bobick then explained a few of the basics of Ullman's work, which I haven't included here.]
Audio processing people need to realize the importance of mid-level representations. [What can vision people tell them about that?] Not much! We don't have good mid-level representations either!
I've been thinking about who is "missing," that is, people that it might be a good idea to have at such a workshop. Eric Grimson and Shimon Ullman come to mind. [N.B. These are both vision professors in the MIT AI Lab.] Ullman's ideas are basically abstract perception ideas.
You might also think about having a "dyed-in-the-wool" speech understanding person around, someone interested in Hidden Markov techniques, for example. "We" - i.e., the cognitive AI people of the kind represented at this workshop - have fundamental disagreements with these people, but it would be a mistake to ignore them, and they might provide a valuable perspective. The kinds of statistical techniques they are pursuing will eventually "top out" - they'll reach a stage where they can't be improved any further. But for now, they perform better than the reasoning techniques pursued by people here. We need to be able to make our case convincingly.
I've been having a very pleasant and stimulating time. The staff members have done a very good job. But it's difficult to comment on any concrete accomplishments of the workshop. More email discussion might have helped make the discussion more substantive. The effort was made, so perhaps the participants are at fault. I don't really know how to make this happen.
Abstract Perception doesn't relate directly to my research interests. I'm more inclined to look at specific domains. The symposium talks were very interesting, but today's talks concerned mainly personal research. People seemed a bit at a loss. I'm not sure how motivated you are in presenting Abstract Perception.
The study of multimedia systems was under-represented. This could have led to a common basis for discussion.
The younger Japanese researchers only talk about their own work, and are afraid to speculate. [What would you suggest to correct this?] I don't know; it's a very difficult problem. Several years ago I ran a workshop where we tried to introduce young researchers to problems of computer vision, and we encountered the same problem. I am a somewhat exceptional Japanese researcher in this respect, since I say what I think; this is not common. Having a clear opinion is considered dangerous.
This workshop is important and there should be more of them. It should be made clear what benefit IMRF derives from such a workshop.
It was interesting to me because there are few people around me who are doing research on vision, and I have never thought about the relationship of vision and audition. I learned that the number of people that have thought about this is small.
What was good was that I learned a lot of new things and I met people from a broad range of areas that I could talk to. As for improvements, it should be made clearer what the role of the presenter is. It wasn't clear how people should use their ten minutes. In the morning, people didn't stick to the ten-minute limit. The format wasn't clear. Each presentation was interesting but we couldn't pursue it through discussion. Perhaps a format of presentations only in the morning and discussion only in the afternoon would be a good idea. It might have been nice to have another day.
[Kashino - David's policy was to avoid putting a lot of constraints on presenters.]
But this policy was not announced. When I talked to Horiguchi-sensei, he didn't know about this policy. I understand the spirit of such a policy. I think that some of the presenters did not, however, and this was a problem. Some people talked about their own work, while others talked more generally. In today's session, the Japanese researchers were very intent on presenting their own work. A clearer statement of policy beforehand might have helped.
It's also difficult for the Japanese researchers to participate in English discussions. If you want Japanese researchers to talk about things outside of their specialties, they'll need more preparation than the American researchers. Japanese participants had trouble with the term "domain;" we weren't sure if this term had a special meaning in, for example, vision research.
[Kashino - each of vision and audition is a domain]
Before the interview I was talking with Fujinaga-san, and he was saying that each of the five senses should be considered a domain.
I would have liked to pursue questions about the "exchangeable cortex" ferret experiments further. At what level is this exchange made? But there wasn't enough time.
The talk about spatial mapping in auditory research is reminiscent of some of the things that come up in speech modeling, where according to one theory, we produce speech using an internal model of the mouth.
[Kashino - you mean an internal model of the mechanics of the mouth.]
The mechanics as well.
[Kakita-sensei has the idea that in order to produce speech we need to have a "spatio-mechanical" (my term) internal model of the mouth. Such an ability is interesting from the point of view of Abstract Perception, because such a model is strictly neither visual nor auditory, but rather interestingly situated between them.]
As far as improvements go: it seems that vision and audition people tend to stick to their own areas. The discussion may not be going exactly as Rosenthal intended.
[Kashino - what did you understand from the term "abstract perception?"]
I understood this to refer to algorithms and methods that might be common to vision and audition systems.
The topic of the workshop is broad and abstract. Generally we focus on a more concrete theme. Also, there are language difficulties. Each participant should talk more about the specifics of their research problems. Researchers in vision are not familiar with the details of audition research, and vice versa. Vision has a framework for thinking about intelligent processes, while audition still seems to be involved with the signal processing level. I would like to know more about the audio processing systems discussed at this workshop. Abstract Perception itself is pointed in a good direction, so I expect that discussion will be more focused in the future.
[Issues properly raised?]
I will host an international conference on integration in the human brain this year. We're calling it "information integration," but we haven't settled on a clear definition for this term yet. We'll discuss integration of vision, audition, action, and language recognition. This issue is similar to that of this workshop. Presentation should be based on more concrete data - I know that this is difficult.
[Any ideas from the workshop that you plan to pursue further?]
I learned from the knowledge session that a number of other researchers use the energy-minimization method that I am interested in. I will talk to Prof. Matsuyama about "mutual information quantity."
[Other comments?]
Although the workshop was hosted by a Japanese institute, U. S. researchers played leading roles in the discussion. What does IMRF expect from this workshop? I would suggest inviting more Japanese researchers, and providing the opportunity for them to learn from foreign experts in certain specialties. In the past few years, many advanced foreign researchers have been visiting Japan, despite Japan's current economic problems. The workshop must be characterized more clearly to be significantly successful. I would recommend having brainstorming discussions with a few foreign researchers present.
[Would you participate in another Abstract Perception Workshop if there is one?]
Yes, if there is overlap with my interests.
____________________________
I appreciate it. Japanese generally believe that auditory signal processing research is advanced in Japan, but this is not the case. Only statistically-oriented speech recognition is advanced. There are only a few researchers doing speech perception or audiology. I'm looking for potential colleagues in this field, and at this workshop, I met people that I could talk with, who provided me with interesting suggestions.
[How, in your view, could the workshop be improved?]
If we had another day, we would have had time to make 20-minute presentations. In 10 minutes, I scarcely have time to raise issues of interest to me, or to present my own research results. Perhaps if abstracts were distributed, and sessions were broken into more specific areas, we could participate more in the discussion.
[Issues properly raised?]
I am interested in signal processing, domain-specific issues. But I understand the importance of integrating perception into a common approach. Through this process, the system should maintain some domain specificity. I think that there are domain-specific issues that still elude researchers in perceptual computing. These domain-specific questions should be pursued. Low-level, sub-symbolic processes are domain-specific.
[Any ideas from the workshop that you plan to pursue further?]
As I mentioned, I have few research colleagues in my field, and I will bring back reports from the presentations and discussion made here to my lab.
Most Japanese research follows leads established in the U. S. I'm not always in favor of this situation, but here, it may be warranted. I basically believe that common perceptual processes are those responsible for integration. But we don't yet know what kind of information is shared, and how the process is initiated. The low-level processing issues must be made clear before we can properly address the abstract issues.
The discussion led by Nawab - a free discussion with a moderator, rather than a series of speakers - worked well.
The general organization was very good; organizing an international workshop is not an easy thing to do. Holding the workshop in a ryokan was a very good idea. For the American participants there was another side to the workshop, that of being in Japan, being in the ryokan, the onsen, and so on.
As for the relevance of the workshop to my research interests, it is definitely relevant to my interests; otherwise I wouldn't have come. Abstract perception can aid research in specific perceptual areas. Applying techniques developed in the context of another modality always helps one understand [perceptual modeling problems] better. General principles are useful to everyone. One of the nice things about abstract perception as a field is that it forces one to look at general principles, rather than specific techniques in vision or audition. Actually, we didn't get into that very much; the discussion was more on the level of specific techniques from vision that might be useful in audition, for example. If you were to write a book on the subject, it would be good to try to create such general principles.
When you do manage to form these abstract principles, you often find that they've been studied by mathematicians or information theorists, which makes a whole new body of knowledge available that you can apply to your problem.
As far as applying ideas from the workshop: I'll be working on problems in visual attention this year, and I think that work on attention from the auditory domain is relevant. What will I take back? Well, I'll definitely be opening a book on audition!
As far as the panels go: people mainly present their own work, and don't always get to "crossover," that is, application of ideas from one domain to another. There are other modes of perception that aren't getting discussed, such as tactile perception and taste.
Rhythm - Dave's interest - is auditory perception on a larger scale, much as poetry, sentence understanding, or understanding ballet scenes might be considered speech understanding or vision on a larger scale. There's a general issue of "level" - not just level of abstraction, but level of time-scale, for example.
Maybe we need an "Abstract Perception Conference" - that is, a gathering where people actually have to talk about abstract perception. Psychologists do essentially talk about abstract perception - but we also need to have computational models.
The word "domain" apparently doesn't translate into Japanese with the right connotations. In general, there is a language problem, which inhibits Japanese presenters somewhat. Of course, if you go to a European conference, the main language will also be English. I wanted to hear more from the Japanese researchers. On the other hand, maybe it's not entirely a language problem; maybe if it was all in Japanese, the Japanese participants would have been quieter anyway.
Abstract perception is relevant to my interests in that I want to get input and ideas from as many sources as possible. If I was only interested in finding out what I could from vision people, I could just go to a vision conference. But the issue that I'm interested in - perceptual learning, which I maintain is essentially an abstract perception idea - probably wouldn't get discussed there.
In general I think that the one-on-one conversations were very valuable for me. It will probably take me a little while to digest what I learn here.
When we look at computer implementations of audition and vision, we can examine the issues of similarity more closely. For example, today, Prof. Matsuyama presented a multi-agent approach to vision. My work involves a multi-agent approach to audition. In this case, we can see some commonalty.
I also want to point out the differences between hearing vs. listening, or seeing vs. looking. When we attempt to design real-world systems, we have to consider issues of attention.
As for what I would have done differently: I would have had a session on differences between audition and vision. This might very well lead to a clearer picture of what the commonalties are; if we only look for commonalties, we might end up saying nothing of importance.
Another idea: everyone gives a presentation entitled "Why My Area Is Difficult." This might cause some commonalties to become evident. This might also help to educate conferees about specialties that they don't know about.
As far as applicability to my own research: As I mentioned, I think that attention is an important issue - and it is also a domain-independent issue. Attention requires the use of several modalities; we cannot consider it to be solely a vision problem or an audition problem.
As far as taking things back from the workshop: I will have to digest what I've picked up here; I won't be applying it immediately. Workshop ideas will probably indirectly affect my work. Most important is the networking aspect - getting acquainted with various people, opening up new channels of communication. Then even if I don't understand something, I at least have a pointer to a person.
I would have liked to see more time for individual discussions. A certain amount of time has to be spent "breaking the ice." You need to know who people are in order to communicate with them effectively.
As far as the panels go: there should be a clear distinction between discussion and presentation. This is not the place to simply present your own work on vision or audition. Participants should be more "radical," willing to present possibly controversial ideas. This sort of thing worked better in the afternoon session.
The work discussed here is highly relevant to my interests. I have gained an understanding that there are in fact common problems between audition and vision, though these may not extend to implementation details.
As far as the over-running problem goes, I'd suggest that each participant be limited to one transparency (for a ten-minute talk) so that the presentation is forced to be simple and clear for the audience.
People need an opportunity to describe their personal work, however. For that purpose I'd suggest having a poster session - that is, each person sets up a booth, where they explain their own work. People could then look into each presentation at their own pace. I've seen this format used successfully at other workshops - a poster, followed by one transparency per presentation. Panel discussions should be very actively moderated to ensure that as many people as possible are involved.
I believe strongly in the importance and value of domain-independent approaches to perception. By developing a system that can be applied to both vision and audition, the design is forced to be clean and clearly understood.
At the moment, solutions in both vision and audition tend to be very messy and confounded between different kinds of processing. Domain-independence is a discipline to focus the mind in distinguishing different parts of the system.
I think that the communication between the vision and audition people was very valuable and potentially beneficial to both. Their subtly different approaches to essentially similar problems can be very instructive and inspiring. I wouldn't specify any explicit ideas that I'll be taking home from the conference, but I would emphasize that wider familiarity with research in various areas of perceptual modeling is extremely important and influential in guiding my own work.
Appendix B: Contact Information for Participants
Touru Abe, Japan Advanced Institute for Science and Technology, beto@jaist.ac.jp
Ted Adelson, MIT Media Lab Perceptual Computing Group, adelson@media.mit.edu
Subutai Ahmad, Interval Research, ahmad@interval.com
Masato Akagi, Japan Advanced Institute for Science and Technology, akagi@jaist.ac.jp
Hiroshi Ando, ATR Human Information Processing Laboratories, ando@hip.atr.co.jp
Aaron Bobick, MIT Media Lab Perceptual Computing Group, bobick@media.mit.edu
Dan Ellis, MIT Media Lab Perceptual Computing Group, dpwe@media.mit.edu
Ichiro Fujinaga, McGill University Faculty of Music, ich@sound.music.mcgill.ca
Yuzuru Hiraga, University of Library and Information Science, hiraga@ulis.ac.jp
Susumu Horiguchi, Japan Advanced Institute for Science and Technology, hori@jaist.ac.jp
Toshio Inui, Kyoto University, inui@kuis.kyoto-u.ac.jp
Yuki Kakita, Kanazawa Institute of Technology, kakita@manage.kanazawa-it.ac.jp
Kunio Kashino, Tanaka Laboratory, University of Tokyo, kashino@mtl.t.u-tokyo.ac.jp
Frank Klassner, University of Massachusetts, Distributed Problem Solving Laboratory, klassner@cs.umass.edu
Tom Knight, MIT AI Lab, tk@ai.mit.edu
Takashi Matsuyama, Okayama University, tm@chino.it.okayama-u.ac.jp
Hamid Nawab, Boston University, hamid@engc.bu.edu
Hiroshi Okuno, NTT Basic Research Labs, okuno@ntt-20.ntt.jp
David Rosenthal, International Media Research Foundation, dfr@media.mit.edu
Alan Ruttenberg, MIT Media Lab Learning and Common Sense Group, alanr@media.mit.edu
Yutaka Sakaguchi, Department of Mathematical Engineering and Information Physics, University of Tokyo, sak@bcl.t.u-tokyo.ac.jp
Shigeyoshi Shimotsuji, Toshiba Information and Communication System Laboratories, gajira@isl.rdc.toshiba.co.jp