Keith D. Martin
Machine Listening Group
MIT Media Lab
Submitted to the department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, June, 1999.
(C) Massachusetts Institute of Technology, 1999. All Rights Reserved.
This thesis proposes a theory of sound-source recognition, casting recognition as a process of gathering information to enable the listener to make inferences about objects in the environment or to predict their behavior. In order to explore the process, attention is restricted to isolated sounds produced by a small class of sound sources, the non-percussive orchestral musical instruments. Previous research on the perception and production of orchestral instrument sounds is reviewed from a vantage point based on the excitation and resonance structure of the sound-production process, revealing a set of perceptually salient acoustic features.
A computer model of the recognition process is developed that is capable of "listening" to a recording of a musical instrument and classifying the instrument as one of 25 possibilities. The model is based on current models of signal processing in the human auditory system. It explicitly extracts salient acoustic features and uses a novel improvisational taxonomic architecture (based on simple statistical pattern-recognition techniques) to classify the sound source. The performance of the model is compared directly to that of skilled human listeners, using both isolated musical tones and excerpts from compact disc recordings as test stimuli. The computer model's performance is robust with regard to the variations of reverberation and ambient noise (although not with regard to competing sound sources) in commercial compact disc recordings, and the system performs better than three out of fourteen skilled human listeners on a forced-choice classification task.
This work has implications for research in musical timbre, automatic media annotation, human talker identification, and computational auditory scene analysis.
Thesis supervisor: Barry L. Vercoe
Title: Professor of Media Arts and Sciences