This paper originally appeared in Proceedings of 1992 Conference, American
Voice I/O Society, Sep. 1992, pp. 169-177.
Speech Research Group, MIT Media Lab
20 Ames Street, E15-353, Cambridge MA 02139
A variety of techniques for time-compressing speech have been developed over the last four decades. This paper reviews the literature on these methods, including related perceptual studies of intelligibility and comprehension.
The primary motivation for time-compressed speech is for reducing the
time needed for a user to listen to a message--to increase the communication
capacity of the ear. A secondary motivation is that of data reduction--to
save storage space and transmission bandwidth for speech messages.
Time-compressed speech can be used in a variety of application areas including teaching, aids to the disabled, and human-computer interfaces. Studies have indicated that listening twice to teaching materials that have been speeded up by a factor of two is more effective than listening to them once at normal speed [Sti69]. Time-compressed speech has been used to speed up message presentation in voice mail systems [Hej90, Max80], and in aids for the blind. Speech can be slowed for learning languages, or for the hearing impaired. Time-compression techniques have also been used in speech recognition systems to time-normalize input utterances to a standard length [Mal79].
While the utility of time-compressing recordings is generally recognized, its use, surprisingly, has not become pervasive. Rippey performed an informal study of users of a time-compression tape player installed in a university library. Virtually all the comments were positive, and the librarians reported that the speech compressor was the most popular piece of equipment in the library [Rip75].
The lack of commercial acceptance of time-compressed speech is partly because of the cost of compression devices and the quality of the reproduced speech, but is also attributable to the lack of user control. Traditionally, recordings were reproduced at fixed compression ratios where ``the rate of listening is completely paced by the recording and is not controllable by the listener. Consequently, the listener cannot scan or skip sections of the recording in the same manner as visually scanning printed text, nor can the listener slow down difficult-to-understand portions of the recording'' [Por78].
Powerful computer workstations with speech input/output capabilities make high-quality time-compressed speech readily available. It is now practical to integrate speech time-compression techniques into interactive voice applications, and into the software infrastructure of workstations and portable and hand-held computers, to provide user interfaces for high-speed listening.
There are three variables that can be studied in compressed speech [Duk74a]:
Other related factors come into play in the context of integrating speech into computer workstations or hand-held computers:
There are several ways to express the amount of compression produced by the techniques described in this document. The most common figure in the literature is the compression percentage (footnote-1). A compression of 50% corresponds to a factor of 2 increase in speed (requiring half the time to play). A compression of 20% corresponds to a factor of 5 increase in speed. These numbers are most easily thought of as a percent reduction in time or data.
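For example, since the compression percentage expresses the fraction of the original duration that remains, the speed-up factor is simply 100 divided by the compression percentage: a 40% compression plays in 0.4 of the original time, a 2.5 times increase in speed.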
Time-compressed speech is also referred to as accelerated, compressed, time-scale modified, sped-up, rate-converted, or time-altered speech (footnote-2). There are a variety of techniques for changing the playback speed of speech--these methods are surveyed briefly in the following sections. Note that these techniques are primarily concerned with reproducing the entire recording, not scanning portions of the signal. Most of these methods also work for slowing speech down, but this is not of primary interest. Much of the research summarized here was performed between the mid 1950's and the mid 1970's, often in the context of accelerated teaching techniques, or aids for the blind.
The normal English speaking rate is between 130 and 200 words per minute (wpm). When speaking fast, a talker unintentionally changes relative attributes of his speech such as pause durations, consonant-vowel durations, etc. Talkers can only compress their speech to about 70% because of physiological limitations (footnote-3) [BM76].
Speed changing is analogous to playing a tape recorder at a faster (or slower) speed. This method can be replicated digitally by changing the sampling rate during the playback of a sound. These techniques are undesirable since they produce a frequency shift proportional to the change in playback speed, causing a decrease in intelligibility.
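As an illustration of why this is so, the following sketch (our own, assuming a mono signal in a NumPy array; the function name is hypothetical) simulates a playback-rate change by linear-interpolation resampling; the pitch shift is inherent to the method:

    import numpy as np

    def change_playback_speed(x, factor):
        """Resample x so it plays `factor` times faster at a fixed output
        rate. The pitch rises by the same factor -- the drawback noted
        above."""
        n_out = int(len(x) / factor)
        # positions in the original signal for each output sample
        t = np.linspace(0, len(x) - 1, n_out)
        return np.interp(t, np.arange(len(x)), x)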
With purely synthetic speech it is possible to generate speech at a variety of word rates. Current text-to-speech synthesizers can produce speech at rates up to 550 wpm. This is typically done by selectively reducing the phoneme and silence durations. This technique is powerful, particularly in aids for the disabled, but is not relevant to recorded speech.
Vocoders that extract pitch and voicing information can be used to time-compress speech. Most vocoding efforts, however, have focused on bandwidth reduction rather than on naturalness and high speech quality. The phase vocoder, described in section 7.2, is an exception.
A variety of techniques can be used to find silences (pauses) in speech
and remove them. The resulting speech is ``natural, but many people find
it exhausting to listen to because the speaker never pauses for breath''
[Neu78]. The simplest methods involve
the use of energy or average magnitude measurements combined with time thresholds;
other metrics include zero-crossing rate measurements, LPC parameters, etc.
A variety of speech/silence detection techniques are reviewed in detail in the literature.
Maxemchuk [Max80] used 62.5ms frames of speech corresponding to disk blocks (512 bytes of 8kHz, 8-bit μ-law data). For computational efficiency, only a pseudo-random sample of 32 out of every 512 values was examined to determine low-energy portions of the signal. Several successive frames had to be above or below a threshold in order for a silence or speech determination to be made.
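A minimal sketch of such an energy-plus-time-threshold silence remover follows (all names and parameter values are our illustrative choices, not Maxemchuk's implementation); a run of quiet frames is replaced by a short pause rather than removed outright, anticipating the discussion in section 9.4:

    import numpy as np

    def remove_silence(x, sr=8000, frame_ms=62.5, threshold=0.01,
                       min_silence_frames=4, keep_ms=125):
        """Energy-based silence compression. Frames whose average
        magnitude stays below `threshold` for at least
        `min_silence_frames` consecutive frames are treated as silence;
        each silent run is replaced by a short `keep_ms` pause."""
        n = int(sr * frame_ms / 1000)
        frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
        pause = np.zeros(int(sr * keep_ms / 1000))
        out, run = [], []
        for f in frames:
            if np.mean(np.abs(f)) < threshold:
                run.append(f)
                continue
            if 0 < len(run) < min_silence_frames:
                out.extend(run)      # too brief to count as silence: keep
            elif run:
                out.append(pause)    # long silence: keep only a short pause
            run = []
            out.append(f)
        return np.concatenate(out) if out else x[:0]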
TASI (Time Assigned Speech Interpolation) is used to approximately double the capacity of existing transoceanic telephone cables [MS62]. Talkers are assigned to a specific channel while they are speaking; the channel is then freed during silence intervals. During busy hours, a talker will be assigned to a different channel about every other ``talkspurt''. The TASI speech detector is necessarily a real-time device, and must be sensitive enough to prevent clipping of the first syllable. However, if it is too sensitive, the detector will trigger on noise and the system will operate inefficiently. The turn-on time for the TASI speech detector is 5ms, while the release time is 240ms. The newer DSI (Digital Speech Interpolation) technique is similar, but works entirely in the digital domain. Note that Maxemchuk's system was primarily concerned with reducing the time a listener needed to hear a message and minimizing storage requirements. DSI/TASI are concerned with conserving network bandwidth.
More sophisticated energy and time heuristics ([LRRW81, RS75], summarized in [O'S87]) are used in endpoint detection for isolated word recognition--to ensure that words are not inadvertently clipped. The algorithms for such techniques are more complex than those mentioned above, and such fine-grained accuracy is probably not necessary for compressed speech or speech scanning.
Much of the research in time-compressed speech originated in 1950 with Miller and Licklider's experiments demonstrating the temporal redundancy of speech. The motivation for this work was to increase channel capacity by switching speech on and off at regular intervals so the channel could be used for another transmission (see figures 1 and 2B). It was established that if interruptions were made at frequent intervals, large portions of a message could be deleted without affecting intelligibility [ML50].
Figure 1: Sampling terminology [FK57]
Other researchers concluded that listening time could be saved by abutting the interrupted speech segments. This was first done by Garvey, who manually spliced audio tape segments together [Gar53a, Gar53b], and then by Fairbanks with a modified tape recorder having four rotating pickup heads (footnote-4) [FEJ54]. The bulk of the literature on the intelligibility and comprehension of time-compressed speech is based on such electromechanical tape recorders.
In the Fairbanks, or sampling, technique, segments of the speech signal are alternately discarded and retained (figure 2C). This has traditionally been done isochronously--at constant sampling intervals without regard to the contents of the signal. Implementing such an algorithm on a general-purpose processor is straightforward.
Figure 2: Sampling techniques
Word intelligibility decreases if the sampling interval is too large or too small. Portnoff [Por81] notes that the duration of each sampling interval should be at least as long as one pitch period (e.g., > 15ms), but should also be shorter than the length of a phoneme. Although computationally simple, such time-domain techniques introduce discontinuities at the interval boundaries that are perceived as ``burbling'' distortion and general signal degradation.
It has been noted that some form of windowing function or digital smoothing at the junctions of the abutted segments will improve the audio quality. The ``braided-speech'' method continually blended adjacent segments with linear fades, rather than abutting segments [Que74].
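A digital version of the sampling technique with linear cross-fades at the junctions might be sketched as follows (function and parameter choices are ours; a 30ms/30ms schedule satisfies Portnoff's guideline and yields 50% compression):

    import numpy as np

    def fairbanks_sample(x, sr=8000, keep_ms=30, discard_ms=30, fade_ms=5):
        """Isochronous sampling: alternately retain and discard fixed
        intervals, cross-fading linearly across each junction to reduce
        the 'burbling' distortion of simple abutment."""
        keep = int(sr * keep_ms / 1000)
        skip = int(sr * discard_ms / 1000)
        fade = int(sr * fade_ms / 1000)
        ramp = np.linspace(0.0, 1.0, fade)

        out = x[:keep].copy()
        pos = keep + skip
        while pos + keep <= len(x):
            seg = x[pos:pos + keep]
            # blend the head of the new segment into the current tail
            out[-fade:] = out[-fade:] * (1 - ramp) + seg[:fade] * ramp
            out = np.concatenate([out, seg[fade:]])
            pos += keep + skip
        return out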
Lee describes two digital electronic implementations of the sampling technique [Lee72], and discusses the problems of discontinuities when segments are simply abutted together.
One interesting variant of the sampling method (figure 2D) is achieved by playing the standard sampled signal to one ear and the ``discarded'' material to the other ear (footnote-5) ([Sco67] summarized in [Orr71]). Under this dichotic (footnote-6) condition, intelligibility and comprehension increase. Most subjects also prefer this technique to a diotic presentation of a conventionally sampled signal. Listeners initially reported a switching of attention between ears, but they quickly adjusted to this unusual sensation. Note that for compression ratios up to 50%, the two signals to the ears contain common information. For compressions greater than 50% some information is necessarily lost.
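Under the same assumptions as the sketch above, a dichotic variant simply routes alternate intervals to opposite ears; at 50% compression the two channels together carry the entire signal:

    import numpy as np

    def dichotic_sample(x, sr=8000, interval_ms=50):
        """Play the 'retained' stream to one ear and the 'discarded'
        stream to the other (figure 2D). Returns a stereo array that
        plays at twice normal speed."""
        n = int(sr * interval_ms / 1000)
        segs = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
        left = np.concatenate(segs[0::2])
        right = np.concatenate(segs[1::2])
        m = min(len(left), len(right))
        return np.stack([left[:m], right[:m]], axis=1)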
The basic sampling technique periodically removes pieces of the speech
waveform without regard to whether it contains any redundant speech information.
David and McDonald demonstrated a bandwidth reduction technique in 1956
that selectively removed (redundant) pitch periods from speech signals [DM56]. Scott applied the same ideas to
time compression, setting the sampling and discard intervals to be synchronous
with the pitch periods of the speech. Discontinuities in the time compressed
signal were reduced, and intelligibility increased [SG72].
Neuburg developed a similar technique in which intervals equal to the pitch
period were discarded (but not synchronous with the pitch pulses). Finding
the pitch pulses is hard, yet estimating the pitch period is much easier,
even in noisy speech [Neu78].
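An illustrative sketch of this idea follows (this is not Neuburg's actual algorithm; the autocorrelation estimator and all parameters are our assumptions):

    import numpy as np

    def estimate_period(frame, sr=8000, fmin=60, fmax=400):
        """Estimate the pitch period in samples by autocorrelation,
        one common and simple approach."""
        lags = np.arange(int(sr / fmax), int(sr / fmin))
        ac = [np.dot(frame[:len(frame) - l], frame[l:]) for l in lags]
        return int(lags[int(np.argmax(ac))])

    def discard_periods(x, sr=8000, frame_ms=30, keep_periods=3):
        """From each frame, retain `keep_periods` pitch periods and
        discard the rest -- intervals of one (estimated) period length,
        not synchronized to the pitch pulses themselves."""
        n = int(sr * frame_ms / 1000)
        out = []
        for i in range(0, len(x) - n + 1, n):
            frame = x[i:i + n]
            p = estimate_period(frame, sr)
            out.append(frame[:min(keep_periods * p, n)])
        return np.concatenate(out)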
Since frequency-domain properties are expensive to compute, it has been suggested that easy-to-extract time-domain features can be used to segment speech into transitional and sustained segments. For example, simple amplitude and zero crossing measurements for 10ms frames can be used to group adjacent frames for similarity--redundant frames can then be selectively removed [Que74]. Toong [Too74] selectively deleted 50-90% of vowels, up to 50% of consonants and fricatives, and up to 100% of silence. However, he found that complete elimination of silences was undesirable (see also section 9.4).
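The following sketch illustrates such frame grouping (names and thresholds are ours, and much simpler than the schemes in the cited work): adjacent 10ms frames with similar average magnitude and zero-crossing rate are treated as a sustained segment, and only the first few frames of each run are kept:

    import numpy as np

    def zero_crossing_rate(frame):
        return np.mean(np.abs(np.diff(np.sign(frame)))) / 2

    def drop_redundant_frames(x, sr=8000, frame_ms=10, tol=0.2, max_run=3):
        """Keep transitional frames; shorten sustained (redundant) runs."""
        n = int(sr * frame_ms / 1000)
        frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
        feats = [(np.mean(np.abs(f)), zero_crossing_rate(f)) for f in frames]
        out, run = [frames[0]], 1
        for prev, cur, f in zip(feats, feats[1:], frames[1:]):
            similar = (abs(cur[0] - prev[0]) <= tol * (prev[0] + 1e-9) and
                       abs(cur[1] - prev[1]) <= tol)
            run = run + 1 if similar else 1
            if run <= max_run:
                out.append(f)   # keep transitions and short sustained runs
        return np.concatenate(out)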
The synchronized overlap add method (SOLA) first described by Roucos
and Wilgus [RW85] has recently become popular
in computer-based systems. It is a fast non-iterative optimization of a
Fourier-based algorithm described in [GL84].
``Of all time scale modification methods proposed, SOLA appears to be the
simplest computationally, and therefore most appropriate for real-time applications''
[WRW89]. Conceptually, the SOLA method consists
of shifting the beginning of a new speech segment over the end of the preceding
segment to find the point of highest cross-correlation. Once this point
is found, the frames are overlapped and averaged together, as in the sampling
method. This technique provides a locally optimal match between successive
the frames in this manner tends to preserve the time-dependent pitch, magnitude,
and phase of a signal. The shifts do not accumulate since the target position
of a window is independent of any previous shifts [Hej90].
The SOLA method is simple and effective as it does not require pitch extraction,
frequency-domain calculations, or phase unwrapping, and it is non-iterative [ME86]. The SOLA technique can be considered a type
of selective sampling that effectively removes redundant pitch periods.
A windowing function can be used with this technique to smooth between segments, producing significantly fewer artifacts than traditional sampling techniques. Makhoul used both linear and raised-cosine functions for averaging windows, and found the simpler linear function sufficient [ME86]. The SOLA algorithm is robust in the presence of correlated or uncorrelated noise, and can improve the signal-to-noise ratio of noisy speech [WW88, WRW89].
Several improvements to the SOLA method have been suggested that offer improved computational efficiency, or increased robustness in compression/decompression applications [ME86, WW88, WRW89, Har90, Hej90]. Hejna, in particular, provides a detailed description of SOLA, including an analysis of the interactions of various parameters used in the algorithm.
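The following is a minimal sketch of the SOLA idea (an 8 kHz mono signal is assumed; the window, hop, and search-range values are our illustrative choices rather than those of the cited papers). It uses the linear cross-fade that Makhoul found sufficient; for example, sola(x, 2.0) returns a signal that plays in half the time at the original pitch:

    import numpy as np

    def sola(x, speed, sr=8000, win_ms=30, sa_ms=15, search_ms=7.5):
        """Synchronized overlap-add time compression (speed >= 1)."""
        W = int(sr * win_ms / 1000)        # analysis window length
        Sa = int(sr * sa_ms / 1000)        # analysis hop (input)
        Ss = int(Sa / speed)               # synthesis hop (output)
        K = int(sr * search_ms / 1000)     # maximum shift searched

        out = np.zeros(int(len(x) / speed) + W + 2 * K)
        out[:W], end = x[:W], W            # `end` = samples written so far
        m = 1
        while m * Sa + W <= len(x):
            seg = x[m * Sa : m * Sa + W]   # next analysis window
            pos = m * Ss                   # its nominal output position
            # slide the window around `pos` to find the best alignment
            # with the material already in the output buffer
            best_k, best_c = 0, -np.inf
            for k in range(max(-K, end - pos - W, -pos), K + 1):
                L = end - (pos + k)        # overlap length at this shift
                ref, new = out[pos + k : end], seg[:L]
                c = np.dot(ref, new) / (np.sqrt(np.dot(ref, ref) *
                                                np.dot(new, new)) + 1e-9)
                if c > best_c:
                    best_k, best_c = k, c
            j = pos + best_k
            L = end - j
            ramp = np.linspace(0.0, 1.0, L)
            out[j:end] = out[j:end] * (1 - ramp) + seg[:L] * ramp
            out[end : j + W] = seg[L:]     # append the new material
            end = j + W
            m += 1
        return out[:end]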
In addition to the frequency domain methods outlined in this section, there are a variety of other frequency-based techniques that can be used for time compressing speech (e.g., [MQ86, QM86]).
Harmonic compression involves the use of a fine-tuned (typically analog)
filter bank. The energy outputs of the filters are used to drive filters
at half the frequency of the original. A tape of the output of this system
is then played on a tape recorder at twice normal speed. The compression
ratio of this frequency-domain technique was fixed, and the system was developed before it was practical to use digital computers for such signal processing.
Malah describes time-domain harmonic scaling which requires pitch estimation, is pitch synchronous, and can only accommodate certain compression ratios [Mal79, Lim83].
A vocoder that maintains phase [Dol86] can be used for time-compression. A phase vocoder can be interpreted as a filterbank and thus is similar to the harmonic compressor. A phase vocoder is, however, significantly more complex because calculations are done in the frequency domain, and the phase of the original signal must be reconstructed.
Portnoff [Por81] developed a system for time-scale modification of speech based on short-time Fourier analysis. His system provided high quality compression of up to 33% while retaining the natural quality and speaker-dependent features of the speech. The resulting signals were free from artifacts such as glitches, burbles, and reverberations typically found in time-domain methods of compression such as sampling.
Phase vocoding techniques are more accurate than time domain techniques, but are an order of magnitude more computationally complex because Fourier analysis is required. The phase vocoder is particularly good at slowing speech down to hear features that cannot be heard at normal speed--such features are typically lost using time domain techniques. Dolson says ``a number of time-domain procedures can be employed at substantially less computational expense. But from a standpoint of fidelity (i.e., the relative absence of objectionable artifacts), the phase vocoder is by far the most desirable.''
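As a sketch of this filterbank interpretation (the parameter choices and structure are our assumptions; production implementations require more care at the signal edges), a basic phase vocoder for time-scale modification can be written around the short-time Fourier transform:

    import numpy as np

    def stft(x, n_fft, hop, win):
        n_frames = 1 + (len(x) - n_fft) // hop
        return np.array([np.fft.rfft(win * x[i * hop : i * hop + n_fft])
                         for i in range(n_frames)])

    def istft(X, n_fft, hop, win):
        out = np.zeros(n_fft + hop * (len(X) - 1))
        norm = np.zeros_like(out)
        for i, frame in enumerate(X):
            out[i * hop : i * hop + n_fft] += win * np.fft.irfft(frame, n_fft)
            norm[i * hop : i * hop + n_fft] += win ** 2
        return out / np.maximum(norm, 1e-8)

    def phase_vocoder(x, rate, n_fft=1024, hop=256):
        """Time-scale x by `rate` (>1 speeds up) at constant pitch by
        interpolating STFT magnitudes and accumulating phase."""
        win = np.hanning(n_fft)
        X = stft(x, n_fft, hop, win)
        # expected per-hop phase advance for each frequency bin
        omega = 2 * np.pi * hop * np.arange(n_fft // 2 + 1) / n_fft
        phase = np.angle(X[0])
        Y = []
        for s in np.arange(0, len(X) - 1, rate):
            i, frac = int(s), s - int(s)
            mag = (1 - frac) * np.abs(X[i]) + frac * np.abs(X[i + 1])
            Y.append(mag * np.exp(1j * phase))
            dphi = np.angle(X[i + 1]) - np.angle(X[i]) - omega
            dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))  # wrap
            phase = phase + omega + dphi  # accumulate the true phase
        return istft(np.array(Y), n_fft, hop, win)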
The time-compression techniques described above can be mixed and matched in a variety of ways. Such combined methods can provide a variety of signal characteristics and a range of compression ratios.
Maxemchuk [Max80] found that eliminating every
other non-silent block (1/16th second) produced ``extremely choppy and virtually
unintelligible playback.'' Eliminating intervals with less energy than the
short-term average (and no more than one in a row), produced distorted but
intelligible speech. This technique produced compressions of 33 to 50 percent.
Maxemchuk says that this technique ``has the characteristic that those
words which the speaker considered to be most important and spoke louder
were virtually undistorted, whereas those words that were spoken softly
are shortened. After a few seconds of listening to this type of speech,
listeners appear to be able to infer the distorted words and obtain the meaning
of the message.'' He believes such a technique would be ``useful for users
of a message system to scan a large number of messages and determine which
they wish to listen to more carefully or for users of a dictation system
to scan a long document to determine the areas they wish to edit.''
Silence compression and sampling can be combined in several ways. Silences can first be removed from a signal that is then sampled. Alternatively, the output of a silence detector can be used to set boundaries for sampling, producing a selective sampling technique. Note that using silences to find discard intervals eliminates the need for a windowing function to smooth (de-glitch) the sound at the boundaries of the sampled intervals.
On the surface it would appear that removing silences and time-compressing
speech using a technique such as the overlap-add method would be linearly
independent, and could thus be performed in either order. In practice there
are some minor differences, because the SOLA algorithm makes assumptions
about the properties of the speech signal. The Speech Research Group has
informally found a slight improvement in speech quality by applying the
SOLA algorithm before removing silences. Note that timing parameters must
be modified under these conditions. For example, with speech compressed to 50%,
the silence removal timing thresholds must also be cut in half.
This combined technique is effective, and can produce a fast and dense speech stream. Note that silence periods can be selectively retained or shortened, rather than simply removed.
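Continuing with the illustrative functions sketched earlier, the combination (with timing thresholds scaled by the compression, as noted above) might read:

    def compress_then_trim(x, speed, sr=8000):
        """Apply SOLA first, then silence compression, per the informal
        finding above. The silence timing thresholds shrink with the
        speech: at 2x speed they are cut in half."""
        y = sola(x, speed, sr=sr)
        return remove_silence(y, sr=sr,
                              frame_ms=62.5 / speed,
                              keep_ms=125 / speed)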
A sampled signal compressed by 50% can be presented dichotically so that exactly half the signal is presented to one ear, while the remainder of the signal is presented to the other ear. Generating such a lossless dichotic presentation is difficult with the SOLA method because the segments of speech are shifted relative to one another to find the point of maximum similarity. However, by choosing two starting points in the speech data carefully (based on the parameters used in the SOLA algorithm), it is possible to maximize the difference between the signals presented to the two ears. We have informally found this technique to be effective since it combines the high quality sounds produced with the SOLA algorithm with the binaural effect of the dichotic presentation.
There has been a significant amount of perceptual work performed in the areas of intelligibility and comprehension of time-compressed speech. Much of this research has been summarized in [BM76], [FS69], and [Fou71].
``Intelligibility'' usually refers to the ability to identify isolated
words. Depending on the type of experiment, such words may either be selected
from a closed set, or written down (or shadowed) by the subject from an
open-ended set. ``Comprehension'' refers to the understanding of the content
of the material. This is usually tested by asking questions about a passage
of recorded material.
In general, intelligibility is more resistant to degradation as a function of time-compression than is comprehension [Ger74]. Early studies showed that single well-learned phonetically balanced words could remain intelligible with a 10-15% compression (roughly 7-10 times normal speed), while connected speech remains comprehensible to a 50% compression (twice normal speed).
There are some practical limitations on the maximum amount that a speech
signal can be compressed. Portnoff notes that arbitrarily high compression
ratios are not physically reasonable. He considers, for example, a voiced
phoneme containing four pitch periods. Greater than 25% compression reduces
this phoneme to less than one pitch period, destroying its periodic character.
Thus, he expects high compression ratios to produce speech with a rough
quality and low intelligibility [Por81].
The ``dichotic advantage'' (section 6.2) is maintained for compression ratios of up to 33%. For discard intervals between 40-70ms, dichotic intelligibility was consistently higher than diotic intelligibility [GW77]. A dichotic discard interval of 40-50ms was found to have the highest intelligibility (40ms was described as the ``optimum interval'' in another study [Ger74]). Earlier studies suggest that a shorter interval of 18-25ms may be better for diotic speech [BM76].
Gerber showed that 50% compression presented diotically was significantly better than 25% compression presented dichotically, even though the information quantity of the presentations was the same. These and other data provide conclusive evidence that 25% compression is too fast for the information to be processed by the auditory system; the loss of intelligibility, however, is not due to the loss of information caused by the compression process [Ger74].
Foulke [FS69] reported that comprehension declines slowly up to a word rate of 275wpm, but more rapidly beyond that point. The decline in comprehension was not attributable to intelligibility alone, but was related to a processing overload of short-term memory. Recent experiments with French have shown that intelligibility and comprehension do not significantly decay until a high rate (300wpm) is reached [RSLM88].
Note that in much of the literature the limiting factor that is often cited is word rate, not compression ratios. The compression required to boost the speech rate to 275 words per minute is both talker and context dependent (e.g., read speech is typically faster than spontaneous speech).
Foulke and Sticht permitted sighted college students to select a preferred degree of time-compression for speech spoken at an original rate of 175wpm. The mean preferred compression was 82% corresponding to a word rate of 212wpm. For blind subjects it was observed that 64-75% time-compression and word rates of 236-275 words per minute were preferred. These data suggest that blind subjects will trade increased effort in listening to speech for a greater information rate and time savings [ZDS68].
Comprehension of interrupted speech (as in [ML50]) was good, probably because the temporal duration of the original speech signal was preserved, providing ample time for subjects to attempt to process each word [HLLB86]. Compression necessitates that each portion of speech be perceived in less time than normal. However, each unit of speech is presented in a less redundant context, so that more time per unit is required. Based on the large body of work in compressed speech, Heiman and his colleagues suggest that 50% compression removes virtually all redundant information; with greater than 50% compression, critical non-redundant information is also lost. They conclude that the compression ratio, rather than word rate, is the crucial parameter, because greater than 50% compression presents too little of the signal in too little time for a sufficient number of words to be accurately perceived. They believe that the 275 wpm rate is of little significance, but that compression and its underlying temporal interruptions decrease word intelligibility, which in turn decreases comprehension.
As with other cognitive activities, such as listening to synthetic speech,
exposure to time-compressed speech increases both intelligibility and comprehension.
There is a novelty in listening to time-compressed speech for the first
time that is quickly overcome with experience.
Even naive listeners can tolerate compressions of up to 50%, and with 8-10 hours of training, substantially higher speeds are possible [OFW65]. Orr hypothesizes that ``the review of previously presented material could be more efficiently accomplished by means of compressed speech; the entire lecture, complete with the instructor's intonation and emphasis might be re-presented at high speed as a review.'' Voor found that practice increased comprehension of rapid speech, and that adaptation time was short (minutes rather than hours) [VM65].
Beasley reports on an informal basis that following a 30 minute or so exposure to compressed speech, listeners become uncomfortable if they are forced to return to the normal rate of presentation [BM76]. He also reports on a controlled experiment extending over a six week period that found subjects' listening rate preference shifted to faster rates after exposure to compressed speech.
It may not be desirable to completely remove silences, as they often
provide important semantic and syntactic cues. With normal prosody, intelligibility
was higher for periodic segmentation (inserting silences after every eighth
word (footnote-8)) than for
syntactic segmentation (inserting silences after major clause and sentence boundaries).
Wingfield says that ``time restoration, especially at high compression ratios,
will facilitate intelligibility primarily to the extent that these presumed
processing intervals coincide with the linguistic structure of the speech.''
In another experiment, subjects were allowed to stop time-compressed recordings at any point, and were instructed to repeat what they had heard [WN80]. It was found that the average reduction in selected segment duration was almost exactly proportional to the increase in the speech rate. For example, the mean segment duration for the normal speech was 3s, while the chosen segment duration of speech compressed 60% was 1.7s. Wingfield found that ``while time and/or capacity must clearly exist as limiting factors to a theoretical maximum segment size which could be held [in short-term memory] for analysis, speech content as defined by syntactic structure, is a better predictor of subjects' segmentation intervals than either elapsed time or simple number of words per segment. This latter finding is robust, with the listeners' relative use of the [syntactic] boundaries remaining virtually unaffected by increasing speech rate.''
In the perception of normal speech, it has been found that pauses exerted a considerable effect on the speed and accuracy with which sentences were recalled, particularly under conditions of cognitive complexity [Rei80]. Pauses, however, are only useful when they occur between clauses within sentences--pauses within clauses are disrupting. When a 330ms pause was inserted ungrammatically, response time for a particular task was increased by 2s. Pauses suggest the boundaries of material to be analyzed, and provide vital cognitive processing time.
Maxemchuk found that eliminating silent intervals decreased the playback time of recorded speech, with compression ratios of 50 to 75 percent depending on the talker and material. In his system a 1/8 second pause was inserted whenever a pause greater than or equal to 1 second occurred in a message. This appeared to be sufficient to prevent different ideas or sentences in the recorded document from running together. This type of rate increase does not affect the intelligibility of individual words within the active speech regions [Max80].
Studies of pauses in speech also consider the duration of the ``non-pause'' or ``speech unit''. In one study of spontaneous speech, the mean speech unit was 2.3 seconds. Minimum pause durations typically considered in the literature range from 50-800ms, with the majority in the 250-500ms region. As the minimum pause duration increases, the mean speech unit length increases (e.g., for pauses of 200, 400, 600, and 800ms, the corresponding speech unit lengths were 1.15, 1.79, 2.50, and 3.52s respectively). In another study, it was found that inter-phrase pauses were longer and occurred less frequently than intra-phrase pauses (data from several articles summarized in [Agn74]).
Hesitation pauses are not under the conscious control of the talker, and average 200-250ms. Juncture pauses are under talker control, and average 500-1000ms. Several studies have shown that breath stops in oral reading are about 400ms. In a study of the durational aspects of speech, it was found that the silence and speech unit durations were longer for spontaneous speech than for read speech, and that the overall word rate was slower. The largest changes occurred in the durations of the silence intervals. The greater number of long silence intervals were assumed to reflect the tendency for speakers to hesitate more during spontaneous speech than during oral reading [Min74].
Lass states that juncture pauses are important for comprehension, so they cannot be eliminated or reduced without interfering with comprehension [LL77]. Theories about memory suggest large-capacity rapid-decay sensory storage followed by limited capacity perceptual memory. Studies have shown that increasing silence intervals between words increases recall accuracy. Aaronson suggests that for a fixed amount of compression, it may be optimal to delete more from the words than from the intervals between the words. She states that ``English is so redundant that much of the word can be eliminated without decreasing intelligibility, but the interword intervals are needed for perceptual processing'' [AMS71].
This paper has reviewed a variety of techniques for time-compressing speech,
as well as related perceptual limits of intelligibility and comprehension.
The SOLA method is currently favored for real-time applications; however, a digital version of the Fairbanks sampling method can easily be implemented and produces fair speech quality with little computation.
Time-compressed speech has recently begun showing up in voice applications and computer interfaces that use speech [WSB92]. Allowing the user to interactively change the speed at which speech is presented is important in getting over the ``time bottleneck'' often associated with voice interfaces. The techniques described in this paper can thus aid in user acceptance of voice applications.
Though dated, the most readily accessible, and most often cited, reference is [FS69]. Another broad and more recent summary is [BM76]. An extensive anthology and bibliography [Duk74b] that contains copies and extracts of many earlier works is still in print.
Lisa Stifelman and Eric Hulteen provided comments on a draft of this paper.
This work was sponsored by Apple Computer, Inc.