This paper originally appeared in Proceedings of 1992 Conference, American
Voice I/O Society, Sep. 1992, pp. 169-177.
Barry Arons
Speech Research Group, MIT Media Lab
20 Ames Street, E15-353, Cambridge MA 02139
+1 617-253-2245
barons@media-lab.mit.edu
A variety of techniques for time-compressing speech have been developed over the last four decades. This paper reviews the literature on methods for time-compressing speech, including related perceptual studies of intelligibility and comprehension.
The primary motivation for time-compressed speech is to reduce the
time needed for a user to listen to a message--to increase the communication
capacity of the ear. A secondary motivation is that of data reduction--to
save storage space and transmission bandwidth for speech messages.
Time-compressed speech can be used in a variety of application areas including
teaching, aids to the disabled, and human-computer interfaces. Studies have
indicated that listening twice to teaching materials that have been sped
up by a factor of two is more effective than listening to them once at normal
speed [Sti69]. Time-compressed speech has
been used to speed up message presentation in voice mail systems [Hej90, Max80], and in aids for
the blind. Speech can be slowed for learning languages, or for the hearing
impaired. Time compression techniques have also been used in speech recognition
systems to time normalize input utterances to a standard length [Mal79].
While the utility of time compressing recordings is generally recognized,
surprisingly, its use has not become pervasive. Rippey performed an informal
study on users of a time-compression tape player installed in a university
library. Virtually all the comments were positive, and the librarians reported
that the speech compressor was the most popular piece of equipment in the
library [Rip75].
The lack of commercial acceptance of time-compressed speech is partly because
of the cost of compression devices and the quality of the reproduced speech,
but is also attributable to the lack of user control. Traditionally, recordings
were reproduced at fixed compression ratios where ``the rate of listening
is completely paced by the recording and is not controllable by the listener.
Consequently, the listener cannot scan or skip sections of the recording
in the same manner as visually scanning printed text, nor can the listener
slow down difficult-to-understand portions of the recording'' [Por78].
Powerful computer workstations with speech input/output capabilities make
high quality time-compressed speech readily available. It is now practical
to integrate speech time-compression techniques into interactive voice applications,
and into the software infrastructure of workstations and of portable and hand-held
computers, to provide user interfaces for high-speed listening.
There are three variables that can be studied in compressed speech [Duk74a]:
Other related factors come into play in the context of integrating speech into computer workstations or hand-held computers:
There are several ways to express the amount of compression produced by the techniques described in this document. The most common figure in the literature is the compression percentage (footnote-1). A compression of 50% corresponds to a factor of 2 increase in speed (requiring half the time to play). A compression of 20% corresponds to a factor of 5 increase in speed. These numbers are most easily thought of as a percent reduction in time or data.
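The relationship between the compression percentage and the speed factor can be made concrete with a short sketch. It assumes the convention used in the two examples above, where the percentage is the share of the original playing time that remains. The function names are illustrative and do not come from the literature.

```python
def speed_factor(compression_percent: float) -> float:
    """Speed-up factor implied by a compression percentage.

    Assumes the percentage is the share of the original playing
    time that remains, so 50% means twice as fast.
    """
    if not 0 < compression_percent <= 100:
        raise ValueError("compression percentage must be in (0, 100]")
    return 100.0 / compression_percent


def compressed_duration(original_seconds: float, compression_percent: float) -> float:
    """Playing time, in seconds, after compression."""
    if not 0 < compression_percent <= 100:
        raise ValueError("compression percentage must be in (0, 100]")
    return original_seconds * compression_percent / 100.0
```

Under this convention a 60 second message compressed 50% plays in 30 seconds, and the same message compressed 20% plays in 12 seconds.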
Time-compressed speech is also referred to as accelerated, compressed, time-scale
modified, sped-up, rate-converted, or time-altered speech (footnote-2). There are a variety of techniques for changing
the playback speed of speech--these methods are surveyed briefly
in the following sections. Note that these techniques are primarily concerned
with reproducing the entire recording, not scanning portions of the signal.
Most of these methods also work for slowing speech down, but this is not
of primary interest. Much of the research summarized here was performed
between the mid-1950s and the mid-1970s, often in the context of accelerated
teaching techniques, or aids for the blind.
The normal English speaking rate is between 130-200 words per minute (wpm). When speaking fast, a talker unintentionally changes relative attributes of his speech such as pause durations, consonant-vowel duration, etc. Talkers can only compress their speech to about 70% because of physiological limitations (footnote-3) [BM76].
Speed changing is analogous to playing a tape recorder at a faster (or slower) speed. This method can be replicated digitally by changing the sampling rate during the playback of a sound. These techniques are undesirable since they produce a frequency shift proportional to the change in playback speed, causing a decrease in intelligibility.
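The frequency shift can be illustrated with a small sketch that reads a stored tone at twice the normal step, which is one naive way to change the playback speed digitally. The nearest-sample lookup, the absence of any anti-alias filtering, the zero-crossing frequency estimate, and all function names are simplifying assumptions made for this example.

```python
import math


def make_tone(freq, sample_rate, seconds, phase=0.1):
    """Generate a sine tone as a list of floats (the small phase offset avoids exact-zero samples)."""
    count = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq * n / sample_rate + phase) for n in range(count)]


def speed_change(samples, factor):
    """Naive speed change: step through the input `factor` times faster.

    Uses nearest-sample lookup with no filtering, so the spectrum is
    scaled by the same factor when played at the original sample rate.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(len(samples) / factor)
    return [samples[int(i * factor)] for i in range(n_out)]


def estimate_frequency(samples, sample_rate):
    """Rough frequency estimate of a single tone from rising zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)
```

Applied to a 100 Hz tone with a factor of 2, the output is half as long and its estimated frequency is close to 200 Hz, which is the pitch shift that the other techniques in this paper are designed to avoid.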
With purely synthetic speech it is possible to generate speech at a variety of word rates. Current text-to-speech synthesizers can produce speech at rates up to 550 wpm. This is typically done by selectively reducing the phoneme and silence durations. This technique is powerful, particularly in aids for the disabled, but is not relevant to recorded speech.
Vocoders that extract pitch and voicing information can be used to time-compress speech. Most vocoding efforts, however, have focused on bandwidth reduction rather than on naturalness and high speech quality. The phase vocoder, described in section 7.2, is an exception.
A variety of techniques can be used to find silences (pauses) in speech
and remove them. The resulting speech is ``natural, but many people find
it exhausting to listen to because the speaker never pauses for breath''
[Neu78]. The simplest methods involve
the use of energy or average magnitude measurements combined with time thresholds;
other metrics include zero-crossing rate measurements, LPC parameters, etc.
A variety of speech/silence detection techniques are reviewed in detail
in [Aro92].
Maxemchuk [Max80] used 62.5ms frames of speech
corresponding to disk blocks (512 bytes of 8kHz, 8-bit μ-law data). For computational
efficiency, only a pseudo-random sample of 32 out of every 512 values was
examined to determine low energy portions of the signal. Several successive
frames had to be above or below a threshold in order for a silence or speech
determination to be made.
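A detector in the spirit of this description can be sketched as follows. It is not Maxemchuk's implementation: the average-magnitude measure, the reuse of one set of probe offsets for every frame, the threshold, and the number of successive frames required (`hold`) are all assumptions chosen for illustration, and the input is assumed to be linear samples rather than μ-law bytes.

```python
import random

FRAME = 512   # samples per frame (one disk block in the description above)
PROBES = 32   # samples inspected per frame


def frame_levels(samples, seed=0):
    """Average magnitude of a pseudo-random subset of each full frame.

    The same PROBES offsets are reused for every frame (an assumption).
    A trailing partial frame is ignored.
    """
    rng = random.Random(seed)
    offsets = rng.sample(range(FRAME), PROBES)
    levels = []
    for start in range(0, len(samples) - FRAME + 1, FRAME):
        total = sum(abs(samples[start + k]) for k in offsets)
        levels.append(total / PROBES)
    return levels


def classify(levels, threshold, hold=3):
    """Label each frame True (speech) or False (silence).

    The state flips only after `hold` successive frames disagree with
    it, so isolated loud or quiet frames are ignored. Labels therefore
    lag the true transition by hold - 1 frames.
    """
    state = False
    run = 0
    labels = []
    for level in levels:
        is_loud = level > threshold
        if is_loud != state:
            run += 1
            if run >= hold:
                state = is_loud
                run = 0
        else:
            run = 0
        labels.append(state)
    return labels
```

Requiring several frames before switching trades a short detection delay for robustness against clicks and brief dropouts; a real system would compensate for the delay when cutting the signal.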
TASI (Time Assigned Speech Interpolation) is used to approximately double
the capacity of existing transoceanic telephone cables [MS62].
Talkers are assigned to a specific channel while they are speaking; the
channel is then freed during silence intervals. During busy hours, a talker
will be assigned to a different channel about every other ``talkspurt''.
The TASI speech detector is necessarily a real-time device, and must be
sensitive enough to prevent clipping of the first syllable. However, if
it is too sensitive, the detector will trigger on noise and the system will
operate inefficiently. The turn-on time for the TASI speech detector is
5ms, while the release time is 240ms. The newer DSI (Digital Speech Interpolation)
technique is similar, but works entirely in the digital domain. Note that
Maxemchuk's system was primarily concerned with reducing the time a listener
needed to hear a message and minimizing storage requirements. DSI/TASI are
concerned with conserving network bandwidth.
More sophisticated energy and time heuristics ([LRRW81,
RS75], summarized in [O'S87])
are used in endpoint detection for isolated word recognition--to ensure
that words are not inadvertently clipped. The algorithms for such techniques
are more complex than those mentioned above, and such fine-grained accuracy
is probably not necessary for compressed speech or speech scanning.
The basis for much of the research in time-compressed speech was established
in 1950 by Miller and Licklider with experiments that demonstrated the temporal
redundancy of speech. The motivation for this work was to increase channel
capacity by switching speech on and off at regular intervals so the channel
could be used for another transmission (see figures 1
and 2B). It was established that if interruptions
were made at frequent intervals, large portions of a message could be deleted
without affecting intelligibility [ML50].

Figure 1: Sampling terminology [FK57]
Other researchers concluded that listening time could be saved by abutting
the interrupted speech segments. This was first done by Garvey who manually
spliced audio tape segments together [Gar53a,
Gar53b], then by Fairbanks with a modified
tape recorder with four rotating pickup heads (footnote-4) [FEJ54]. The
bulk of literature involving the intelligibility and comprehension of time-compressed
speech is based on such electromechanical tape recorders.
In the Fairbanks, or sampling, technique, segments of the speech signal
are alternately discarded and retained (figure 2C).
This has traditionally been done isochronously--at constant sampling intervals
without regard to the contents of the signal. Implementing such an algorithm
on a general purpose processor is straightforward.
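A minimal sketch of isochronous sampling on a list of audio samples is shown here. The function and parameter names are illustrative; `retain` and `discard` are interval lengths expressed in samples, and the retained segments are simply abutted.

```python
def fairbanks_sample(samples, retain, discard):
    """Keep `retain` samples, drop the next `discard`, and repeat.

    The intervals are fixed in advance and applied without regard to
    the contents of the signal.
    """
    if retain <= 0 or discard < 0:
        raise ValueError("retain must be positive and discard non-negative")
    period = retain + discard
    out = []
    for start in range(0, len(samples), period):
        out.extend(samples[start:start + retain])
    return out
```

With equal intervals the output is half as long as the input. For 8kHz speech, `fairbanks_sample(speech, 240, 240)` would retain and discard 30ms at a time.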

Figure 2: Sampling techniques
Word intelligibility decreases if the interval is too large or too small. Portnoff [Por81] notes that the duration of each sampling
interval should be at least as long as one pitch period (e.g., > 15ms),
but should also be shorter than the length of a phoneme. Although computationally
simple, such time-domain techniques introduce discontinuities at the interval
boundaries that are perceived as ``burbling'' distortion and general signal
degradation.
It has been noted that some form of windowing function or digital smoothing
at the junctions of the abutted segments will improve the audio quality.
The ``braided-speech'' method continually blended adjacent segments with
linear fades, rather than abutting segments [Que74].
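One simple way to smooth the junctions is a short linear cross-fade between the end of one retained segment and the start of the next. The sketch below illustrates that idea only; it is not a reconstruction of the braided-speech method, and the function name and the `fade` length (in samples) are assumptions.

```python
def crossfade_join(segments, fade):
    """Concatenate segments, blending `fade` samples at each junction.

    At every junction the last `fade` samples of the output so far are
    mixed with the first `fade` samples of the incoming segment using a
    linear ramp, so each junction shortens the result by `fade` samples.
    """
    if not segments:
        return []
    out = list(segments[0])
    for seg in segments[1:]:
        n = min(fade, len(out), len(seg))
        base = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming segment
            out[base + i] = (1 - w) * out[base + i] + w * seg[i]
        out.extend(seg[n:])
    return out
```

With `fade` set to 0 the function reduces to plain abutting, which makes it easy to compare the two treatments on the same retained segments.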
Lee describes two digital electronic implementations of the sampling technique
[Lee72], and discusses the problems of
discontinuities when segments are simply abutted.
One interesting variant of the sampling method (figure 2D)
is achieved by playing the standard sampled signal to one ear and the ``discarded''
material to the other ear (footnote-5)
([Sco67] summarized in [Orr71]).
Under this dichotic (footnote-6)
condition, intelligibility and comprehension increase. Most subjects also
prefer this technique to a diotic presentation of a conventionally sampled
signal. Listeners initially reported a switching of attention between ears,
but they quickly adjusted to this unusual sensation. Note that for compression
ratios up to 50%, the two signals to the ears contain common information.
For compressions greater than 50% some information is necessarily lost.
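For the case where the retained and discarded intervals are equal, the two ear signals can be produced with a sketch like the following. The function name and the choice of which ear receives which stream are arbitrary assumptions.

```python
def dichotic_split(samples, interval):
    """Split a signal into two streams of alternating intervals.

    The first stream holds the intervals a conventional sampler would
    retain; the second holds the intervals it would discard. Between
    them the two streams contain every input sample exactly once.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    retained, discarded = [], []
    for start in range(0, len(samples), 2 * interval):
        retained.extend(samples[start:start + interval])
        discarded.extend(samples[start + interval:start + 2 * interval])
    return retained, discarded
```

Playing `retained` to one ear and `discarded` to the other presents the whole recording in half the time.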
The basic sampling technique periodically removes pieces of the speech
waveform without regard to whether it contains any redundant speech information.
David and McDonald demonstrated a bandwidth reduction technique in 1956
that selectively removed (redundant) pitch periods from speech signals [DM56]. Scott applied the same ideas to
time compression, setting the sampling and discard intervals to be synchronous
with the pitch periods of the speech. Discontinuities in the time compressed
signal were reduced, and intelligibility increased [SG72].
Neuburg developed a similar technique in which intervals equal to the pitch
period were discarded (but not synchronous with the pitch pulses). Finding
the pitch pulses is hard, yet estimating the pitch period is much easier,
even in noisy speech [Neu78].
Since frequency-domain properties are expensive to compute, it has been
suggested that easy-to-extract time-domain features can be used to segment
speech into transitional and sustained segments. For example, simple amplitude
and zero crossing measurements for 10ms frames can be used to group adjacent
frames for similarity--redundant frames can then be selectively removed
[Que74]. Toong [Too74]
selectively deleted 50-90% of vowels, up to 50% of consonants and fricatives,
and up to 100% of silence. However, he found that complete elimination of
silences was undesirable (see also section 9.4).
The synchronized overlap add method (SOLA) first described by Roucos
and Wilgus [RW85] has recently become popular
in computer-based systems. It is a fast non-iterative optimization of a
Fourier-based algorithm described in [GL84].
``Of all time scale modification methods proposed, SOLA appears to be the
simplest computationally, and therefore most appropriate for real-time applications''
[WRW89]. Conceptually, the SOLA method consists
of shifting the beginning of a new speech segment over the end of the preceding
segment to find the point of highest cross-correlation. Once this point
is found, the frames are overlapped and averaged together, as in the sampling
method. This technique provides a locally optimal match between successive
frames (footnote-7); combining
the frames in this manner tends to preserve the time-dependent pitch, magnitude,
and phase of a signal. The shifts do not accumulate since the target position
of a window is independent of any previous shifts [Hej90].
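The conceptual description above can be sketched in a few lines. This is a simplified illustration rather than the algorithm of [RW85] or the refinements in [Hej90]: the frame size, hops, shift range, minimum overlap, normalized cross-correlation measure, and linear cross-fade are all assumed values and choices, and the pure-Python loops favor clarity over speed.

```python
import math


def sola(x, speed, frame=320, hop_out=160, max_shift=80, min_overlap=80):
    """Time-compress `x` by roughly `speed` with synchronized overlap-add.

    Frame m is read from x at m * hop_in and nominally placed in the
    output at m * hop_out. The placement is shifted by up to
    +/- max_shift samples to the point where the frame best matches
    the existing output tail, then the overlap is cross-faded.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    hop_in = int(round(hop_out * speed))
    y = list(x[:frame])
    m = 1
    while m * hop_in + frame <= len(x):
        seg = x[m * hop_in:m * hop_in + frame]
        target = m * hop_out  # independent of earlier shifts
        best_start, best_score = None, -math.inf
        for k in range(-max_shift, max_shift + 1):
            start = target + k
            n = len(y) - start  # overlap length for this shift
            if start < 0 or n < min_overlap or n > frame:
                continue
            num = e_y = e_s = 0.0
            for i in range(n):
                a, b = y[start + i], seg[i]
                num += a * b
                e_y += a * a
                e_s += b * b
            den = math.sqrt(e_y * e_s)
            score = num / den if den > 0 else 0.0
            if score > best_score:
                best_start, best_score = start, score
        if best_start is None:  # no admissible shift for these parameters
            break
        n = len(y) - best_start
        for i in range(n):
            w = (i + 1) / (n + 1)  # linear fade toward the new frame
            y[best_start + i] = (1 - w) * y[best_start + i] + w * seg[i]
        y.extend(seg[n:])
        m += 1
    return y
```

Because each target position is computed from the frame index alone, a shift chosen for one frame does not move the nominal position of the next, which matches the observation that the shifts do not accumulate.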
The SOLA method is simple and effective: it does not require pitch extraction,
frequency-domain calculations, or phase unwrapping, and it is non-iterative [ME86]. The SOLA technique can be considered a type
of selective sampling that effectively removes redundant pitch periods.
A windowing function can be used with this technique to smooth between segments,
producing significantly fewer artifacts than traditional sampling techniques.
Makhoul used both linear and raised cosine functions for averaging windows,
and found the simpler linear function sufficient [ME86].
The SOLA algorithm is robust in the presence of correlated or uncorrelated
noise, and can improve the signal to noise ratio of noisy speech [WW88, WRW89].
Several improvements to the SOLA method have been suggested that offer improved
computational efficiency, or increased robustness in compression/decompression
applications [ME86, WW88,
WRW89, Har90, Hej90]. Hejna, in particular, provides a detailed
description of SOLA, including an analysis of the interactions of various
parameters used in the algorithm.
In addition to the frequency domain methods outlined in this section, there are a variety of other frequency-based techniques that can be used for time compressing speech (e.g., [MQ86, QM86]).
Harmonic compression involves the use of a fine-tuned (typically analog)
filter bank. The energy outputs of the filters are used to drive filters
at half the frequency of the original. A tape of the output of this system
is then played on a tape recorder at twice normal speed. The compression
ratio of this frequency domain technique was fixed, and it was developed
before it was practical for digital computers to be used for
time-compression.
Malah describes time-domain harmonic scaling which requires pitch estimation,
is pitch synchronous, and can only accommodate certain compression ratios
[Mal79, Lim83].
A vocoder that maintains phase [Dol86]
can be used for time-compression. A phase vocoder can be interpreted as
a filterbank and thus is similar to the harmonic compressor. A phase vocoder
is, however, significantly more complex because calculations are done in
the frequency domain, and the phase of the original signal must be reconstructed.
Portnoff [Por81] developed a system for
time-scale modification of speech based on short-time Fourier analysis.
His system provided high quality compression of up to 33% while retaining
the natural quality and speaker-dependent features of the speech. The resulting
signals were free from artifacts such as glitches, burbles, and reverberations
typically found in time-domain methods of compression such as sampling.
Phase vocoding techniques are more accurate than time domain techniques,
but are an order of magnitude more computationally complex because Fourier
analysis is required. The phase vocoder is particularly good at slowing
speech down to hear features that cannot be heard at normal speed--such
features are typically lost using time domain techniques. Dolson says ``a
number of time-domain procedures can be employed at substantially less computational
expense. But from a standpoint of fidelity (i.e., the relative absence of
objectionable artifacts), the phase vocoder is by far the most desirable.''
The time-compression techniques described above can be mixed and matched in a variety of ways. Such combined methods can provide a variety of signal characteristics and a range of compression ratios.
Maxemchuk [Max80] found that eliminating every
other non-silent block (1/16th second) produced ``extremely choppy and virtually
unintelligible playback.'' Eliminating intervals with less energy than the
short-term average (and no more than one in a row), produced distorted but
intelligible speech. This technique produced compressions of 33 to 50 percent.
Maxemchuk says that this technique ``has the characteristic that those
words which the speaker considered to be most important and spoke louder
were virtually undistorted, whereas those words that were spoken softly
are shortened. After a few seconds of listening to this type of speech,
listeners appear to be able to infer the distorted words and obtain the meaning
of the message.'' He believes such a technique would be ``useful for users
of a message system to scan a large number of messages and determine which
they wish to listen to more carefully or for users of a dictation system
to scan a long document to determine the areas they wish to edit.''
Silence compression and sampling can be combined in several ways. Silences
can first be removed from a signal that is then sampled. Alternatively,
the output of a silence detector can be used to set boundaries for sampling,
producing a selective sampling technique. Note that using silences to find
discard intervals eliminates the need for a windowing function to smooth
(de-glitch) the sound at the boundaries of the sampled intervals.
On the surface it would appear that removing silences and time-compressing
speech using a technique such as the overlap-add method would be linearly
independent, and could thus be performed in either order. In practice there
are some minor differences, because the SOLA algorithm makes assumptions
about the properties of the speech signal. The Speech Research Group has
informally found a slight improvement in speech quality by applying the
SOLA algorithm before removing silences. Note that timing parameters must
be modified under these conditions. For example, with speech compressed 50%,
the silence removal timing thresholds must also be cut in half.
This combined technique is effective, and can produce a fast and dense speech
stream. Note that silence periods can be selectively retained or shortened,
rather than simply removed.
A sampled signal compressed by 50% can be presented dichotically so that exactly half the signal is presented to one ear, while the remainder of the signal is presented to the other ear. Generating such a lossless dichotic presentation is difficult with the SOLA method because the segments of speech are shifted relative to one another to find the point of maximum similarity. However, by choosing two starting points in the speech data carefully (based on the parameters used in the SOLA algorithm), it is possible to maximize the difference between the signals presented to the two ears. We have informally found this technique to be effective since it combines the high quality sounds produced with the SOLA algorithm with the binaural effect of the dichotic presentation.
There has been a significant amount of perceptual work performed in the areas of intelligibility and comprehension of time-compressed speech. Much of this research has been summarized in [BM76], [FS69], and [Fou71].
``Intelligibility'' usually refers to the ability to identify isolated
words. Depending on the type of experiment, such words may either be selected
from a closed set, or written down (or shadowed) by the subject from an
open-ended set. ``Comprehension'' refers to the understanding of the content
of the material. This is usually tested by asking questions about a passage
of recorded material.
In general, intelligibility is more resistant to degradation as a function
of time-compression than is comprehension [Ger74].
Early studies showed that single well-learned phonetically balanced words
could remain intelligible with a 10-15% compression (10 times normal speed),
while connected speech remains comprehensible to a 50% compression (twice
normal speed).
There are some practical limitations on the maximum amount that a speech
signal can be compressed. Portnoff notes that arbitrarily high compression
ratios are not physically reasonable. He considers, for example, a voiced
phoneme containing four pitch periods. Greater than 25% compression reduces
this phoneme to less than one pitch period, destroying its periodic character.
Thus, he expects high compression ratios to produce speech with a rough
quality and low intelligibility [Por81].
The ``dichotic advantage'' (section 6.2) is
maintained for compression ratios of up to 33%. For discard intervals between
40-70ms, dichotic intelligibility was consistently higher than diotic intelligibility
[GW77]. A dichotic discard interval
of 40-50ms was found to have the highest intelligibility (40ms was described
as the ``optimum interval'' in another study [Ger74].
Earlier studies suggest that a shorter interval of 18-25ms may be better
for diotic speech [BM76]).
Gerber showed that 50% compression presented diotically was significantly
better than 25% compression presented dichotically, even though the information
quantity of the presentations was the same. These and other data provide
conclusive evidence that 25% compression is too fast for the information
to be processed by the auditory system. The loss of intelligibility, however,
is not due to the loss of information because of the compression process
[Ger74].
Foulke [FS69] reported that
comprehension declines slowly up to a word rate of 275wpm, but more rapidly
beyond that point. The decline in comprehension was not attributable to
intelligibility alone, but was related to a processing overload of short-term
memory. Recent experiments with French have shown that intelligibility and
comprehension do not significantly decay until a high rate (300wpm) is reached
[RSLM88].
Note that in much of the literature the limiting factor that is often cited
is word rate, not compression ratios. The compression required to boost
the speech rate to 275 words per minute is both talker and context dependent
(e.g., read speech is typically faster than spontaneous speech).
Foulke and Sticht permitted sighted college students to select a preferred
degree of time-compression for speech spoken at an original rate of 175wpm.
The mean preferred compression was 82% corresponding to a word rate of 212wpm.
For blind subjects it was observed that 64-75% time-compression and word
rates of 236-275 words per minute were preferred. These data suggest that
blind subjects will trade increased effort in listening to speech for a
greater information rate and time savings [ZDS68].
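The conversion between word rate and compression can be sketched as below, again assuming that the compression percentage is the share of the original playing time that remains. The function names are illustrative.

```python
def compressed_word_rate(original_wpm: float, compression_percent: float) -> float:
    """Word rate after compression, in words per minute."""
    if not 0 < compression_percent <= 100:
        raise ValueError("compression percentage must be in (0, 100]")
    return original_wpm * 100.0 / compression_percent


def compression_for_rate(original_wpm: float, target_wpm: float) -> float:
    """Compression percentage needed to reach a target word rate."""
    if original_wpm <= 0 or target_wpm < original_wpm:
        raise ValueError("target rate must be at least the original rate")
    return 100.0 * original_wpm / target_wpm
```

For speech recorded at 175wpm, 82% compression gives about 213wpm, close to the reported 212wpm, and reaching 275wpm requires compression to about 64%. A slower talker needs more compression to reach the same rate, which is why the limit is talker dependent.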
Comprehension of interrupted speech (as in [ML50])
was good, probably because the temporal duration of the original speech
signal was preserved, providing ample time for subjects to attempt to process
each word [HLLB86].
Compression necessitates that each portion of speech be perceived in less
time than normal. However, each unit of speech is presented in a less redundant
context, so that more time per unit is required. Based on the large body
of work in compressed speech, Heiman et al. suggest that 50% compression removes
virtually all redundant information. With greater than 50% compression,
critical non-redundant information is also lost. They conclude that the
compression ratio rather than word rate is the crucial parameter, because
greater than 50% compression presents too little of the signal in too little
time for a sufficient number of words to be accurately perceived. They believe
that the 275 wpm rate is of little significance, but that compression and
its underlying temporal interruptions decrease word intelligibility, which
results in decreased comprehension.
As with other cognitive activities, such as listening to synthetic speech,
exposure to time-compressed speech increases both intelligibility and comprehension.
There is a novelty in listening to time-compressed speech for the first
time that is quickly overcome with experience.
Even naive listeners can tolerate compressions of up to 50%, and with 8-10
hours of training, substantially higher speeds are possible [OFW65].
Orr hypothesizes that ``the review of previously presented material could
be more efficiently accomplished by means of compressed speech; the entire
lecture, complete with the instructor's intonation and emphasis might be
re-presented at high speed as a review.'' Voor found that practice increased
comprehension of rapid speech, and that adaptation time was short (minutes
rather than hours) [VM65].
Beasley reports informally that following about 30 minutes of exposure
to compressed speech, listeners become uncomfortable if they are forced
to return to the normal rate of presentation [BM76].
He also reports on a controlled experiment extending over a six week period
that found subjects' listening rate preference shifted to faster rates after
exposure to compressed speech.
It may not be desirable to completely remove silences, as they often
provide important semantic and syntactic cues. With normal prosody, intelligibility
was higher for syntactic segmentation (inserting silences after major clause and sentence
boundaries) than for periodic segmentation (inserting silences after every eighth
word (footnote-8)) [WLS84].
Wingfield says that ``time restoration, especially at high compression ratios,
will facilitate intelligibility primarily to the extent that these presumed
processing intervals coincide with the linguistic structure of the speech
materials.''
In another experiment, subjects were allowed to stop time-compressed recordings
at any point, and were instructed to repeat what they had heard [WN80]. It was found that the average
reduction in selected segment duration was almost exactly proportional to
the increase in the speech rate. For example, the mean segment duration
for the normal speech was 3s, while the chosen segment duration of speech
compressed 60% was 1.7s. Wingfield found that ``while time and/or capacity
must clearly exist as limiting factors to a theoretical maximum segment
size which could be held [in short-term memory] for analysis, speech content
as defined by syntactic structure, is a better predictor of subjects' segmentation
intervals than either elapsed time or simple number of words per segment.
This latter finding is robust, with the listeners' relative use of the [syntactic]
boundaries remaining virtually unaffected by increasing speech rate.''
In the perception of normal speech, it has been found that pauses exerted
a considerable effect on the speed and accuracy with which sentences were
recalled, particularly under conditions of cognitive complexity [Rei80]. Pauses, however, are only useful
when they occur between clauses within sentences--pauses within clauses
are disrupting. When a 330ms pause was inserted ungrammatically, response
time for a particular task was increased by 2s. Pauses suggest the boundaries
of material to be analyzed, and provide vital cognitive processing time.
Maxemchuk found that eliminating silent intervals decreased playback time
of recorded speech with compression ratios of 50 to 75 percent depending
on the talker and material. In his system a 1/8 second pause is inserted
whenever a pause greater or equal to 1 second occurred in a message. This
appeared to be sufficient to prevent different ideas or sentences in the
recorded document from running together. This type of rate increase does
not affect the intelligibility of individual words within the active speech
regions [Max80].
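The pause rule described above can be sketched on a message that has already been divided into speech and silence segments. The segment representation, a list of (kind, duration in seconds) pairs, and the function names are assumptions made for this example and are not taken from [Max80].

```python
def compress_pauses(segments, long_pause=1.0, marker=0.125):
    """Remove silences, keeping a short marker pause for long ones.

    `segments` is a list of (kind, seconds) pairs where kind is
    "speech" or "silence". Silences of at least `long_pause` seconds
    are replaced by a `marker`-second pause; shorter ones are dropped.
    """
    out = []
    for kind, seconds in segments:
        if kind == "speech":
            out.append((kind, seconds))
        elif seconds >= long_pause:
            out.append(("silence", marker))
    return out


def total_duration(segments):
    """Total playing time of a segment list, in seconds."""
    return sum(seconds for _, seconds in segments)
```

Comparing `total_duration` before and after gives the compression obtained for a particular talker and message.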
Studies of pauses in speech also consider the duration of the ``non-pause''
or ``speech unit''. In one study of spontaneous speech, the mean speech
unit was 2.3 seconds. Minimum pause durations typically considered in the
literature range from 50-800ms, with the majority in the 250-500ms region.
As the minimum pause duration increases, the mean speech unit length increases
(e.g., for pauses of 200, 400, 600, and 800ms, the corresponding speech unit
lengths were 1.15, 1.79, 2.50, and 3.52s respectively). In another study,
it was found that inter-phrase pauses were longer and occurred less frequently
than intra-phrase pauses (data from several articles summarized in [Agn74]).
Hesitation pauses are not under the conscious control of the talker, and
average 200-250ms. Juncture pauses are under talker control, and average
500-1000ms. Several studies have shown that breath stops in oral reading
are about 400ms. In a study of the durational aspects of speech, it was
found that the silence and speech unit durations were longer for spontaneous
speech than for read speech, and that the overall word rate was slower.
The largest changes occurred in the durations of the silence intervals.
The greater number of long silence intervals was assumed to reflect the
tendency for speakers to hesitate more during spontaneous speech than during
oral reading [Min74].
Lass states that juncture pauses are important for comprehension, so they
cannot be eliminated or reduced without interfering with comprehension [LL77]. Theories about memory suggest large-capacity
rapid-decay sensory storage followed by limited capacity perceptual memory.
Studies have shown that increasing silence intervals between words increases
recall accuracy. Aaronson suggests that for a fixed amount of compression,
it may be optimal to delete more from the words than from the intervals
between the words. She states that ``English is so redundant that much of
the word can be eliminated without decreasing intelligibility, but the interword
intervals are needed for perceptual processing'' [AMS71].
This paper reviews a variety of techniques for time-compressing speech,
as well as related perceptual limits of intelligibility and comprehension.
The SOLA method is currently favored for real-time applications; however,
a digital version of the Fairbanks sampling method can easily be implemented
and produces fair speech quality with little computation.
Time-compressed speech has recently begun showing up in voice applications
and computer interfaces that use speech [WSB92].
Allowing the user to interactively change the speed at which speech is presented
is important in getting over the ``time bottleneck'' often associated with
voice interfaces. The techniques described in this paper can thus aid in
user acceptance of voice applications.
Though dated, the most readily accessible, and most often cited, reference is [FS69]. Another broad and more recent summary is [BM76]. An extensive anthology and bibliography [Duk74b] that contains copies and extracts of many earlier works is still in print.
Lisa Stifelman and Eric Hulteen provided comments on a draft of this
paper.
This work was sponsored by Apple Computer, Inc.