Hearing is the sense with which sound is detected and analysed. Psychoacoustics is concerned with the relationship between the physical characteristics of sound (e.g. intensity, physical location in space) and what is actually perceived by the listener (e.g. loudness, perceived position in space). It is also concerned with the ability to discriminate between different sounds. This section deals with basic aspects of hearing; for other aspects see Absolute pitch; Consonance, §2; Psychology of music, §II; and Sound.
1. Introduction.
2. Structure and function of the auditory system.
3. Absolute thresholds.
4. Masking and frequency analysis.
5. The perception of loudness.
6. Frequency discrimination and the perception of pitch.
7. Sound localization.
BRIAN C. J. MOORE
Sound usually originates from the vibration of an object. This vibration is impressed upon the surrounding medium (usually air) as a pattern of changes in pressure. The pressure changes are transmitted through the medium and may be heard as sound. Although any sound can be described in terms of sound pressure as a function of time (often called the waveform of the sound), it is often more meaningful to describe sound in a different way, based on a theorem by Fourier, who proved that any complex waveform can be analysed (or broken down) into a series of sinusoids. A sinusoid resembles the sound produced by a tuning-fork, and it is often called a simple tone or a pure tone. The analysis of a sound in this way is called Fourier analysis, and each sinusoid is called a (Fourier) ‘component’ of the complex sound. A plot of the magnitudes of the components as a function of frequency is called the ‘spectrum’ of the sound.
Many sounds produced by musical instruments are periodic, or almost periodic; the waveform repeats at regular time intervals and the repetition rate remains roughly constant over the duration of a musical note. Such sounds have a clear pitch. Other sounds, such as that of a snare drum, are aperiodic and noise-like. A periodic sound is composed of a number of sinusoids, each of which has a frequency that is an integer multiple of the frequency of a common (not necessarily present) fundamental component. The fundamental component has a frequency equal to the repetition rate of the complex waveform as a whole. The frequency components of the complex sound are known as harmonics and are numbered, the fundamental being given harmonic number 1. The nth harmonic has a frequency which is n times that of the fundamental. The relative magnitudes of the harmonics vary across different instruments. For example, the clarinet has a relatively weak 2nd harmonic and a strong 3rd harmonic.
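To make the Fourier description concrete, the following short sketch (in Python; all parameter values are illustrative, not drawn from the text) synthesizes a periodic waveform as a sum of harmonics of a 200 Hz fundamental and then recovers the component frequencies by Fourier analysis:

```python
import numpy as np

fs = 16000                       # sampling rate in Hz (illustrative)
f0 = 200                         # fundamental frequency in Hz
t = np.arange(0, 0.5, 1 / fs)    # 0.5 seconds of samples

# A periodic waveform built from harmonics 1-5 of f0; the relative
# amplitudes are arbitrary here, and differ between real instruments.
amplitudes = [1.0, 0.5, 0.8, 0.3, 0.2]
waveform = sum(a * np.sin(2 * np.pi * n * f0 * t)
               for n, a in enumerate(amplitudes, start=1))

# Fourier analysis: the magnitude spectrum has a peak at each harmonic.
spectrum = np.abs(np.fft.rfft(waveform))
freqs = np.fft.rfftfreq(len(waveform), 1 / fs)
print(freqs[spectrum > 0.1 * spectrum.max()])   # [200. 400. 600. 800. 1000.]
```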
One of the reasons for representing sounds in terms of their sinusoidal components is that the human auditory system performs a similar analysis. For example, two simultaneous sinusoids, whose frequencies are not too close, are usually heard as two separate tones each with its own pitch. The perceived timbre of steady tones is quite closely related to the spectrum (Plomp, 1976).
Because the auditory system can deal with a huge range of sound pressures, sound level or magnitude is usually expressed using a logarithmic measure known as the Decibel. Each 20 decibel increase in level corresponds to an increase in sound pressure by a factor of ten. For example, a 60 decibel increase corresponds to a 1000-fold increase in sound pressure. Normal conversation typically has a level of 65–70 decibels, while an orchestra playing fortissimo may produce sound levels of 110 decibels at seats close to the front. Musicians seated in front of the brass section in an orchestra may be exposed to sound levels up to 120 decibels, which can be damaging to the ear (see also Loudness).
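The decibel scale for sound pressure is defined by L = 20 log10(p/p0), where p0 is a reference pressure (20 micropascals for sound in air). A minimal sketch of the arithmetic used above:

```python
def pressure_ratio(level_difference_db):
    """Pressure ratio corresponding to a given level difference in decibels,
    from the definition: level difference = 20 * log10(pressure ratio)."""
    return 10 ** (level_difference_db / 20)

print(pressure_ratio(20))   # 10.0:   each 20 dB is a tenfold pressure increase
print(pressure_ratio(60))   # 1000.0: a 60 dB increase is a 1000-fold increase
```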
Fig.1 shows the structure of the peripheral part of the human auditory system. The outer ear is composed of the pinna and the auditory canal or meatus. Sound travels down the meatus and causes the eardrum, or tympanic membrane, to vibrate. These vibrations are transmitted through the middle ear by three small bones, the ossicles (malleus, incus and stapes) to a membrane-covered opening (the oval window) in the bony wall of the spiral-shaped structure of the inner ear, the cochlea.
The cochlea is divided along its length by the basilar membrane, which moves in response to sound. The response to sinusoidal stimulation takes the form of a travelling wave which moves along the membrane, with an amplitude that increases at first and then decreases rather abruptly. Fig.2 shows the instantaneous displacement of the basilar membrane for two successive instants in time, in response to a 200 Hz sinusoid. The line joining the amplitude peaks is called the envelope. The envelope shows a peak at a particular position on the basilar membrane.
The position of the peak in the envelope differs according to the frequency of stimulation. High-frequency sounds (around 15,000 Hz) produce a peak near the oval window, while low-frequency sounds (around 50 Hz) produce a peak towards the other end of the membrane (the apex). Intermediate frequencies produce peaks at intermediate places. Thus, each point on the basilar membrane is ‘tuned’ to a particular frequency. When a sound is composed of several sinusoids with different frequencies, each sinusoid produces a peak at its own characteristic place on the basilar membrane. In effect, the cochlea behaves like a Fourier analyser, although with a less than perfect frequency-analysing power.
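One widely used numerical approximation to this place-frequency map for the human cochlea is Greenwood's published function; it is not part of the text above, and the sketch below (with Greenwood's constants) should be read as illustrative rather than exact:

```python
def greenwood_frequency(x):
    """Approximate characteristic frequency (Hz) at proportional distance x
    along the basilar membrane, from apex (x = 0) to base (x = 1), using
    Greenwood's published constants for the human cochlea."""
    return 165.4 * (10 ** (2.1 * x) - 0.88)

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"x = {x:.2f}: about {greenwood_frequency(x):7.0f} Hz")
# The base (near the oval window) is tuned to high frequencies (~20 kHz)
# and the apex to low frequencies (~20 Hz), as described above.
```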
Recent measurements of basilar membrane vibration have shown that the membrane is much more selectively tuned than originally found by von Békésy (1960). The better the physiological condition of the membrane, the more selective is the tuning (Khanna and Leonard, 1982). In a normal, healthy ear, each point on the basilar membrane responds with high sensitivity to a limited range of frequencies; higher sound intensities are required to produce a response as the frequency is made higher or lower. This selective tuning and high sensitivity probably reflect an active process; that is, they do not result simply from the mechanical properties of the membrane and surrounding fluid, but depend on biological structures that actively influence the mechanics (Yates, 1995).
Lying above the basilar membrane is a second structure, the tectorial membrane. Between the two membranes are hair cells, which form part of a structure called the organ of Corti (fig.3). The hair cells are divided into two groups by an arch known as the tunnel of Corti. Those on the side of the arch closest to the outside of the cochlea are called outer hair cells, and are arranged in three rows in cats and up to five rows in humans. The hair cells on the other side of the arch form a single row and are called inner hair cells. There are about 25,000 outer and about 3500 inner hair cells. The tectorial membrane, which has a gelatinous structure, lies above the hairs. When the basilar membrane moves up and down, a shearing motion is created between the basilar membrane and the tectorial membrane. As a result, the hairs at the tops of the hair cells are displaced. This leads to excitation of the inner hair cells, which leads in turn to the generation of action potentials in the neurones, or nerve cells, of the auditory nerve. The action potentials are brief electrical ‘spikes’ or ‘impulses’ which travel along the nerve and carry information to the brain. The main role of the outer hair cells may be actively to influence the mechanics of the cochlea so as to produce high sensitivity and selective tuning (Yates, 1995).
Each neurone in the auditory nerve derives its activity from one or more hair cells lying at a particular place on the basilar membrane. Thus, the neurones are ‘tuned’. In addition, nerve firings tend to be phase-locked or synchronized to the time pattern of the stimulating waveform. A given neurone does not necessarily fire on every cycle of the stimulus but, when firings do occur, they occur at roughly the same point on the waveform each time. This phase-locking is lost at high frequencies, above around 5000 Hz.
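A toy simulation (all parameters illustrative) shows what phase-locking implies for the timing of spikes: a neurone fires on only a random subset of cycles, but always near the same phase, so the intervals between successive spikes cluster at integer multiples of the stimulus period:

```python
import numpy as np

rng = np.random.default_rng(0)
freq = 500.0                  # stimulus frequency in Hz (illustrative)
period = 1.0 / freq
n_cycles = 1000
p_fire = 0.3                  # probability of a spike on any given cycle

# Spikes occur on a random subset of cycles, near a fixed phase of each
# cycle; a little jitter models biological variability.
cycles = np.nonzero(rng.random(n_cycles) < p_fire)[0]
spikes = cycles * period + 0.1 * period + rng.normal(0, 0.02 * period, cycles.size)

# Inter-spike intervals fall close to whole numbers of stimulus periods.
intervals = np.diff(spikes)
print(np.round(intervals / period).astype(int)[:20])   # e.g. [4 1 2 1 3 ...]
```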
The absolute threshold of a sound is the minimum detectable level of that sound in the absence of any other external sounds. The sounds are usually delivered by a loudspeaker in a large anechoic chamber (a room whose walls are highly sound-absorbing). The measurement of sound level is made after the listener is removed from the sound field, at the point formerly occupied by the centre of the listener's head.
Fig.4 shows estimates of the absolute threshold of sound at various frequencies. The curve represents the average data from many young listeners with normal hearing. However, individual listeners may have thresholds as much as 20 decibels above or below the mean at a specific frequency and still be considered ‘normal’. Absolute sensitivity is greatest in the frequency range between 2 and 5 kHz, partly because of a broad resonance produced by the ear canal. This frequency range corresponds to the higher formant frequencies (resonances in the vocal tract) of speech sounds. The ‘singing formant’, a resonance in the vocal tract produced by singers to boost frequencies between 2 and 3 kHz, typically falls within this range as well (Sundberg, 1974).
Thresholds increase rapidly at very high and very low frequencies. This effect depends at least partly on the transmission characteristic of the middle ear. Transmission is most efficient for mid-range frequencies and drops off markedly for very low and very high frequencies (Rosowski, 1991). The highest audible frequency varies considerably with the age of the listener. Young children can often hear tones as high as 20 kHz, but for most adults the threshold rises rapidly above about 15 kHz. The loss of sensitivity with increasing age (presbyacusis) is much greater at high frequencies than at low, and the variability between different listeners is also greater at high frequencies.
There is no clear low-frequency limit to human hearing. However, sounds with frequencies below about 16 Hz are not heard in the normal sense, but are detected by virtue of the distortion products (harmonics) that they produce after passing through the middle ear. In addition, very intense low-frequency tones can sometimes be felt as vibration before they are heard. The low-frequency limit for the ‘true’ hearing of pure tones probably lies at about 16 Hz. This is close to the lowest frequency that evokes a pitch sensation.
The auditory system acts as a limited-resolution frequency analyser; complex sounds are broken down into their sinusoidal components. This analysis almost certainly depends mainly on the tuning observed on the basilar membrane. Largely as a consequence of this analysis, we are able to hear one sound in the presence of another sound with a different frequency. This ability is known as frequency selectivity, or frequency resolution. Frequency selectivity plays a role in many aspects of auditory perception, including pitch, timbre and loudness.
Important sounds are sometimes rendered inaudible by other sounds, a process known as ‘masking’. Masking may be considered as a failure of frequency selectivity, and it can be used as a tool to measure the frequency selectivity of the ear. One theory of masking assumes that the auditory system contains a bank of overlapping band-pass filters (Fletcher, 1940; Patterson and Moore, 1986). Each of these ‘auditory filters’ is assumed to respond to a limited range of frequencies. In the simple case of a sinusoidal signal presented in a background noise, it is assumed that the listener detects the signal using the filter whose output has the highest signal-to-masker ratio. The signal is detected if that ratio exceeds a certain value. In most situations, the filter involved has a centre frequency close to that of the signal.
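In this ‘power-spectrum model’ of masking, a sinusoidal signal in noise is detected when its power, divided by the masker power passing through the auditory filter, exceeds some criterion. A minimal sketch, in which the criterion value and the filter bandwidth are placeholders to be supplied:

```python
def signal_detectable(signal_power, noise_spectrum_level, filter_bandwidth_hz,
                      criterion=1.0):
    """Power-spectrum model of masking: the effective masker power is the
    noise power per Hz times the bandwidth of the auditory filter centred
    on the signal; the signal is detected when the ratio exceeds criterion."""
    masker_power = noise_spectrum_level * filter_bandwidth_hz
    return signal_power / masker_power > criterion
```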
A good deal of work has been directed towards determining the characteristics of the auditory filters (see Moore, 4/1997). One way of characterizing a filter is in terms of the range of frequencies to which it responds most strongly. This range is referred to as the ‘bandwidth’. The bandwidth of an auditory filter estimated from masking experiments is often called the ‘critical bandwidth’ (Fletcher, 1940; Zwicker, 1961), although more recently the term ‘equivalent rectangular bandwidth’ has been used (Moore and Glasberg, 1983; Glasberg and Moore, 1990). This is defined as the bandwidth of a rectangular filter that has the same peak transmission and passes the same total power of white noise (a sound containing equal energy at all frequencies). When we listen to a complex sound containing many partials, an individual partial can be ‘heard out’ (perceived as a separate tone) when it is separated from neighbouring partials by a little more than one equivalent rectangular bandwidth (Moore and Ohgushi, 1993). For harmonic complex tones, this means that only the lower harmonics (up to the 5th to 8th) can be heard out (Plomp, 1964).
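The equivalent rectangular bandwidth for young listeners with normal hearing is commonly approximated by the formula of Glasberg and Moore (1990), ERB = 24.7(4.37F + 1), with F in kHz. The sketch below combines that formula with the ‘a little more than one bandwidth’ rule described above (the factor 1.25 is an illustrative reading of ‘a little more’) to estimate which harmonics of a 200 Hz tone can be heard out:

```python
def erb(f_hz):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter centred
    on f_hz, using the Glasberg and Moore (1990) approximation."""
    return 24.7 * (4.37 * f_hz / 1000 + 1)

f0 = 200.0          # fundamental of an illustrative harmonic complex tone
criterion = 1.25    # 'a little more than one' bandwidth (assumed value)

n = 1
while f0 >= criterion * erb(n * f0):   # spacing of adjacent harmonics is f0
    n += 1
print(f"Harmonics 1 to {n - 1} of a {f0:.0f} Hz tone can be heard out")
# The result falls in the 5th-8th-harmonic region, as the text describes.
```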
The Loudness of a given sound generally increases with increasing physical intensity. However, two sounds with the same intensity may appear very different in loudness, since loudness is also affected strongly by the spectrum of the sounds. It is useful to have a scale that allows one to compare the loudness of different sounds. A first step towards this is to construct equal-loudness contours for sinusoids of different frequencies. Say, for example, we take a standard tone of 1 kHz at a level of 40 decibels, and ask the listener to adjust the level of a second tone (say, 2 kHz) so that it sounds equally loud. If we repeat this for many different frequencies of the second tone, then the sound level required, plotted as a function of frequency, maps out an equal-loudness contour. If we repeat this procedure for different levels of the 1 kHz standard tone, then we will map out a family of equal-loudness contours (fig.5). Note that the contours resemble the absolute threshold curve (lowest curve in the figure) at low levels, but tend to become flatter at high levels. As a result, the relative loudness of different frequencies can change with overall sound level. For example, a 100 Hz tone at 40 decibels would sound quieter than a 1000 Hz tone at 30 decibels. However, if both tones were increased in level by 60 decibels, the 100 Hz tone at 100 decibels would sound louder than the 1000 Hz tone at 90 decibels.
The subjective loudness of a sound is not directly proportional to its physical intensity. For sound levels above about 40 decibels, the loudness roughly doubles when the intensity is increased by a factor of ten, which is equivalent to adding 10 decibels (Stevens, 1957). This property of the ear has important implications for the perception of musical sounds. For example, ten violins each playing with the same intensity will sound only twice as loud as a single violin, and 100 violins will sound only four times as loud as a single violin.
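This relationship is often summarized by Stevens's power law, with loudness proportional to intensity raised to a power of about 0.3 (the exponent is the commonly quoted value, not stated above). A quick check of the violin example:

```python
def loudness_ratio(n_sources, exponent=0.3):
    """Loudness of n equal, independent sources relative to one source,
    assuming Stevens's power law: loudness ~ intensity ** exponent.
    (n incoherent sources produce n times the intensity of one source.)"""
    return n_sources ** exponent

print(round(loudness_ratio(10), 2))    # ~2.0:  ten violins, twice as loud
print(round(loudness_ratio(100), 2))   # ~3.98: a hundred violins, four times
```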
Pitch is defined as the attribute of auditory sensation in terms of which sounds may be ordered on a musical scale, that is, the attribute in which variations constitute melody (see Pitch). For sinusoids (pure tones) the pitch is largely determined by the frequency: the higher the frequency, the higher the pitch. One of the classic debates in hearing theory concerns the mechanisms underlying the perception of pitch. One theory, called the ‘place’ theory, suggests that pitch is related to the position of maximum vibration on the basilar membrane, which is coded in terms of the relative activity of neurones tuned to different frequencies. The alternative theory, the ‘temporal’ theory, suggests that pitch is determined by the time pattern of neural spikes (phase-locking).
One major fact that these theories have to account for is our remarkably fine acuity in detecting frequency changes. This ability is called frequency discrimination and is not to be confused with frequency selectivity. Some results of measurements of this ability, for sinusoids of various frequencies and durations, are shown in fig.7. For two tones presented successively and lasting 500 milliseconds, a difference of about 3 Hz (or less in trained subjects) can be detected at a centre frequency of 1 kHz. It has been suggested that tuning-curves (or auditory filters) are not sufficiently sharp to account for this acuity in terms of the place theory (Moore and Glasberg, 1986). A further difficulty for the place theory is that frequency discrimination worsens abruptly above 4 or 5 kHz (Moore, 1973). Neither neural measures of frequency selectivity (such as tuning-curves) nor psychoacoustical measures of frequency selectivity (such as auditory filters) show any abrupt change there.
These facts can be explained by assuming that temporal mechanisms are dominant at frequencies below 4–5 kHz. The worsening of performance above this region corresponds well with the frequency at which the temporal information ceases to be available. Studies of our perception of musical intervals also indicate a change in mechanism around 4–5 kHz (Ward, 1954). Below this, a sequence of pure tones with appropriate frequencies conveys a clear sense of melody. Above this, the sense of musical interval and of melody is lost, although the changes in frequency may still be heard. The important frequencies for the perception of music and speech lie in the frequency range where temporal information is available.
When we listen to a complex tone, such as that produced by a musical instrument or a singer, the pitch usually corresponds to the fundamental component. However, the same pitch is heard when the fundamental component is weak or completely absent, an effect called ‘the phenomenon of the missing fundamental’. It appears that the perceived pitch is somehow constructed in the brain from the harmonics above the fundamental (Moore, 4/1997).
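A simple way to see how a pitch at the missing fundamental could be extracted from temporal information is autocorrelation: a waveform containing only upper harmonics still repeats at the period of the fundamental. The toy sketch below (illustrative only, not the mechanism established in the literature cited) recovers 200 Hz from a tone containing harmonics 3–5 alone:

```python
import numpy as np

fs, f0 = 16000, 200.0
t = np.arange(0, 0.1, 1 / fs)

# A complex tone with harmonics 3, 4 and 5 only; no 200 Hz component.
tone = sum(np.sin(2 * np.pi * n * f0 * t) for n in (3, 4, 5))

# The autocorrelation peaks at the repetition period of the waveform
# (1/200 s), even though the fundamental itself is absent.
ac = np.correlate(tone, tone, mode='full')[len(tone) - 1:]
min_lag = int(fs / 1500)              # skip lags shorter than about 1/1500 s
peak_lag = min_lag + np.argmax(ac[min_lag:])
print(fs / peak_lag)                  # 200.0, the 'missing fundamental'
```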
Two major cues for sound localization are differences in the time of arrival and differences in intensity at the two ears. For example, a sound coming from the left will arrive first at the left ear and be more intense in the left ear. For steady sinusoidal stimulation, differences in time of arrival can be detected and used to judge location only for frequencies below about 1500 Hz. At low frequencies, very small changes in relative time of arrival at the two ears can be detected, of about 10–20 millionths of a second, which is equivalent to a lateral movement of the sound source of one to two degrees.
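A feel for these numbers can be obtained from Woodworth's classic spherical-head approximation for the interaural time difference (the head radius and speed of sound below are typical assumed values, not taken from the text):

```python
import math

def itd_seconds(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Interaural time difference for a distant source at the given azimuth,
    using Woodworth's spherical-head formula: ITD = (r/c)(theta + sin theta)."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound) * (theta + math.sin(theta))

print(f"{itd_seconds(1) * 1e6:.0f} microseconds")   # ~9 us for 1 degree
print(f"{itd_seconds(2) * 1e6:.0f} microseconds")   # ~18 us for 2 degrees
```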
Intensity differences between the two ears are primarily useful at high frequencies. This is because low frequencies bend or diffract around the head, so that there is little difference in intensity at the two ears whatever the location of the sound source. At high frequencies the head casts more of an acoustic ‘shadow’, and above 2–3 kHz the intensity differences are sufficient to provide useful cues. For complex sounds, containing a range of frequencies, the difference in spectral patterning at the two ears may also be important.
Binaural cues are not sufficient to account for all of our localization abilities. For example, a difference in time or intensity will not define whether a sound is coming from in front or behind, or above or below, but people can clearly make such judgments. The extra information is provided by the pinnae (Grantham, 1995; see fig.1 above). The spectra of sounds entering the ear are modified by the pinnae in a way that depends on the direction of the sound source. This direction-dependent filtering provides cues for sound-source location. The cues occur mainly at high frequencies, above about 6 kHz. The pinnae are important not only for localization, but also for judging whether a sound comes from within the head or from the outside world. A sound is judged as coming from outside only if the spectral transformations characteristic of the pinnae are imposed on it. Thus, sounds heard through headphones are normally judged as being inside the head; the pinnae do not have their normal effect when headphones are worn. However, sounds delivered by headphones can be made to appear to come from outside the head if the signals delivered to the headphones are pre-recorded on a dummy head or synthetically processed (filtered) so as to mimic the normal action of the pinnae. Such processing can also create the impression of a sound coming from any desired direction in space.
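In signal-processing terms, the headphone externalization described above amounts to convolving the source signal with a pair of direction-dependent impulse responses. A minimal sketch, using toy impulse responses in place of measured ones (a real pair, recorded on a dummy head for the desired direction, would also encode the pinna filtering):

```python
import numpy as np

def spatialize(mono, hrir_left, hrir_right):
    """Filter a mono signal with a pair of head-related impulse responses,
    giving a two-channel signal for headphone presentation."""
    return np.stack([np.convolve(mono, hrir_left),
                     np.convolve(mono, hrir_right)])

fs = 44100
rng = np.random.default_rng(1)
source = rng.standard_normal(fs // 10)     # 0.1 s of noise as a test source

# Toy impulse responses: a pure interaural delay and attenuation only.
hrir_left = np.zeros(64)
hrir_left[0] = 1.0
hrir_right = np.zeros(64)
hrir_right[30] = 0.5     # ~0.68 ms later (near the maximum interaural delay)
stereo = spatialize(source, hrir_left, hrir_right)   # shape: (2, n_samples)
```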
H. Fletcher: ‘Auditory Patterns’, Reviews of Modern Physics, xii (1940), 47–65
G. von Békésy: ‘The Variations of Phase along the Basilar Membrane with Sinusoidal Vibrations’, JASA, xix (1947), 452–60
W.D. Ward: ‘Subjective Musical Pitch’, JASA, xxvi (1954), 369–80
S.S. Stevens: ‘On the Psychophysical Law’, Psychological Review, lxiv (1957), 153–81
G. von Békésy: Experiments in Hearing (Eng. trans., New York, 1960)
E. Zwicker: ‘Subdivision of the Audible Frequency Range into Critical Bands (Frequenzgruppen)’, JASA, xxxiii (1961), 248 only [letter to editor]
R. Plomp: ‘The Ear as a Frequency Analyzer’, JASA, xxxvi (1964), 1628–36
B.C.J. Moore: ‘Frequency Difference Limens for Short-Duration Tones’, JASA, liv (1973), 610–19
J. Sundberg: ‘Articulatory Interpretation of the “Singing Formant”’, JASA, lv (1974), 838–44
R. Plomp: Aspects of Tone Sensation (London, 1976)
S.M. Khanna and D.G.B. Leonard: ‘Basilar Membrane Tuning in the Cat Cochlea’, Science, ccxv (1982), 305–6
B.C.J. Moore and B.R. Glasberg: ‘Suggested Formulae for Calculating Auditory-Filter Bandwidths and Excitation Patterns’, JASA, lxxiv (1983), 750–53
B.C.J. Moore and B.R. Glasberg: ‘The Role of Frequency Selectivity in the Perception of Loudness, Pitch and Time’, Frequency Selectivity in Hearing, ed. B.C.J. Moore (London, 1986), 251–308
R.D. Patterson and B.C.J. Moore: ‘Auditory Filters and Excitation Patterns as Representations of Frequency Resolution’, Frequency Selectivity in Hearing, ed. B.C.J. Moore (London, 1986), 123–77
L. Robles, M.A. Ruggero and N.C. Rich: ‘Basilar Membrane Mechanics at the Base of the Chinchilla Cochlea I: Input-Output Functions, Tuning Curves, and Response Phases’, JASA, lxxx (1986), 1364–74
A.R. Palmer: ‘Physiology of the Cochlear Nerve and Cochlear Nucleus’, Hearing, ed. M.P. Haggard and E.F. Evans (Edinburgh, 1987), 838–55
B.R. Glasberg and B.C.J. Moore: ‘Derivation of Auditory Filter Shapes from Notched-Noise Data’, Hearing Research, xlvii (1990), 103–38
J.J. Rosowski: ‘The Effects of External and Middle-Ear Filtering on Auditory Threshold and Noise-Induced Hearing Loss’, JASA, xc (1991), 124–35
B.C.J. Moore and K. Ohgushi: ‘Audibility of Partials in Inharmonic Complex Tones’, JASA, xciii (1993), 452–61
D.W. Grantham: ‘Spatial Hearing and Related Phenomena’, Hearing, ed. B.C.J. Moore (San Diego, 1995), 297–345
G.K. Yates: ‘Cochlear Structure and Function’, Hearing, ed. B.C.J. Moore (San Diego, 1995), 41–73
Acoustics: Reference Zero for the Calibration of Audiometric Equipment, Part 7: Reference Threshold of Hearing Under Free-Field and Diffuse-Field Listening Conditions (Geneva, 1996) [ISO 389–7]
B.C.J. Moore: An Introduction to the Psychology of Hearing (San Diego, 4/1997) [orig. pubd, Baltimore and London, 1977]