Speech Gestures: The new normal(isation)

Speech, we are told, is a messy beast.  Hockett’s Easter egg analogy (1955) is perhaps the most drastically pessimistic, and the most telling, metaphor for the received wisdom about the phonologist’s struggle.

Imagine a row of differently coloured Easter eggs carried along a conveyor belt, only to be smashed to bits by a mechanical hammer, leaving debris in its wake.  With a mess of albumen, yolk and smashed shell intermingled from neighbouring eggs, it’s hard to see how we’d begin to put them back together again, let alone work out what they were supposed to look like in the first place.

Easter egg tradition

The idealised mental units of intended speech tokens, in this view, must contend with the interfering, material medium of the vocal tract.  In the quest for meaning and regularity in the speech signal, all we seem to find is the wreckage of these mental tokens’ passage into the world.  We find acoustic smearing by coarticulatory effects, making it seemingly impossible to know where one sound ends and the next begins, and we find acoustic properties to be infuriatingly contingent on the physical properties of each speaker’s vocal tract, rather than simply their linguistic identity.

It seems that all we’re left with is the ingredients for an omelette we never really wanted to make.

In the first of two posts, I’m going to look at the second of these issues, speaker-dependence; coarticulation will get its own treatment in the second post.

Speaker Normalisation

Vowels and their acoustic identities are a paradigm case for the problem at hand.  The speech signal carries signature properties for vowels called formants: the effect of the resonant frequencies of the vocal tract on the periodic glottal pulse from the vocal folds.  The first formant varies relatively systematically with tongue height and the second with tongue backness, and the majority of vowels can be described in terms of these two parameters (Fant, 1960)1.
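As a rough illustration of where formants come from (a sketch of the textbook idealisation, not anything specific to this post): a neutral vocal tract can be approximated as a uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/4L.  The tract length of 17.5 cm and speed of sound of 35,000 cm/s below are the conventional round-number assumptions.

```python
def tube_formants(length_cm, n_formants=3, c=35000.0):
    """Resonances (Hz) of a uniform tube closed at one end, open at the
    other: odd multiples of c / 4L (the 'neutral vowel' approximation)."""
    return [(2 * n - 1) * c / (4 * length_cm) for n in range(1, n_formants + 1)]

# A 17.5 cm tract gives the classic schwa-like values of roughly
# 500, 1500 and 2500 Hz for the first three formants.
print(tube_formants(17.5))  # [500.0, 1500.0, 2500.0]
```

Real vowels, of course, are produced by constricting this tube in different places, which shifts the resonances away from these neutral values — that is what makes F1 and F2 informative about tongue position in the first place.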

While these properties covary somewhat reliably within an individual, every vocal tract is different.  Individual variation in shape and size affects both the resonant frequencies of the tract and the fundamental frequency of an individual’s speech, and thus the harmonic resonances of voicing.  These indicators can be relied upon fairly robustly within an individual, but this view predicts problems for a listener, for whom the acoustic properties specifying a vowel vary with speaker context.  The ambiguity is demonstrated in the following graph, where vowel clusters overlap substantially in this space.  Despite this, all of these tokens are robustly recognised as the intended vowels.


Vowel clusters – Peterson & Barney (1952)

Quest for mechanisms

Since we are in fact very good at recognising phonemes across speaker contexts – from young ages, at that (e.g. Grieser & Kuhl, 1989) – the solution has been assumed to lie in the search for a normalisation mechanism.  This would operate on formants to factor out individual variation, and the resulting representation would yield access to vowel identity.

Most theories propose an online mechanism that transforms F1-F2 space into speaker-independent units (see Johnson, 2005).  These transformations have ranged from formant ratio normalisation – drawing an analogy to chord recognition independent of pitch (Potter & Steinberg, 1950) – to modulation of values by F0 and F3.  However, these rarely perform better than the statistical standardisation techniques used in sociophonetic research as a practical measure to reduce between-speaker scatter (Adank et al., 2004), rather than as putative cognitive mechanisms.
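One such statistical standardisation, among those compared by Adank et al., is z-score (Lobanov-style) normalisation: each speaker’s formant values are centred and scaled by that speaker’s own mean and standard deviation.  A minimal sketch – the F1 values below are invented for illustration, with one speaker’s values simply a scaled copy of the other’s:

```python
import statistics

def z_normalise(formant_values):
    """Z-score one speaker's values for a single formant: subtract that
    speaker's mean, divide by their standard deviation."""
    mu = statistics.mean(formant_values)
    sd = statistics.stdev(formant_values)
    return [(f - mu) / sd for f in formant_values]

# Invented F1 values (Hz) for the same three vowels from two speakers.
speaker_a = [300, 500, 700]
speaker_b = [390, 650, 910]   # same vowels, each value scaled by 1.3

# Absolute values differ, but the normalised values coincide.
print(z_normalise(speaker_a))  # [-1.0, 0.0, 1.0]
print(z_normalise(speaker_b))  # [-1.0, 0.0, 1.0]
```

The catch, as the post notes, is that this is a corpus statistician’s tool: it needs a sample of each speaker’s vowels before it can normalise anything, which is an awkward fit for a listener meeting a new voice.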

Perception of gestures

According to the direct realist theory of speech perception (e.g. Fowler, 1986), listeners do not perceive the acoustic properties of speech sounds, but the gestures that structure the acoustic medium.  On this view, the lawful mapping from articulatory dynamics to acoustic kinematics means that it is the physical gestures that constrain the available acoustics over time.  While this means that the acoustic medium likely specifies the gesture, the specification could take the form of a higher-level acoustic property, rather than the supposed acoustic primitives of F1 and F2.  While F1 and F2 indeed covary with tongue positioning in the tract, they clearly do not uniquely specify it, i.e. they are not an invariant property.

An analogous case exists in visual perception, in which a triangle is still perceived as a triangle, independent of the absolute lengths of its sides.  Critically, though, the side lengths still covary with the perception, but only in relative terms.  For example, a right-angled triangle is still defined by the relative lengths of its sides, such that the square of the hypotenuse equals the sum of the squares of the other two sides.
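The scale-invariance of this definition is easy to make concrete – the Pythagorean relation holds, or fails, regardless of the absolute side lengths:

```python
import math

def is_right_triangle(a, b, c):
    """Pythagoras: right-angled iff the square of the longest side
    equals the sum of the squares of the other two."""
    a, b, c = sorted((a, b, c))
    return math.isclose(a**2 + b**2, c**2)

print(is_right_triangle(3, 4, 5))     # True
print(is_right_triangle(30, 40, 50))  # True: same shape, different scale
print(is_right_triangle(3, 4, 6))     # False: relation broken
```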

With this in mind, it is clear that the formant ratio approaches were trying to capture a similar property, in which the relative formant values are preserved as a possible invariant.  However, the direct realist (or ecological) approach to perception crucially reinforces the notion that the structure in the sensory array specifies its source, i.e. the event in the world that gave rise to it – in this case, gestures of the vocal tract.  Rather than applying abstract methods of differentiation to the sensory array in search of invariant identities, we should start with the source identity – speech gestures – and see whether and how they structure the array in a way that could support identification.  The latter problem is much more constrained than the former.

Something akin to this has been followed in attempts at vocal tract normalisation, since the length of the vocal tract is known to have a dramatic effect on a speaker’s formant values.  Monahan & Idsardi’s (2010) approach explores the possibility that listeners exploit the third formant, since it is a relatively reliable indicator of vocal tract length, and there is EEG evidence that we are sensitive to the ratios between F3 and the first two formants.  This is successful to some extent; however, it sits comfortably in the vein of mechanistic transformations on basic primitives.  No such mechanism is needed if we accept that our perceptual systems evolved to perceive events, rather than abstract physical parameters of the world.
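To see why F3 looks attractive as a length cue, note that in the uniform-tube idealisation from above every resonance scales inversely with tract length, so ratios like F1/F3 and F2/F3 come out length-invariant.  A sketch under that assumption (idealised tube values, not real speech):

```python
def tube_formants(length_cm, n_formants=3, c=35000.0):
    # Uniform-tube resonances: odd multiples of c / 4L.
    return [(2 * n - 1) * c / (4 * length_cm) for n in range(1, n_formants + 1)]

def f3_ratios(formants):
    """Ratios of the first two formants to F3."""
    f1, f2, f3 = formants
    return (f1 / f3, f2 / f3)

long_tract  = tube_formants(17.5)   # longer tract: lower formants
short_tract = tube_formants(14.0)   # shorter tract: all formants shift up

# Raw formants differ between the tracts, but under pure uniform
# scaling the F3 ratios coincide exactly.
print(f3_ratios(long_tract))   # (0.2, 0.6)
print(f3_ratios(short_tract))  # (0.2, 0.6)
```

The invariance is exact only because a uniform tube rescales all resonances by the same factor – which is precisely the idealisation the next paragraph argues real vocal tracts violate.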

The reason for caution regarding this type of mechanism is that it requires F3 to specify vocal tract length uniquely.  While F3 does covary somewhat with length, the relationship is not a simple identity, and F3 is not the only property affecting vowel identity: it also alters with lip rounding, and vowel category boundaries are weakly sensitive to changes in formants from F3 to F5.  Animals have also demonstrated categorical perception of speech sounds, speaking against the primacy of a specialised human speech mechanism (e.g. Kuhl & Miller, 1978).

This falls into the familiar trap of observing a correlative relationship and presuming causation.  It is indeed likely that vocal tract length, and its impact on F3 among other things, forms a necessary component of the information supporting gesture identity; it is another thing to propose F3 as a modulating cue participating in a transformation of vowel space.  It is also arbitrary to propose that the normalisation mechanism involves a simple ratio relationship between resonant frequencies and vocal tract length, unless modelling of airflow through the vocal tract supports such a transformation of these parameters.

Where do we go from here?

The solution begged by the direct realist account is unfortunately trickier than simple transformation hacks on F1-F2 vowel space.  It dictates that we look for acoustic invariants that uniquely specify the actions of the vocal tract.  Luckily, because the acoustics are constrained by articulation, the search can start from the gestural invariant and work outwards, using techniques such as articulatory synthesis from detailed models of the vocal tract and its simulated interaction with airflow in structuring the sound (e.g. Browman & Goldstein, 1986).  This approach can account for animal sensitivity to speech sounds and for the effects of higher formant manipulation on vowel perception: a simple lawful causal relationship between gestural identity and movement at any frequency can still be picked up and exploited by a generalised perceptual mechanism.  On a cue-based account, by contrast, sensitivity to higher formants appears odd, since these primitives by themselves are rarely uniquely or robustly supportive of vowel identity, making them poor candidates for a mechanistic transformative process.

More will be explored about defining speech sounds as gestural invariants in the next post on coarticulation.

1Lip rounding and rhoticity also affect the second and third formants, but these will not be explored here.


Adank, P., Smits, R., & Van Hout, R. (2004). A comparison of vowel normalization procedures for language variation research. The Journal of the Acoustical Society of America, 116(5), 3099-3107.

Browman, C. P., & Goldstein, L. M. (1986). Towards an articulatory phonology. Phonology, 3(1), 219-252.

Fant, G. (1960). Acoustic theory of speech production. Mouton.

Fowler, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14(1), 3-28.

Grieser, D., & Kuhl, P. K. (1989). Categorization of speech by infants: Support for speech-sound prototypes. Developmental Psychology, 25(4), 577.

Hockett, C. F. (1955). A manual of phonology (No. 11). Waverly Press.

Johnson, K. A. (2005). Speaker normalization in speech perception. In D. B. Pisoni & R. E. Remez (Eds.), Handbook of speech perception (pp. 363–389). Oxford: Blackwell.

Kuhl, P. K., & Miller, J. D. (1978). Speech perception by the chinchilla: Identification functions for synthetic VOT stimuli. The Journal of the Acoustical Society of America, 63(3), 905-917.

Monahan, P. J., & Idsardi, W. J. (2010). Auditory sensitivity to formant ratios: Toward an account of vowel normalisation. Language and Cognitive Processes, 25(6), 808-839.

Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2), 175-184.